01_GettingStarted_SparkNLP(Scala)

Spark-NLP

Getting started in Databricks

How to use the Spark-NLP library in Databricks

1- Right-click the Workspace folder where you want to store the library.

2- Select Create > Library.

3- Select where you would like to create the library in the Workspace, and open the Create Library dialog.

4- From the Source drop-down menu, select Maven Coordinate.

5- Now all available Spark packages are at your fingertips! Just search for JohnSnowLabs:spark-nlp:version, where version is the library version, such as 1.8.3 or 2.0.0.

6- Select the spark-nlp package and we are good to go!

More info about using third-party libraries is available in the Databricks documentation.
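Once the cluster restarts with the library attached, a quick sanity check is to import the package in a notebook cell. A minimal sketch, assuming the SparkNLP helper object shipped with 2.x releases:

// If this cell runs, spark-nlp is on the cluster's classpath
import com.johnsnowlabs.nlp.SparkNLP
SparkNLP.version()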

// Spark-NLP has been tested against Apache Spark 2.4.0
// (not yet verified on 2.4.3)
spark.version
res0: String = 2.4.0

The library comes with several artifacts:

spark-nlp-2.0.7

Artifacts

tensorflow-1.12.0.jar
greex-1.0.jar
automaton-1.11-8.jar
protobuf-java-3.0.0-beta-3.jar
joda-time-2.10.2.jar
libtensorflow-1.12.0.jar
jsr305-3.0.1.jar
aws-java-sdk-1.7.4.jar
config-1.3.0.jar
httpclient-4.2.jar
libtensorflow_jni-1.12.0.jar
annotations-3.0.1.jar
commons-logging-1.1.1.jar
gson-2.3.jar
spark-nlp-2.0.7.jar
jcip-annotations-1.0.jar
slf4j-api-1.7.21.jar
httpcore-4.2.jar
rocksdbjni-5.17.2.jar
liblevenshtein-3.0.0.jar
protobuf-java-util-3.0.0-beta-3.jar
lombok-1.16.8.jar
trove4j-3.0.3.jar
fastutil-7.0.12.jar
// Let's import Spark-NLP 
import com.johnsnowlabs.nlp.base._
import com.johnsnowlabs.nlp.annotator._

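These wildcard imports bring in the core building blocks: the DocumentAssembler plus annotators such as Tokenizer. As a minimal sketch of what they enable, here is a tiny hand-built pipeline (assuming the standard Spark NLP 2.x Scala API together with Spark ML's Pipeline):

import org.apache.spark.ml.Pipeline

// Convert raw text into Spark NLP's internal document annotation
val documentAssembler = new DocumentAssembler()
  .setInputCol("text")
  .setOutputCol("document")

// Split each document into tokens
val tokenizer = new Tokenizer()
  .setInputCols(Array("document"))
  .setOutputCol("token")

// Chain the stages into a regular Spark ML pipeline
val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer))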
// Let's see a basic usage of pretrained pipelines
//import com.johnsnowlabs.nlp.pretrained.pipelines.en.BasicPipeline
import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline

PretrainedPipeline("explain_document_ml").annotate("Please parse this sentence. Thanks")
import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline
res0: Map[String,Seq[String]] = Map(
  checked -> List(Please, parse, this, sentence, ., Thanks),
  document -> List(Please parse this sentence. Thanks),
  pos -> ArrayBuffer(VB, NN, DT, NN, ., NNS),
  lemmas -> List(Please, parse, this, sentence, ., Thanks),
  token -> List(Please, parse, this, sentence, ., Thanks),
  stems -> List(pleas, pars, thi, sentenc, ., thank),
  sentence -> List(Please parse this sentence., Thanks)
)
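Since annotate returns a Map[String, Seq[String]], each annotation type can be pulled out by its key. A small sketch using the keys shown in the output above:

val result = PretrainedPipeline("explain_document_ml").annotate("Please parse this sentence. Thanks")

// Pick out individual annotation types by key
result("pos")     // Seq(VB, NN, DT, NN, ., NNS)
result("lemmas")  // Seq(Please, parse, this, sentence, ., Thanks)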
val annotations = PretrainedPipeline("explain_document_ml").annotate(Array("We are very happy about SparkNLP", "And this is just another sentence"))

annotations.foreach(println)
Map(
  checked -> List(We, are, very, happy, about, SparkNLP),
  document -> List(We are very happy about SparkNLP),
  pos -> ArrayBuffer(PRP, VBP, RB, JJ, IN, NNP),
  lemmas -> List(We, be, very, happy, about, SparkNLP),
  token -> List(We, are, very, happy, about, SparkNLP),
  stems -> List(we, ar, veri, happi, about, sparknlp),
  sentence -> List(We are very happy about SparkNLP)
)
Map(
  checked -> List(And, this, is, just, another, sentence),
  document -> List(And this is just another sentence),
  pos -> ArrayBuffer(CC, DT, VBZ, RB, DT, NN),
  lemmas -> List(And, this, be, just, another, sentence),
  token -> List(And, this, is, just, another, sentence),
  stems -> List(and, thi, i, just, anoth, sentenc),
  sentence -> List(And this is just another sentence)
)
annotations: Array[Map[String,Seq[String]]] = Array(Map(...), Map(...))
// How about annotating the entire DataFrame
import spark.implicits._

val data = Seq("hello, this is an example sentence").toDF("mainColumn")

PretrainedPipeline("explain_document_ml").annotate(data, "mainColumn").show()
+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+
|                text|            document|            sentence|               token|             checked|              lemmas|               stems|                 pos|
+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+
|hello, this is an...|[[document, 0, 33...|[[document, 0, 33...|[[token, 0, 4, he...|[[token, 0, 4, he...|[[token, 0, 4, he...|[[token, 0, 4, he...|[[pos, 0, 4, UH, ...|
+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+

import spark.implicits._
data: org.apache.spark.sql.DataFrame = [mainColumn: string]
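Each annotation column holds an array of annotation structs, and the result field inside each struct carries the plain value. A sketch of extracting just the tokens and POS tags, assuming the standard annotation schema shown above:

val annotated = PretrainedPipeline("explain_document_ml").annotate(data, "mainColumn")

// Pull the plain values out of the annotation structs
annotated.selectExpr("token.result as tokens", "pos.result as tags").show(false)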

The following AdvancedPipeline is no longer available!

Homework Problem

Figure out what's wrong with this...

import com.johnsnowlabs.nlp.pretrained.pipelines.en.AdvancedPipeline

// Annotate with AdvancedPipeline: this may take some time the first time, while everything downloads :)
AdvancedPipeline().annotate("Please parse this sentence. Thanks")
AdvancedPipeline().annotate(Array("We are very happy about SparkNLP", "And this is just another sentence"))

notebook:1: error: object pipelines is not a member of package com.johnsnowlabs.nlp.pretrained
import com.johnsnowlabs.nlp.pretrained.pipelines.en.AdvancedPipeline
                                       ^
notebook:5: error: not found: value AdvancedPipeline
AdvancedPipeline().annotate(Array("We are very happy about SparkNLP", "And this is just another sentence"))
^
notebook:4: error: not found: value AdvancedPipeline
AdvancedPipeline().annotate("Please parse this sentence. Thanks")
^
val data = Seq("hello, this is an example sentence").toDF("text")

AdvancedPipeline().annotate(data, "text").show

Command skipped
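The fix: the pipelines.en package was removed in Spark NLP 2.x, so pretrained pipelines are now loaded by name through PretrainedPipeline instead. A sketch of a present-day equivalent, assuming "explain_document_dl" is the closest replacement for the removed AdvancedPipeline:

// Load the DL-based explain-document pipeline by name instead
val advanced = PretrainedPipeline("explain_document_dl")
advanced.annotate("Please parse this sentence. Thanks")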

Exercises

Let's load the parsed data from our S3 bucket now.
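A minimal sketch of reading the parsed data back, assuming it was stored as Parquet; the path below is a placeholder, not the actual course bucket:

// Placeholder path: substitute the real bucket and key
val parsed = spark.read.parquet("s3a://<your-bucket>/<path-to-parsed-data>")
parsed.show(5)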