Classification using Word2Vec
Word embeddings
Word embeddings map words to vectors of real numbers. The frequency analysis we did in another notebook is an example of this: the 1000 most common words in a collection of texts were mapped to a 1000-dimensional space using one-hot encoding, while all other words were sent to the zero vector. An array of words is then mapped to the sum of the one-hot encoded vectors.
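As a toy illustration (not the actual code from that notebook; the three-word vocabulary and the example post are made up), the one-hot sum can be sketched as follows:
// hypothetical sketch: one-hot encode a tiny vocabulary and sum the vectors of a post
val vocab = Array("hej", "barn", "pengar")   // stand-in for the "most common words"
val index = vocab.zipWithIndex.toMap         // word -> position in the vector
def oneHot(word: String): Array[Double] = {
  val v = Array.fill(vocab.length)(0.0)
  index.get(word).foreach(i => v(i) = 1.0)   // words outside the vocabulary stay the zero vector
  v
}
// a post is the element-wise sum of its words' one-hot vectors
val post = Seq("hej", "hej", "barn", "katt")
val post_embedding = post.map(oneHot).reduce((a, b) => a.zip(b).map { case (x, y) => x + y })
// post_embedding == Array(2.0, 1.0, 0.0)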
A more sophisticated word embedding is Word2Vec, which uses the skip-gram model and hierarchical softmax. The idea is to map each word to a vector that predicts the words around it well. We refer to Efficient Estimation of Word Representations in Vector Space and Distributed Representations of Words and Phrases and their Compositionality for details.
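More precisely, given a training corpus $w_1, \ldots, w_T$ and a context window of size $c$, the skip-gram model chooses the word vectors so as to maximize the average log-probability

$$\frac{1}{T} \sum_{t=1}^{T} \sum_{-c \le j \le c,\ j \ne 0} \log p(w_{t+j} \mid w_t),$$

where $p(w_{t+j} \mid w_t)$ is parameterized by the word vectors; hierarchical softmax is one way of making this probability cheap to evaluate. (This is the objective from the papers cited above.)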
The practical difference is that Word2Vec maps every word to a non-zero vector, and that the output dimension can be chosen freely. Also, the embedding itself has to be trained before use on some large collection of words. An array of words is mapped to the average of the vectors of its words.
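In other words, once every word has a dense vector, a document is represented by the element-wise average of the vectors of its words. A minimal sketch of what this amounts to, with made-up 3-dimensional vectors:
// hypothetical sketch: a document embedding as the average of its words' vectors
val wordVectors = Map(                       // toy "trained" vectors of dimension 3
  "hej"  -> Array(0.2, 0.1, 0.7),
  "barn" -> Array(0.9, 0.4, 0.0)
)
val doc = Seq("hej", "barn")
val sum = doc.flatMap(wordVectors.get).reduce((a, b) => a.zip(b).map { case (x, y) => x + y })
val doc_vector = sum.map(_ / doc.length)     // element-wise average over the words of the document
// doc_vector == Array(0.55, 0.25, 0.35)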
This case study uses the sex forums on Flashback and Familjeliv. The aim is to determine which platform a thread comes from by applying logistic regression to the resulting word embeddings.
Preamble
This section loads libraries and imports functions from another notebook.
// import required libraries
import org.apache.spark.ml.feature.{Word2Vec,Word2VecModel}
import org.apache.spark.ml.{Pipeline, PipelineModel}
import org.apache.spark.ml.linalg.{Vector, Vectors}
import org.apache.spark.sql.Row
import org.apache.spark.ml.feature.RegexTokenizer
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.evaluation.BinaryClassificationEvaluator
/scalable-data-science/000_0-sds-3-x-projects/student-project-01_group-TheTwoCultures/01_load_data
Loading the data
To extract the data from the .xml files we use get_dataset().
Processing the data takes quite some time, so we also supply a second cell that loads saved results.
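Both helpers come from the imported notebook. For reference only, hypothetical stand-ins with the same signatures might look like this, assuming the dataframes are stored in a Spark-native format such as Parquet (the actual helpers may well differ):
// hypothetical stand-ins for the imported helpers defined in 01_load_data
def save_df_sketch(df: org.apache.spark.sql.DataFrame, filePath: String): Unit =
  df.write.mode("overwrite").parquet(filePath)   // assumes Parquet storage
def load_df_sketch(filePath: String): org.apache.spark.sql.DataFrame =
  spark.read.parquet(filePath)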
// process .xml-files
val file_name = "dbfs:/datasets/student-project-01/familjeliv/familjeliv-sexsamlevnad.xml"
val df = get_dataset(file_name)
val file_name2 = "dbfs:/datasets/student-project-01/flashback/flashback-sex.xml"
val df2 = get_dataset(file_name2)
// paths to saved dataframes
val file_path_familjeliv = "dbfs:/datasets/student-project-01/familjeliv/familjeliv-sexsamlevnad_df"
val file_path_flashback = "dbfs:/datasets/student-project-01/flashback/flashback-sex_df"
// load saved data frame
val df_familjeliv = load_df(file_path_familjeliv)
val df_flashback = load_df(file_path_flashback)
file_path_familjeliv: String = dbfs:/datasets/student-project-01/familjeliv/familjeliv-sexsamlevnad_df
file_path_flashback: String = dbfs:/datasets/student-project-01/flashback/flashback-sex_df
df_familjeliv: org.apache.spark.sql.DataFrame = [thread_id: string, thread_title: string ... 5 more fields]
df_flashback: org.apache.spark.sql.DataFrame = [thread_id: string, thread_title: string ... 5 more fields]
The dataframes consist of 7 fields:
* thread_id - a unique numerical identifier for each thread
* thread_title - the title of the thread, set by the person who created it
* w - a comma-separated string of all posts in a thread
* forum_id - a numerical forum identifier
* forum_title - the name of the forum to which the thread belongs
* platform - the platform from which the thread comes (flashback or familjeliv)
* corpus_id - the corpus from which the data was gathered
Let's have a look at the dataframes.
display(df_familjeliv)
We add labels and merge the two dataframes.
val df = df_flashback.withColumn("c", lit(0.0)).union(df_familjeliv.withColumn("c", lit(1.0)))
df: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [thread_id: string, thread_title: string ... 6 more fields]
Preprocessing the data
Next, we must split and clean the text. For this we use a RegexTokenizer. We do not eliminate stop words, though a stage for doing so is sketched below.
// define the tokenizer
val tokenizer = new RegexTokenizer()
.setPattern("(?U),") // break by commas
.setMinTokenLength(5) // Filter away tokens with length < 5
.setInputCol("w") // name of the input column
.setOutputCol("text") // name of the output column
tokenizer: org.apache.spark.ml.feature.RegexTokenizer = regexTok_f0701eaf4f60
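Had we wanted to drop stop words, a StopWordsRemover stage could be chained after the tokenizer. A sketch (Spark ships a default Swedish stop-word list; the output column name below is just an illustrative choice):
import org.apache.spark.ml.feature.StopWordsRemover
// optional stage: remove Swedish stop words from the tokenized text
val stopWordsRemover = new StopWordsRemover()
  .setStopWords(StopWordsRemover.loadDefaultStopWords("swedish"))
  .setInputCol("text")          // the output column of the tokenizer
  .setOutputCol("text_filtered")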
Let's tokenize and check out the result.
// tokenize the dataframe
val df_tokenized = tokenizer.transform(df)
display(df_tokenized.select("w","text"))
Defining and training a Word2Vec model
We use the text from the threads to train the Word2Vec model. First we define the model.
// define the model
val word2Vec = new Word2Vec()
.setInputCol("text")
.setOutputCol("result")
.setVectorSize(200)
.setMinCount(0)
word2Vec: org.apache.spark.ml.feature.Word2Vec = w2v_945abe6fab57
We train the model by fitting it to a dataframe; here we use the tokenized one. Training the model takes roughly 2h30m, so we save the result to avoid redoing the calculation.
// train it
val word2Vec_model = word2Vec.fit(df_tokenized)
// save it
word2Vec_model.save("dbfs:/datasets/student-project-01/word2vec_model_sex")
We can also load a saved model.
// load a saved model
val model = Word2VecModel.load("dbfs:/datasets/student-project-01/word2vec_model_sex")
model: org.apache.spark.ml.feature.Word2VecModel = w2v_854b46dceacc
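Whether trained or loaded, the embedding itself can be inspected: getVectors returns the learned word vectors, and findSynonyms returns the words whose vectors lie closest (by cosine similarity) to that of a given word. The query word below is just an illustrative choice:
// look at a few of the learned word vectors
model.getVectors.show(5, false)
// the ten words closest to a given word in the embedding
model.findSynonyms("barn", 10).show(false)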
(Output of the data-loading notebook imported in the preamble: it defines the helper functions read_xml, get_dataset, save_df, load_df and no_forums, sets the familjeliv and flashback data roots, and reports that the four saved dataframes already exist.)
Embedding using Word2Vec
Let's embed the text and view the results.
// transform the text using the model
val embedded_text = model.transform(df_tokenized)
embedded_text: org.apache.spark.sql.DataFrame = [thread_id: string, thread_title: string ... 8 more fields]
Let's have a look!
display(embedded_text.select("c","result","text"))
Classification using Word2Vec
For classification we use logistic regression, so that we can compare with the earlier results. First we define the logistic regression model, using the same settings as before.
// Logistic regression
val logreg = new LogisticRegression()
.setLabelCol("c")
.setFeaturesCol("result")
.setMaxIter(100)
.setRegParam(0.0001)
.setElasticNetParam(0.5)
logreg: org.apache.spark.ml.classification.LogisticRegression = logreg_55ef614f783e
The easiest way to do the classification is to gather the tokenizer, Word2Vec and logistic regression into a pipeline.
val pipeline = new Pipeline().setStages(Array(tokenizer, word2Vec, logreg))
pipeline: org.apache.spark.ml.Pipeline = pipeline_2254409a91a2
We split the data into training and test sets.
val random_order = df.orderBy(rand())
val splits = random_order.randomSplit(Array(0.8, 0.2))
val training = splits(0)
val test = splits(1)
random_order: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [thread_id: string, thread_title: string ... 6 more fields]
splits: Array[org.apache.spark.sql.Dataset[org.apache.spark.sql.Row]] = Array([thread_id: string, thread_title: string ... 6 more fields], [thread_id: string, thread_title: string ... 6 more fields])
training: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [thread_id: string, thread_title: string ... 6 more fields]
test: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [thread_id: string, thread_title: string ... 6 more fields]
Fit the pipeline to the training data. Since this refits the Word2Vec stage on the training split, it will take a while, so we make sure to save the result.
// fit the model to the training data
val logreg_model = pipeline.fit(training)
// save the model to filesystem
logreg_model.save("dbfs:/datasets/student-project-01/word2vec_logreg_model")
// load saved model
val loaded_model = PipelineModel.load("dbfs:/datasets/student-project-01/word2vec_logreg_model")
loaded_model: org.apache.spark.ml.PipelineModel = pipeline_25839437fccb
val predictions = loaded_model.transform(test).orderBy(rand())
predictions: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [thread_id: string, thread_title: string ... 11 more fields]
predictions.select("c","prediction","probability").show(30,false)
+---+----------+------------------------------------------+
|c |prediction|probability |
+---+----------+------------------------------------------+
|1.0|1.0 |[0.005855361335036372,0.9941446386649635] |
|1.0|1.0 |[0.2712120894396273,0.7287879105603726] |
|1.0|1.0 |[0.0017886928649958375,0.9982113071350042]|
|1.0|1.0 |[2.263165652125581E-4,0.9997736834347875] |
|0.0|1.0 |[0.42059601285820825,0.5794039871417918] |
|1.0|1.0 |[6.566687042616189E-4,0.9993433312957383] |
|0.0|0.0 |[0.7054463412596114,0.29455365874038864] |
|1.0|1.0 |[0.03103196407369915,0.9689680359263009] |
|0.0|0.0 |[0.9294663779954874,0.07053362200451263] |
|1.0|1.0 |[0.13974006800394764,0.8602599319960523] |
|1.0|1.0 |[0.08228914085494436,0.9177108591450557] |
|0.0|0.0 |[0.9788989176701534,0.021101082329846567] |
|0.0|0.0 |[0.9975070728891363,0.0024929271108637065]|
|1.0|1.0 |[0.0010781075297480556,0.998921892470252] |
|0.0|0.0 |[0.9253825302681451,0.07461746973185476] |
|1.0|1.0 |[0.01751495449884683,0.9824850455011531] |
|1.0|0.0 |[0.9864736560167631,0.013526343983237045] |
|1.0|1.0 |[0.002472519507918196,0.9975274804920817] |
|0.0|0.0 |[0.6174112612306129,0.3825887387693872] |
|0.0|0.0 |[0.7130899106721519,0.2869100893278482] |
|0.0|0.0 |[0.9263664682801233,0.07363353171987672] |
|0.0|0.0 |[0.9561455191484204,0.04385448085157954] |
|0.0|0.0 |[0.5835745861693306,0.41642541383066944] |
|1.0|1.0 |[0.4296249407516458,0.5703750592483542] |
|1.0|1.0 |[0.0032969395487662213,0.9967030604512337]|
|1.0|1.0 |[0.008645133666934816,0.9913548663330651] |
|1.0|1.0 |[4.1492836709996625E-5,0.9999585071632902]|
|0.0|1.0 |[0.43037903982909986,0.5696209601709002] |
|0.0|1.0 |[0.43707897641990706,0.562921023580093] |
|1.0|1.0 |[0.29846393214228517,0.7015360678577148] |
+---+----------+------------------------------------------+
only showing top 30 rows
val evaluator = new BinaryClassificationEvaluator().setLabelCol("c")
evaluator.evaluate(predictions)
evaluator: org.apache.spark.ml.evaluation.BinaryClassificationEvaluator = binEval_0d518432d17a
res13: Double = 0.9499399325125091
An area under the ROC curve (AUC) of 0.95 is good, but not notably better than the earlier, conceptually simpler model. More is not always better!
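Other metrics are easy to obtain from the same predictions. A sketch computing the area under the precision-recall curve and the raw accuracy, using the predictions dataframe above:
import org.apache.spark.sql.functions.col
// area under the precision-recall curve, using the same evaluator with a different metric
val prEvaluator = new BinaryClassificationEvaluator()
  .setLabelCol("c")
  .setMetricName("areaUnderPR")
println(prEvaluator.evaluate(predictions))
// plain accuracy: the fraction of threads assigned to the correct platform
val accuracy = predictions.filter(col("c") === col("prediction")).count().toDouble / predictions.count()
println(accuracy)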
Previously, we classified entire threads. Let's see if it works as well on thread titles.
val df_threads = df.select("c","thread_title").withColumnRenamed("thread_title","w")
val evaluation = loaded_model.transform(df_threads).orderBy(rand())
evaluator.evaluate(evaluation)
df_threads: org.apache.spark.sql.DataFrame = [c: double, w: string]
evaluation: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [c: double, w: string ... 5 more fields]
res14: Double = 0.5261045467524708
This did not work at all: an AUC of roughly 0.53 is essentially equivalent to guessing randomly. Thread titles contain only a few words, so this is not surprising.
Note: the same model was used for both classification tasks. Since the thread titles were not part of the thread texts, the entire dataset could conceivably be used for training here. Whether or not this would improve the results is unclear.
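If one wanted to test that idea, the experiment only needs the pipeline, dataframes and evaluator already defined above; a sketch (note that refitting the pipeline retrains the Word2Vec stage, so it takes as long as the original run):
// hypothetical experiment: fit the pipeline on all threads, then evaluate on the thread titles
val full_model = pipeline.fit(df)
val title_predictions = full_model.transform(df_threads)
println(evaluator.evaluate(title_predictions))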