035_LDA_CornellMovieDialogs(Scala)

Topic Modeling of Movie Dialogs with Latent Dirichlet Allocation

Let us cluster the conversations from different movies!

This notebook will provide a brief algorithm summary, links for further reading, and an example of how to use LDA for Topic Modeling.

Not yet tested in Spark 2.2 (see notebook 034 for syntactic issues, if any).

Algorithm Summary

Readings for LDA

Also read the methodological and more formal papers cited in the above links if you want to know more.

Let's get a bird's eye view of LDA from http://www.cs.columbia.edu/~blei/papers/Blei2012.pdf next.

  • See pictures (hopefully you read the paper last night!)
  • Algorithm of the generative model (this is unsupervised clustering)
  • For a careful introduction to the topic see Sections 27.3 and 27.4 (pages 950-970) of Murphy's Machine Learning: A Probabilistic Perspective, MIT Press, 2012.
  • We will be quite application focussed or applied here!
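
As a compact reminder of the generative model mentioned above, here is a sketch of the standard LDA generative process in the notation commonly used in the Blei (2012) review. The symbols are assumptions of this sketch rather than names used elsewhere in this notebook: K topics, D documents (here, one document per conversation), N_d words in document d, and Dirichlet hyperparameters \eta and \alpha.

```latex
\begin{align*}
\beta_k &\sim \mathrm{Dirichlet}(\eta), & k &= 1,\dots,K
  && \text{(topic: a distribution over the vocabulary)} \\
\theta_d &\sim \mathrm{Dirichlet}(\alpha), & d &= 1,\dots,D
  && \text{(topic proportions of document } d\text{)} \\
z_{d,n} \mid \theta_d &\sim \mathrm{Categorical}(\theta_d), & n &= 1,\dots,N_d
  && \text{(topic assignment of word } n\text{)} \\
w_{d,n} \mid z_{d,n}, \beta_{1:K} &\sim \mathrm{Categorical}(\beta_{z_{d,n}})
  &&&& \text{(observed word)}
\end{align*}
```

Only the words w_{d,n} are observed; fitting LDA means inferring the topics \beta_k and the per-document proportions \theta_d from them, which is why the method is unsupervised.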

Probabilistic Topic Modeling Example

This is an outline of our Topic Modeling workflow. Feel free to jump to any subtopic to find out more.

  • Step 0. Dataset Review
  • Step 1. Downloading and Loading Data into DBFS
    • (Step 1. only needs to be done once per shard - see details at the end of the notebook for Step 1.)
  • Step 2. Loading the Data and Data Cleaning
  • Step 3. Text Tokenization
  • Step 4. Remove Stopwords
  • Step 5. Vector of Token Counts
  • Step 6. Create LDA model with Online Variational Bayes
  • Step 7. Review Topics
  • Step 8. Model Tuning - Refilter Stopwords
  • Step 9. Create LDA model with Expectation Maximization
  • Step 10. Visualize Results

Step 0. Dataset Review

In this example, we will use the Cornell Movie Dialogs Corpus.

Here is the README.txt:



Cornell Movie-Dialogs Corpus

Distributed together with:

"Chameleons in imagined conversations: A new approach to understanding coordination of linguistic style in dialogs" Cristian Danescu-Niculescu-Mizil and Lillian Lee Proceedings of the Workshop on Cognitive Modeling and Computational Linguistics, ACL 2011.

(this paper is included in this zip file)

NOTE: If you have results to report on these corpora, please send email to cristian@cs.cornell.edu or llee@cs.cornell.edu so we can add you to our list of people using this data. Thanks!

Contents of this README:

    A) Brief description
    B) Files description
    C) Details on the collection procedure
    D) Contact

A) Brief description:

This corpus contains a metadata-rich collection of fictional conversations extracted from raw movie scripts:

  • 220,579 conversational exchanges between 10,292 pairs of movie characters
  • involves 9,035 characters from 617 movies
  • in total 304,713 utterances
  • movie metadata included:
      - genres
      - release year
      - IMDB rating
      - number of IMDB votes
      - IMDB rating
    
  • character metadata included:
      - gender (for 3,774 characters)
      - position on movie credits (3,321 characters)
    

B) Files description:

In all files the field separator is " +++$+++ "

  • movie_titles_metadata.txt

      - contains information about each movie title
      - fields:
              - movieID,
              - movie title,
              - movie year,
              - IMDB rating,
              - no. IMDB votes,
              - genres in the format ['genre1','genre2',...,'genreN']
    
  • movie_characters_metadata.txt

      - contains information about each movie character
      - fields:
              - characterID
              - character name
              - movieID
              - movie title
              - gender ("?" for unlabeled cases)
              - position in credits ("?" for unlabeled cases)
    
  • movie_lines.txt

      - contains the actual text of each utterance
      - fields:
              - lineID
              - characterID (who uttered this phrase)
              - movieID
              - character name
              - text of the utterance
    
  • movie_conversations.txt

      - the structure of the conversations
      - fields
              - characterID of the first character involved in the conversation
              - characterID of the second character involved in the conversation
              - movieID of the movie in which the conversation occurred
              - list of the utterances that make the conversation, in chronological
                      order: ['lineID1','lineID2',...,'lineIDN']
                      has to be matched with movie_lines.txt to reconstruct the actual content
    
  • raw_script_urls.txt

      - the urls from which the raw sources were retrieved
    

C) Details on the collection procedure:

We started from raw publicly available movie scripts (sources acknowledged in raw_script_urls.txt). In order to collect the metadata necessary for this study and to distinguish between two script versions of the same movie, we automatically matched each script with an entry in movie database provided by IMDB (The Internet Movie Database; data interfaces available at http://www.imdb.com/interfaces). Some amount of manual correction was also involved. When more than one movie with the same title was found in IMBD, the match was made with the most popular title (the one that received most IMDB votes)

After discarding all movies that could not be matched or that had less than 5 IMDB votes, we were left with 617 unique titles with metadata including genre, release year, IMDB rating and no. of IMDB votes and cast distribution. We then identified the pairs of characters that interact and separated their conversations automatically using simple data processing heuristics. After discarding all pairs that exchanged less than 5 conversational exchanges there were 10,292 left, exchanging 220,579 conversational exchanges (304,713 utterances). After automatically matching the names of the 9,035 involved characters to the list of cast distribution, we used the gender of each interpreting actor to infer the fictional gender of a subset of 3,321 movie characters (we raised the number of gendered 3,774 characters through manual annotation). Similarly, we collected the end credit position of a subset of 3,321 characters as a proxy for their status.

D) Contact:

Please email any questions to: cristian@cs.cornell.edu (Cristian Danescu-Niculescu-Mizil)



Step 2. Loading the Data and Data Cleaning

We have already downloaded the corpus with wget and placed it in our distributed file system (this takes about 1 minute). To repeat these steps, or to download data from another source, follow the instructions under Step 1. Downloading and Loading Data into DBFS at the bottom of this notebook.

Let's make sure these files are in dbfs now:

// this is where the data resides in dbfs (see below to download it first, if you go to a new shard!)
display(dbutils.fs.ls("dbfs:/datasets/sds/nlp/cornell_movie_dialogs_corpus/")) 
path | name | size
dbfs:/datasets/sds/nlp/cornell_movie_dialogs_corpus/README.txt | README.txt | 4181
dbfs:/datasets/sds/nlp/cornell_movie_dialogs_corpus/movie_characters_metadata.txt | movie_characters_metadata.txt | 705695
dbfs:/datasets/sds/nlp/cornell_movie_dialogs_corpus/movie_conversations.txt | movie_conversations.txt | 6760930
dbfs:/datasets/sds/nlp/cornell_movie_dialogs_corpus/movie_lines.txt | movie_lines.txt | 34641919
dbfs:/datasets/sds/nlp/cornell_movie_dialogs_corpus/movie_titles_metadata.txt | movie_titles_metadata.txt | 67289
dbfs:/datasets/sds/nlp/cornell_movie_dialogs_corpus/raw_script_urls.txt | raw_script_urls.txt | 56177

Conversations Data

sc.textFile("dbfs:/datasets/sds/nlp/cornell_movie_dialogs_corpus/movie_conversations.txt").top(5).foreach(println)
u999 +++$+++ u1006 +++$+++ m65 +++$+++ ['L227588', 'L227589', 'L227590', 'L227591', 'L227592', 'L227593', 'L227594', 'L227595', 'L227596']
u998 +++$+++ u1005 +++$+++ m65 +++$+++ ['L228159', 'L228160']
u998 +++$+++ u1005 +++$+++ m65 +++$+++ ['L228157', 'L228158']
u998 +++$+++ u1005 +++$+++ m65 +++$+++ ['L228130', 'L228131']
u998 +++$+++ u1005 +++$+++ m65 +++$+++ ['L228127', 'L228128', 'L228129']
// Load the text file and pair each line with its index in the file
val conversationsRaw = sc.textFile("dbfs:/datasets/sds/nlp/cornell_movie_dialogs_corpus/movie_conversations.txt").zipWithIndex()
conversationsRaw: org.apache.spark.rdd.RDD[(String, Long)] = ZippedWithIndexRDD[3709] at zipWithIndex at command-753740454082219:2

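Review 5 lines to get a sense for the data format (top(5) returns the five largest elements under the default ordering, not the first five lines of the file).

conversationsRaw.top(5).foreach(println) // the five largest elements of the RDD under the default ordering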
(u999 +++$+++ u1006 +++$+++ m65 +++$+++ ['L227588', 'L227589', 'L227590', 'L227591', 'L227592', 'L227593', 'L227594', 'L227595', 'L227596'],8954)
(u998 +++$+++ u1005 +++$+++ m65 +++$+++ ['L228159', 'L228160'],8952)
(u998 +++$+++ u1005 +++$+++ m65 +++$+++ ['L228157', 'L228158'],8951)
(u998 +++$+++ u1005 +++$+++ m65 +++$+++ ['L228130', 'L228131'],8950)
(u998 +++$+++ u1005 +++$+++ m65 +++$+++ ['L228127', 'L228128', 'L228129'],8949)
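
The output above starts at u999 rather than u0 because top(n) sorts by the element ordering (lexicographic for strings), whereas take(n) keeps file order. The following is a minimal local sketch of that difference on an ordinary Seq, not on an RDD; the object name, the helper names and the sample IDs are hypothetical.

```scala
object TopVsTake {
  // Local analogue of RDD.top(n): the n largest elements under the
  // ordering, returned in descending order.
  def topLocal(lines: Seq[String], n: Int): Seq[String] =
    lines.sorted(Ordering[String].reverse).take(n)

  // Local analogue of RDD.take(n): the first n elements in stored order.
  def takeLocal(lines: Seq[String], n: Int): Seq[String] =
    lines.take(n)

  // Hypothetical character IDs standing in for the start of each file line.
  val ids: Seq[String] = Seq("u0", "u2", "u10", "u998", "u999")
}
```

With these sample IDs, topLocal picks u999 and u998 first, while takeLocal returns u0 and u2. Note also that the ordering is lexicographic, so "u2" sorts above "u10".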
conversationsRaw.count // there are over 83,000 conversations in total
res1: Long = 83097
import scala.util.{Failure, Success}

val regexConversation = """\s*(\w+)\s+(\+{3}\$\+{3})\s*(\w+)\s+(\2)\s*(\w+)\s+(\2)\s*(\[.*\]\s*$)""".r

case class conversationLine(a: String, b: String, c: String, d: String)

val conversationsRaw = sc.textFile("dbfs:/datasets/sds/nlp/cornell_movie_dialogs_corpus/movie_conversations.txt")
 .zipWithIndex()
  .map(x => 
          {
            val id:Long = x._2
            val line = x._1
            val pLine = regexConversation.findFirstMatchIn(line)
                               .map(m => conversationLine(m.group(1), m.group(3), m.group(5), m.group(7))) 
                                  match {
                                    case Some(l) => Success(l)
                                    case None => Failure(new Exception(s"Non matching input: $line"))
                                  }
              (id,pLine)
           }
  )
import scala.util.{Failure, Success}
regexConversation: scala.util.matching.Regex = \s*(\w+)\s+(\+{3}\$\+{3})\s*(\w+)\s+(\2)\s*(\w+)\s+(\2)\s*(\[.*\]\s*$)
defined class conversationLine
conversationsRaw: org.apache.spark.rdd.RDD[(Long, Product with Serializable with scala.util.Try[conversationLine])] = MapPartitionsRDD[3713] at map at command-753740454082223:9
conversationsRaw.filter(x => x._2.isSuccess).count()
res2: Long = 83097
conversationsRaw.filter(x => x._2.isFailure).count()
res3: Long = 0
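
The parsing step can be tried outside Spark on a single line. The sketch below reuses the regular expression from the cell above; the object name, the case class ConversationRecord and its field names are hypothetical and only serve the illustration. Declaring the return type as Try[...] also avoids the inferred type Product with Serializable with Try[...] seen in the cell output.

```scala
import scala.util.{Failure, Success, Try}

object ConversationParser {
  // Hypothetical record type; the notebook's own case class is conversationLine.
  final case class ConversationRecord(
      firstCharacterID: String,
      secondCharacterID: String,
      movieID: String,
      lineIDs: String)

  // Same pattern as regexConversation: group 2 captures the separator
  // +++$+++ and \2 is a back-reference that requires the same text again.
  val pattern =
    """\s*(\w+)\s+(\+{3}\$\+{3})\s*(\w+)\s+(\2)\s*(\w+)\s+(\2)\s*(\[.*\]\s*$)""".r

  // Success with the parsed record, or Failure carrying the offending line.
  def parse(line: String): Try[ConversationRecord] =
    pattern.findFirstMatchIn(line) match {
      case Some(m) =>
        Success(ConversationRecord(m.group(1), m.group(3), m.group(5), m.group(7)))
      case None =>
        Failure(new IllegalArgumentException(s"Non matching input: $line"))
    }
}
```

Keeping the failures instead of dropping them is what makes the isSuccess and isFailure counts above possible: a non-zero failure count would point at lines the regular expression does not cover.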

Each element of conversationsRaw pairs the conversation index with the parsed record, whose last field lists the line IDs of that conversation.

conversationsRaw.filter(x => x._2.isSuccess).take(5).foreach(println)
(0,Success(conversationLine(u0,u2,m0,['L194', 'L195', 'L196', 'L197'])))
(1,Success(conversationLine(u0,u2,m0,['L198', 'L199'])))
(2,Success(conversationLine(u0,u2,m0,['L200', 'L201', 'L202', 'L203'])))
(3,Success(conversationLine(u0,u2,m0,['L204', 'L205', 'L206'])))
(4,Success(conversationLine(u0,u2,m0,['L207', 'L208'])))

Let's create a conversations DataFrame that holds just the conversation ID, the position of each line within its conversation, and the line ID.

val conversations 
    = conversationsRaw
      .filter(x => x._2.isSuccess)
      .flatMap { 
        case (id,Success(l))  
                  => { val conv = l.d.replace("[","").replace("]","").replace("'","").replace(" ","")
                       val convLinesIndexed = conv.split(",").zipWithIndex
                       convLinesIndexed.map( cLI => (id, cLI._2, cLI._1))
                      }
       }.toDF("conversationID","intraConversationID","lineID")
notebook:4: warning: match may not be exhaustive.
It would fail on the following input: (_, Failure(_))
      .flatMap {
               ^
conversations: org.apache.spark.sql.DataFrame = [conversationID: bigint, intraConversationID: int ... 1 more field]
conversations.show(15)
+--------------+-------------------+------+
|conversationID|intraConversationID|lineID|
+--------------+-------------------+------+
|             0|                  0|  L194|
|             0|                  1|  L195|
|             0|                  2|  L196|
|             0|                  3|  L197|
|             1|                  0|  L198|
|             1|                  1|  L199|
|             2|                  0|  L200|
|             2|                  1|  L201|
|             2|                  2|  L202|
|             2|                  3|  L203|
|             3|                  0|  L204|
|             3|                  1|  L205|
|             3|                  2|  L206|
|             4|                  0|  L207|
|             4|                  1|  L208|
+--------------+-------------------+------+
only showing top 15 rows
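
The "match may not be exhaustive" warning arises because the function passed to flatMap only handles the Success case. One way to avoid it is collect with a partial function, which filters and maps in one step; RDDs offer a collect overload that takes a partial function, as ordinary Scala collections do. The sketch below shows the idea on a local Seq; the object and helper names are hypothetical, and the Try here wraps the raw line-ID string rather than the notebook's conversationLine.

```scala
import scala.util.{Failure, Success, Try}

object IndexLineIDs {
  // Turn "['L194', 'L195']" into Seq(("L194", 0), ("L195", 1)).
  def splitLineIDs(raw: String): Seq[(String, Int)] =
    raw.replaceAll("""[\[\]' ]""", "").split(",").toSeq.zipWithIndex

  // Keep only successfully parsed conversations and emit one
  // (conversationID, intraConversationID, lineID) triple per line.
  def explode(parsed: Seq[(Long, Try[String])]): Seq[(Long, Int, String)] =
    parsed.collect {
      case (id, Success(raw)) =>
        splitLineIDs(raw).map { case (lineID, position) => (id, position, lineID) }
    }.flatten
}
```

Because the partial function is simply undefined for Failure, no separate filter on isSuccess is needed and the compiler has nothing to warn about. The same pattern applies to the later cells that use map { case Success(l) => l }.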

Movie Titles

val moviesMetaDataRaw = sc.textFile("dbfs:/datasets/sds/nlp/cornell_movie_dialogs_corpus/movie_titles_metadata.txt")
moviesMetaDataRaw.top(5).foreach(println)
m99 +++$+++ indiana jones and the temple of doom +++$+++ 1984 +++$+++ 7.50 +++$+++ 112054 +++$+++ ['action', 'adventure']
m98 +++$+++ indiana jones and the last crusade +++$+++ 1989 +++$+++ 8.30 +++$+++ 174947 +++$+++ ['action', 'adventure', 'thriller', 'action', 'adventure', 'fantasy']
m97 +++$+++ independence day +++$+++ 1996 +++$+++ 6.60 +++$+++ 151698 +++$+++ ['action', 'adventure', 'sci-fi', 'thriller']
m96 +++$+++ invaders from mars +++$+++ 1953 +++$+++ 6.40 +++$+++ 2115 +++$+++ ['horror', 'sci-fi']
m95 +++$+++ i am legend +++$+++ 2007 +++$+++ 7.10 +++$+++ 156084 +++$+++ ['drama', 'sci-fi', 'thriller']
moviesMetaDataRaw: org.apache.spark.rdd.RDD[String] = dbfs:/datasets/sds/nlp/cornell_movie_dialogs_corpus/movie_titles_metadata.txt MapPartitionsRDD[3722] at textFile at command-753740454082232:1
moviesMetaDataRaw.count() // number of movies
res8: Long = 617
import scala.util.{Failure, Success}

/*  - contains information about each movie title
  - fields:
          - movieID,
          - movie title,
          - movie year,
          - IMDB rating,
          - no. IMDB votes,
          - genres in the format ['genre1','genre2',...,'genreN']
          */
val regexMovieMetaData = """\s*(\w+)\s+(\+{3}\$\+{3})\s*(.+)\s+(\2)\s+(.+)\s+(\2)\s+(.+)\s+(\2)\s+(.+)\s+(\2)\s+(\[.*\]\s*$)""".r

case class lineInMovieMetaData(movieID: String, movieTitle: String, movieYear: String, IMDBRating: String, NumIMDBVotes: String, genres: String)

val moviesMetaDataRaw = sc.textFile("dbfs:/datasets/sds/nlp/cornell_movie_dialogs_corpus/movie_titles_metadata.txt")
  .map(line => 
          {
            val pLine = regexMovieMetaData.findFirstMatchIn(line)
                               .map(m => lineInMovieMetaData(m.group(1), m.group(3), m.group(5), m.group(7), m.group(9), m.group(11))) 
                                  match {
                                    case Some(l) => Success(l)
                                    case None => Failure(new Exception(s"Non matching input: $line"))
                                  }
              pLine
           }
  )
import scala.util.{Failure, Success}
regexMovieMetaData: scala.util.matching.Regex = \s*(\w+)\s+(\+{3}\$\+{3})\s*(.+)\s+(\2)\s+(.+)\s+(\2)\s+(.+)\s+(\2)\s+(.+)\s+(\2)\s+(\[.*\]\s*$)
defined class lineInMovieMetaData
moviesMetaDataRaw: org.apache.spark.rdd.RDD[Product with Serializable with scala.util.Try[lineInMovieMetaData]] = MapPartitionsRDD[3725] at map at command-753740454082234:17
moviesMetaDataRaw.count
res9: Long = 617
moviesMetaDataRaw.filter(x => x.isSuccess).count()
res10: Long = 617
moviesMetaDataRaw.filter(x => x.isSuccess).take(10).foreach(println)
Success(lineInMovieMetaData(m0,10 things i hate about you,1999,6.90,62847,['comedy', 'romance']))
Success(lineInMovieMetaData(m1,1492: conquest of paradise,1992,6.20,10421,['adventure', 'biography', 'drama', 'history']))
Success(lineInMovieMetaData(m2,15 minutes,2001,6.10,25854,['action', 'crime', 'drama', 'thriller']))
Success(lineInMovieMetaData(m3,2001: a space odyssey,1968,8.40,163227,['adventure', 'mystery', 'sci-fi']))
Success(lineInMovieMetaData(m4,48 hrs.,1982,6.90,22289,['action', 'comedy', 'crime', 'drama', 'thriller']))
Success(lineInMovieMetaData(m5,the fifth element,1997,7.50,133756,['action', 'adventure', 'romance', 'sci-fi', 'thriller']))
Success(lineInMovieMetaData(m6,8mm,1999,6.30,48212,['crime', 'mystery', 'thriller']))
Success(lineInMovieMetaData(m7,a nightmare on elm street 4: the dream master,1988,5.20,13590,['fantasy', 'horror', 'thriller']))
Success(lineInMovieMetaData(m8,a nightmare on elm street: the dream child,1989,4.70,11092,['fantasy', 'horror', 'thriller']))
Success(lineInMovieMetaData(m9,the atomic submarine,1959,4.90,513,['sci-fi', 'thriller']))
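
Since the README states that every file uses the field separator " +++$+++ ", a simpler alternative to one regular expression per file is to split each line on that literal separator. This is a different technique from the regex-based parsing used in this notebook, shown here only as a sketch; the object and method names are hypothetical.

```scala
import java.util.regex.Pattern

object SeparatorSplit {
  // Field separator as documented in the corpus README.
  val separator: String = " +++$+++ "

  // Pattern.quote makes +, $ and the spaces match literally; the limit -1
  // keeps trailing empty fields, so a line with an empty last field still
  // yields the expected number of columns.
  def fields(line: String): Array[String] =
    line.split(Pattern.quote(separator), -1)
}
```

Unlike the regex approach, this does not validate the shape of each field, so a check on the number of fields returned (6 for movie_titles_metadata.txt according to the README) would still be needed before building a case class.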
//moviesMetaDataRaw.filter(x => x.isFailure).take(10).foreach(println) // to regex refine for casting
val moviesMetaData 
    = moviesMetaDataRaw
      .filter(x => x.isSuccess)
      .map { case Success(l) => l }
      .toDF().select("movieID","movieTitle","movieYear")
notebook:4: warning: match may not be exhaustive.
It would fail on the following input: Failure(_)
      .map { case Success(l) => l }
           ^
moviesMetaData: org.apache.spark.sql.DataFrame = [movieID: string, movieTitle: string ... 1 more field]
moviesMetaData.show(10,false)
+-------+---------------------------------------------+---------+
|movieID|movieTitle                                   |movieYear|
+-------+---------------------------------------------+---------+
|m0     |10 things i hate about you                   |1999     |
|m1     |1492: conquest of paradise                   |1992     |
|m2     |15 minutes                                   |2001     |
|m3     |2001: a space odyssey                        |1968     |
|m4     |48 hrs.                                      |1982     |
|m5     |the fifth element                            |1997     |
|m6     |8mm                                          |1999     |
|m7     |a nightmare on elm street 4: the dream master|1988     |
|m8     |a nightmare on elm street: the dream child   |1989     |
|m9     |the atomic submarine                         |1959     |
+-------+---------------------------------------------+---------+
only showing top 10 rows

Lines Data

val linesRaw = sc.textFile("dbfs:/datasets/sds/nlp/cornell_movie_dialogs_corpus/movie_lines.txt")
linesRaw: org.apache.spark.rdd.RDD[String] = dbfs:/datasets/sds/nlp/cornell_movie_dialogs_corpus/movie_lines.txt MapPartitionsRDD[3733] at textFile at command-753740454082242:1
linesRaw.count() // number of lines making up the conversations
res15: Long = 304713

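Review 5 lines to get a sense for the data format (again, top(5) returns the five largest lines under string ordering, not the first five lines of the file).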

linesRaw.top(5).foreach(println)
L99999 +++$+++ u4166 +++$+++ m278 +++$+++ DULANEY +++$+++ You didn't know about it before that?
L99998 +++$+++ u4168 +++$+++ m278 +++$+++ JOANNE +++$+++ To show you this. It's a letter from that lawyer, Koehler. He wrote it to me the day after I saw him. He's the one who told me I could get the money if Miss Lawson went to jail.
L99997 +++$+++ u4166 +++$+++ m278 +++$+++ DULANEY +++$+++ Why'd you come here?
L99996 +++$+++ u4168 +++$+++ m278 +++$+++ JOANNE +++$+++ I'm gonna go to jail. I know they're gonna make it look like I did it. They gotta put it on someone.
L99995 +++$+++ u4168 +++$+++ m278 +++$+++ JOANNE +++$+++ What do you think I've got? A gun? Maybe I'm gonna kill you too. Maybe I'll blow your head off right now.

To see 5 random lines from movie_lines.txt, evaluate the following cell.

linesRaw.takeSample(false, 5).foreach(println)
L216035 +++$+++ u5302 +++$+++ m351 +++$+++ RAMBO +++$+++ Colonel.
L597568 +++$+++ u8300 +++$+++ m564 +++$+++ LOMBARD +++$+++ I don't.
L513032 +++$+++ u7667 +++$+++ m518 +++$+++ LINDA +++$+++ He's no more an Indian than I am though. Anyhow, Doyle's gonna try and tease you and be mean to you to show off to his friends. Just like he does to Frank and me sometimes. You just ignore it. Or stay out here away from 'em if he'll let you. He's an okay guy till he gets drunk but tonight he'll get drunk. I guarantee it.
L35914 +++$+++ u313 +++$+++ m19 +++$+++ JESSE +++$+++ Yesss, but I was thinking, I could come by, and then take Zee out. Some place near. With other folk. Near. Here. But out.
L426481 +++$+++ u2391 +++$+++ m153 +++$+++ COOLEY +++$+++ - and share one of your graves.
import scala.util.{Failure, Success}

/*  fields in movie_lines.txt are:
          - lineID
          - characterID (who uttered this phrase)
          - movieID
          - character name
          - text of the utterance
          */
val regexLine = """\s*(\w+)\s+(\+{3}\$\+{3})\s*(\w+)\s+(\2)\s*(\w+)\s+(\2)\s*(.+)\s+(\2)\s*(.*$)""".r

case class lineInMovie(lineID: String, characterID: String, movieID: String, characterName: String, text: String)

val linesRaw = sc.textFile("dbfs:/datasets/sds/nlp/cornell_movie_dialogs_corpus/movie_lines.txt")
  .map(line => 
          {
            val pLine = regexLine.findFirstMatchIn(line)
                               .map(m => lineInMovie(m.group(1), m.group(3), m.group(5), m.group(7), m.group(9))) 
                                  match {
                                    case Some(l) => Success(l)
                                    case None => Failure(new Exception(s"Non matching input: $line"))
                                  }
              pLine
           }
  )
import scala.util.{Failure, Success}
regexLine: scala.util.matching.Regex = \s*(\w+)\s+(\+{3}\$\+{3})\s*(\w+)\s+(\2)\s*(\w+)\s+(\2)\s*(.+)\s+(\2)\s*(.*$)
defined class lineInMovie
linesRaw: org.apache.spark.rdd.RDD[Product with Serializable with scala.util.Try[lineInMovie]] = MapPartitionsRDD[3737] at map at command-753740454082248:15
linesRaw.filter(x => x.isSuccess).count()
res18: Long = 304713
linesRaw.filter(x => x.isFailure).count()
res19: Long = 0
linesRaw.filter(x => x.isSuccess).take(5).foreach(println)
Success(lineInMovie(L1045,u0,m0,BIANCA,They do not!))
Success(lineInMovie(L1044,u2,m0,CAMERON,They do to!))
Success(lineInMovie(L985,u0,m0,BIANCA,I hope so.))
Success(lineInMovie(L984,u2,m0,CAMERON,She okay?))
Success(lineInMovie(L925,u0,m0,BIANCA,Let's go.))

Let's make a DataFrame out of the successfully parsed lines.

val lines 
    = linesRaw
      .filter(x => x.isSuccess)
      .map { case Success(l) => l }
      .toDF()
      .join(moviesMetaData, "movieID") // and join it to get movie meta data
notebook:4: warning: match may not be exhaustive.
It would fail on the following input: Failure(_)
      .map { case Success(l) => l }
           ^
lines: org.apache.spark.sql.DataFrame = [movieID: string, lineID: string ... 5 more fields]
lines.show(5)
+-------+-------+-----------+-------------+--------------------+-------------+---------+
|movieID| lineID|characterID|characterName|                text|   movieTitle|movieYear|
+-------+-------+-----------+-------------+--------------------+-------------+---------+
|   m203|L593445|      u3102|        HAGEN|You owe the Don a...|the godfather|     1972|
|   m203|L593444|      u3094|     BONASERA|Yes, I understand...|the godfather|     1972|
|   m203|L593443|      u3102|        HAGEN|This is Tom Hagen...|the godfather|     1972|
|   m203|L593425|      u3102|        HAGEN|              Yes.  |the godfather|     1972|
|   m203|L593424|      u3094|     BONASERA|The Don himself i...|the godfather|     1972|
+-------+-------+-----------+-------------+--------------------+-------------+---------+
only showing top 5 rows

Dialogs with Lines

Let's join the two DataFrames on lineID next.

val convLines = conversations.join(lines, "lineID").sort($"conversationID", $"intraConversationID")
convLines: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [lineID: string, conversationID: bigint ... 7 more fields]
convLines.count
res24: Long = 304713
conversations.count
res25: Long = 304713
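
The two counts above being equal is consistent with every lineID in conversations matching exactly one row in lines, so the inner join neither dropped nor duplicated rows. The following local sketch illustrates how an inner join behaves in the two failure modes one would want to rule out; it works on plain Seqs rather than DataFrames, and all names and sample rows are hypothetical.

```scala
object JoinCheck {
  // Inner join of two keyed sequences: one output row per matching pair.
  def innerJoin[A, B](left: Seq[(String, A)], right: Seq[(String, B)]): Seq[(String, A, B)] = {
    val rightByKey = right.groupBy(_._1)
    left.flatMap { case (key, a) =>
      rightByKey.getOrElse(key, Seq.empty).map { case (_, b) => (key, a, b) }
    }
  }

  // (lineID, conversationID) pairs and (lineID, text) pairs.
  val conversations: Seq[(String, Long)] = Seq(("L194", 0L), ("L195", 0L), ("L198", 1L))
  val allMatched: Seq[(String, String)] = Seq(("L194", "a"), ("L195", "b"), ("L198", "c"))
  val oneMissing: Seq[(String, String)] = Seq(("L194", "a"), ("L198", "c"))
  val oneDuplicated: Seq[(String, String)] = allMatched :+ (("L194", "a again"))
}
```

With allMatched the joined size equals the size of conversations; with oneMissing it is smaller, and with oneDuplicated it is larger. Equal counts alone cannot rule out a missing key and a duplicate cancelling each other, so this is a sanity check rather than a proof.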
display(convLines)
lineID | conversationID | intraConversationID | movieID | characterID | characterName | text | movieTitle | movieYear
L194 | 0 | 0 | m0 | u0 | BIANCA | Can we make this quick? Roxanne Korrine and Andrew Barrett are having an incredibly horrendous public break- up on the quad. Again. | 10 things i hate about you | 1999
L195 | 0 | 1 | m0 | u2 | CAMERON | Well, I thought we'd start with pronunciation, if that's okay with you. | 10 things i hate about you | 1999
L196 | 0 | 2 | m0 | u0 | BIANCA | Not the hacking and gagging and spitting part. Please. | 10 things i hate about you | 1999
L197 | 0 | 3 | m0 | u2 | CAMERON | Okay... then how 'bout we try out some French cuisine. Saturday? Night? | 10 things i hate about you | 1999
L198 | 1 | 0 | m0 | u0 | BIANCA | You're asking me out. That's so cute. What's your name again? | 10 things i hate about you | 1999
L199 | 1 | 1 | m0 | u2 | CAMERON | Forget it. | 10 things i hate about you | 1999
L200 | 2 | 0 | m0 | u0 | BIANCA | No, no, it's my fault -- we didn't have a proper introduction --- | 10 things i hate about you | 1999
L201 | 2 | 1 | m0 | u2 | CAMERON | Cameron. | 10 things i hate about you | 1999
L202 | 2 | 2 | m0 | u0 | BIANCA | The thing is, Cameron -- I'm at the mercy of a particularly hideous breed of loser. My sister. I can't date until she does. | 10 things i hate about you | 1999
L203 | 2 | 3 | m0 | u2 | CAMERON | Seems like she could get a date easy enough... | 10 things i hate about you | 1999
L204 | 3 | 0 | m0 | u2 | CAMERON | Why? | 10 things i hate about you | 1999
L205 | 3 | 1 | m0 | u0 | BIANCA | Unsolved mystery. She used to be really popular when she started high school, then it was just like she got sick of it or something. | 10 things i hate about you | 1999
L206 | 3 | 2 | m0 | u2 | CAMERON | That's a shame. | 10 things i hate about you | 1999
L207 | 4 | 0 | m0 | u0 | BIANCA | Gosh, if only we could find Kat a boyfriend... | 10 things i hate about you | 1999
L208 | 4 | 1 | m0 | u2 | CAMERON | Let me see what I can do. | 10 things i hate about you | 1999
L271 | 5 | 0 | m0 | u0 | BIANCA | C'esc ma tete. This is my head | 10 things i hate about you | 1999
L272 | 5 | 1 | m0 | u2 | CAMERON | Right. See? You're ready for the quiz. | 10 things i hate about you | 1999
L273 | 5 | 2 | m0 | u0 | BIANCA | I don't want to know how to say that though. I want to know useful things. Like where the good stores are. How much does champagne cost? Stuff like Chat. I have never in my life had to point out my head to someone. | 10 things i hate about you | 1999
L274 | 5 | 3 | m0 | u2 | CAMERON | That's because it's such a nice one. | 10 things i hate about you | 1999
L275 | 5 | 4 | m0 | u0 | BIANCA | Forget French. | 10 things i hate about you | 1999
L276 | 6 | 0 | m0 | u0 | BIANCA | How is our little Find the Wench A Date plan progressing? | 10 things i hate about you | 1999
L277 | 6 | 1 | m0 | u2 | CAMERON | Well, there's someone I think might be -- | 10 things i hate about you | 1999
L280 | 7 | 0 | m0 | u2 | CAMERON | There. | 10 things i hate about you | 1999
L281 | 7 | 1 | m0 | u0 | BIANCA | Where? | 10 things i hate about you | 1999
L363 | 8 | 0 | m0 | u2 | CAMERON | You got something on your mind? | 10 things i hate about you | 1999
L364 | 8 | 1 | m0 | u0 | BIANCA | I counted on you to help my cause. You and that thug are obviously failing. Aren't we ever going on our date? | 10 things i hate about you | 1999
L365 | 9 | 0 | m0 | u2 | CAMERON | You have my word. As a gentleman | 10 things i hate about you | 1999
L366 | 9 | 1 | m0 | u0 | BIANCA | You're sweet. | 10 things i hate about you | 1999
L367 | 10 | 0 | m0 | u2 | CAMERON | How do you get your hair to look like that? | 10 things i hate about you | 1999
L368 | 10 | 1 | m0 | u0 | BIANCA | Eber's Deep Conditioner every two days. And I never, ever use a blowdryer without the diffuser attachment. | 10 things i hate about you | 1999
L401 | 11 | 0 | m0 | u2 | CAMERON | Sure have. | 10 things i hate about you | 1999
L402 | 11 | 1 | m0 | u0 | BIANCA | I really, really, really wanna go, but I can't. Not unless my sister goes. | 10 things i hate about you | 1999
L403 | 11 | 2 | m0 | u2 | CAMERON | I'm workin' on it. But she doesn't seem to be goin' for him. | 10 things i hate about you | 1999
L404 | 12 | 0 | m0 | u2 | CAMERON | She's not a... | 10 things i hate about you | 1999
L405 | 12 | 1 | m0 | u0 | BIANCA | Lesbian? No. I found a picture of Jared Leto in one of her drawers, so I'm pretty sure she's not harboring same-sex tendencies. | 10 things i hate about you | 1999
L406 | 12 | 2 | m0 | u2 | CAMERON | So that's the kind of guy she likes? Pretty ones? | 10 things i hate about you | 1999
L407 | 12 | 3 | m0 | u0 | BIANCA | Who knows? All I've ever heard her say is that she'd dip before dating a guy that smokes. | 10 things i hate about you | 1999
L575 | 13 | 0 | m0 | u0 | BIANCA | Hi. | 10 things i hate about you | 1999
L576 | 13 | 1 | m0 | u2 | CAMERON | Looks like things worked out tonight, huh? | 10 things i hate about you | 1999
L577 | 14 | 0 | m0 | u0 | BIANCA | You know Chastity? | 10 things i hate about you | 1999
L578 | 14 | 1 | m0 | u2 | CAMERON | I believe we share an art instructor | 10 things i hate about you | 1999
L662 | 15 | 0 | m0 | u2 | CAMERON | Have fun tonight? | 10 things i hate about you | 1999
L663 | 15 | 1 | m0 | u0 | BIANCA | Tons | 10 things i hate about you | 1999
L693 | 16 | 0 | m0 | u2 | CAMERON | I looked for you back at the party, but you always seemed to be "occupied". | 10 things i hate about you | 1999
L694 | 16 | 1 | m0 | u0 | BIANCA | I was? | 10 things i hate about you | 1999
L695 | 16 | 2 | m0 | u2 | CAMERON | You never wanted to go out with 'me, did you? | 10 things i hate about you | 1999
L696 | 17 | 0 | m0 | u0 | BIANCA | Well, no... | 10 things i hate about you | 1999
L697 | 17 | 1 | m0 | u2 | CAMERON | Then that's all you had to say. | 10 things i hate about you | 1999
L698 | 17 | 2 | m0 | u0 | BIANCA | But | 10 things i hate about you | 1999
L699 | 17 | 3 | m0 | u2 | CAMERON | You always been this selfish? | 10 things i hate about you | 1999
L860 | 18 | 0 | m0 | u0 | BIANCA | Then Guillermo says, "If you go any lighter, you're gonna look like an extra on 90210." | 10 things i hate about you | 1999
L861 | 18 | 1 | m0 | u2 | CAMERON | No... | 10 things i hate about you | 1999
L862 | 19 | 0 | m0 | u0 | BIANCA | do you listen to this crap? | 10 things i hate about you | 1999
L863 | 19 | 1 | m0 | u2 | CAMERON | What crap? | 10 things i hate about you | 1999
L864 | 19 | 2 | m0 | u0 | BIANCA | Me. This endless ...blonde babble. I'm like, boring myself. | 10 things i hate about you | 1999
L865 | 19 | 3 | m0 | u2 | CAMERON | Thank God! If I had to hear one more story about your coiffure... | 10 things i hate about you | 1999
L866 | 20 | 0 | m0 | u2 | CAMERON | I figured you'd get to the good stuff eventually. | 10 things i hate about you | 1999
L867 | 20 | 1 | m0 | u0 | BIANCA | What good stuff? | 10 things i hate about you | 1999
L868 | 20 | 2 | m0 | u2 | CAMERON | The "real you". | 10 things i hate about you | 1999
L869 | 20 | 3 | m0 | u0 | BIANCA | Like my fear of wearing pastels? | 10 things i hate about you | 1999
L870 | 21 | 0 | m0 | u0 | BIANCA | I'm kidding. You know how sometimes you just become this "persona"? And you don't know how to quit? | 10 things i hate about you | 1999
L871 | 21 | 1 | m0 | u2 | CAMERON | No | 10 things i hate about you | 1999
L872 | 21 | 2 | m0 | u0 | BIANCA | Okay -- you're gonna need to learn how to lie. | 10 things i hate about you | 1999
L924 | 22 | 0 | m0 | u2 | CAMERON | Wow | 10 things i hate about you | 1999
L925 | 22 | 1 | m0 | u0 | BIANCA | Let's go. | 10 things i hate about you | 1999
L984 | 23 | 0 | m0 | u2 | CAMERON | She okay? | 10 things i hate about you | 1999
L985 | 23 | 1 | m0 | u0 | BIANCA | I hope so. | 10 things i hate about you | 1999
L1044 | 24 | 0 | m0 | u2 | CAMERON | They do to! | 10 things i hate about you | 1999
L1045 | 24 | 1 | m0 | u0 | BIANCA | They do not! | 10 things i hate about you | 1999
L49 | 25 | 0 | m0 | u0 | BIANCA | Did you change your hair? | 10 things i hate about you | 1999
L50 | 25 | 1 | m0 | u3 | CHASTITY | No. | 10 things i hate about you | 1999
L51 | 25 | 2 | m0 | u0 | BIANCA | You might wanna think about it | 10 things i hate about you | 1999
L571 | 26 | 0 | m0 | u0 | BIANCA | Where did he go? He was just here. | 10 things i hate about you | 1999
L572 | 26 | 1 | m0 | u3 | CHASTITY | Who? | 10 things i hate about you | 1999
L573 | 26 | 2 | m0 | u0 | BIANCA | Joey. | 10 things i hate about you | 1999
L579 | 27 | 0 | m0 | u3 | CHASTITY | Great | 10 things i hate about you | 1999
L580 | 27 | 1 | m0 | u0 | BIANCA | Would you mind getting me a drink, Cameron? | 10 things i hate about you | 1999
L595 | 28 | 0 | m0 | u0 | BIANCA | He practically proposed when he found out we had the same dermatologist. I mean. Dr. Bonchowski is great an all, but he's not exactly relevant party conversation. | 10 things i hate about you | 1999
L596 | 28 | 1 | m0 | u3 | CHASTITY | Is he oily or dry? | 10 things i hate about you | 1999
L597 | 28 | 2 | m0 | u0 | BIANCA | Combination. I don't know -- I thought he'd be different. More of a gentleman... | 10 things i hate about you | 1999
L598 | 29 | 0 | m0 | u3 | CHASTITY | Bianca, I don't think the highlights of dating Joey Dorsey are going to include door-opening and coat-holding. | 10 things i hate about you | 1999
L599 | 29 | 1 | m0 | u0 | BIANCA | Sometimes I wonder if the guys we're supposed to want to go out with are the ones we actually want to go out with, you know? | 10 things i hate about you | 1999
L600 | 29 | 2 | m0 | u3 | CHASTITY | All I know is -- I'd give up my private line to go out with a guy like Joey. | 10 things i hate about you | 1999
L659 | 30 | 0 | m0 | u0 | BIANCA | I have to be home in twenty minutes. | 10 things i hate about you | 1999
L660 | 30 | 1 | m0 | u3 | CHASTITY | I don't have to be home 'til two. | 10 things i hate about you | 1999
L952 | 31 | 0 | m0 | u3 | CHASTITY | You think you ' re the only sophomore at the prom? | 10 things i hate about you | 1999
L953 | 31 | 1 | m0 | u0 | BIANCA | I did. | 10 things i hate about you | 1999
L394 | 32 | 0 | m0 | u4 | JOEY | It's more | 10 things i hate about you | 1999
L395 | 32 | 1 | m0 | u0 | BIANCA | Expensive? | 10 things i hate about you | 1999
L396 | 33 | 0 | m0 | u4 | JOEY | Exactly So, you going to Bogey Lowenbrau's thing on Saturday? | 10 things i hate about you | 1999
L397 | 33 | 1 | m0 | u0 | BIANCA | Hopefully. | 10 things i hate about you | 1999
L589 | 34 | 0 | m0 | u4 | JOEY | So yeah, I've got the Sears catalog thing going -- and the tube sock gig " that's gonna be huge. And then I'm up for an ad for Queen Harry next week. | 10 things i hate about you | 1999
L590 | 34 | 1 | m0 | u0 | BIANCA | Queen Harry? | 10 things i hate about you | 1999
L591 | 34 | 2 | m0 | u4 | JOEY | It's a gay cruise line, but I'll be, like, wearing a uniform and stuff. | 10 things i hate about you | 1999
L592 | 35 | 0 | m0 | u0 | BIANCA | Neat... | 10 things i hate about you | 1999
L593 | 35 | 1 | m0 | u4 | JOEY | My agent says I've got a good shot at being the Prada guy next year. | 10 things i hate about you | 1999
L756 | 36 | 0 | m0 | u4 | JOEY | Hey, sweet cheeks. | 10 things i hate about you | 1999
L757 | 36 | 1 | m0 | u0 | BIANCA | Hi, Joey. | 10 things i hate about you | 1999
L758 | 36 | 2 | m0 | u4 | JOEY | You're concentrating awfully hard considering it's gym class. | 10 things i hate about you | 1999
L759 | 37 | 0 | m0 | u4 | JOEY | Listen, I want to talk to you about the prom. | 10 things i hate about you | 1999

Showing the first 1000 rows.

Let's amalgamate the texts uttered in the same conversation into a single document.

By doing this we lose all the information in the order of utterance.

But this is fine, as we are going to do LDA with just the first-order (bag-of-words) information: which words are uttered in each conversation, by anyone involved in the dialogue.
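To see why discarding utterance order does not matter for a bag-of-words model, here is a minimal plain-Scala sketch (no Spark needed). The object `BagOfWordsSketch` and the helper `wordCounts` are hypothetical names introduced only for illustration; the two utterances are those of conversation 28842 in the corpus.

```scala
// Illustrative sketch only: BagOfWordsSketch and wordCounts are hypothetical helpers,
// not part of this notebook's pipeline or of Spark.
object BagOfWordsSketch {
  // Count how often each lower-cased word occurs across all utterances.
  def wordCounts(utterances: Seq[String]): Map[String, Int] =
    utterances
      .flatMap(_.toLowerCase.split("[\\W_]+").toSeq) // split on non-word characters
      .filter(_.nonEmpty)                            // drop empty fragments
      .groupBy(identity)
      .map { case (word, occurrences) => word -> occurrences.size }

  val conversation: Seq[String] =
    Seq("What is it?", "Is it all right if I go to the bathroom?")

  // Reversing the order of the utterances leaves the word counts unchanged.
  val sameCounts: Boolean =
    wordCounts(conversation) == wordCounts(conversation.reverse)
}
```

Here `BagOfWordsSketch.sameCounts` is `true`: the counts depend only on which words occur and how often, which is all that the token-count vectors fed to LDA retain.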

import org.apache.spark.sql.functions.{collect_list, udf, lit, concat_ws}

val corpusDF = convLines.groupBy($"conversationID",$"movieID")
  .agg(concat_ws(" :-()-: ",collect_list($"text")).alias("corpus"))
  .join(moviesMetaData, "movieID") // and join it to get movie meta data
  .select($"conversationID".as("id"),$"corpus",$"movieTitle",$"movieYear")
  .cache()
import org.apache.spark.sql.functions.{collect_list, udf, lit, concat_ws}
corpusDF: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [id: bigint, corpus: string ... 2 more fields]
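As a rough in-memory analogue of the `groupBy` / `collect_list` / `concat_ws` aggregation above, the following plain-Scala sketch builds one document per conversation. The object `AmalgamateSketch`, the case class `Line` and the tiny `lines` sample are assumptions made for illustration (the real `convLines` DataFrame has more columns); the utterances are copied from conversations 28762 and 28840.

```scala
// Illustrative sketch only: a plain-collections analogue of the Spark aggregation.
// AmalgamateSketch, Line and lines are hypothetical, not part of the notebook.
object AmalgamateSketch {
  // Hypothetical miniature of a convLines row: just the conversation id and the text.
  final case class Line(conversationID: Long, text: String)

  val lines: Seq[Line] = Seq(
    Line(28762L, "Your wife and children...you're happy with them?"),
    Line(28762L, "Yes."),
    Line(28762L, "Good."),
    Line(28840L, "We're going to New Jersey?"),
    Line(28840L, "Maybe.")
  )

  // The same separator string that is passed to concat_ws in the notebook.
  val separator = " :-()-: "

  // groupBy plays the role of DataFrame.groupBy, map(_.text) of collect_list,
  // and mkString of concat_ws.
  val corpus: Map[Long, String] =
    lines
      .groupBy(_.conversationID)
      .map { case (id, ls) => id -> ls.map(_.text).mkString(separator) }
}
```

In this sketch the utterances stay in their input order within each conversation. Spark's `collect_list`, by contrast, makes no ordering guarantee once the data has been shuffled, which is one more reason not to rely on utterance order in the `corpus` column.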
corpusDF.count()
res28: Long = 83097
corpusDF.take(5).foreach(println)
[28762,Your wife and children...you're happy with them? :-()-: Yes. :-()-: Good.,the godfather,1972]
[28815,Michael? :-()-: I'm thinking about it. :-()-: Michael... :-()-: No, I would not like you better if you were Ingrid Bergman.,the godfather,1972]
[28842,What is it? :-()-: Is it all right if I go to the bathroom?,the godfather,1972]
[28766,Things went badly in Palermo? :-()-: The younger men have no respect. Things are changing; I don't know what will happen. Michael, because of the wedding, people now know your name. :-()-: Is that why there are more men on the walls? :-()-: Even so, I don't think it is safe here anymore. I've made plans to move you to a villa near Siracuse. You must go right away. :-()-: What is it? :-()-: Bad news from America. Your brother, Santino. He has been killed.,the godfather,1972]
[28835,We can't wait. No matter what Sollozzo say about a deal, he's figuring out how to kill Pop. You have to get Sollozzo now. :-()-: The kid's right.,the godfather,1972]
display(corpusDF)
28762Your wife and children...you're happy with them? :-()-: Yes. :-()-: Good.the godfather1972
28815Michael? :-()-: I'm thinking about it. :-()-: Michael... :-()-: No, I would not like you better if you were Ingrid Bergman.the godfather1972
28842What is it? :-()-: Is it all right if I go to the bathroom?the godfather1972
28766Things went badly in Palermo? :-()-: The younger men have no respect. Things are changing; I don't know what will happen. Michael, because of the wedding, people now know your name. :-()-: Is that why there are more men on the walls? :-()-: Even so, I don't think it is safe here anymore. I've made plans to move you to a villa near Siracuse. You must go right away. :-()-: What is it? :-()-: Bad news from America. Your brother, Santino. He has been killed.the godfather1972
28835We can't wait. No matter what Sollozzo say about a deal, he's figuring out how to kill Pop. You have to get Sollozzo now. :-()-: The kid's right.the godfather1972
28717Jesus, Connie...Sure, Mike... :-()-: Go back to your house and wait for me...the godfather1972
28735I tol' you to stay put, Paulie... :-()-: The guy at the gate's outside...says there's a package...the godfather1972
28752What is this nonsense? :-()-: It's from Johnny. It was announced this morning. He's going to play the lead in the new Woltz Brothers film.the godfather1972
28852You straightened my brother out? :-()-: Hell, he was banging cocktail waitresses two at a time. Players couldn't get a drink.the godfather1972
28782Tom, you're the Consigliere, what do we do if the old man dies? :-()-: Without your father's political contacts and personal influence, the Corleone family loses half its strength. Without your father, the other New York families might wind up supporting Sollozzo, and the Tattaglias just to make sure there isn't a long destructive war. The old days are over, this is 1946; nobody wants bloodshed anymore. If your father dies...make the deal, Sonny. :-()-: That's easy to say; it's not your father. :-()-: I was as good a son to him as you or Mike. :-()-: Oh Christ Tom, I didn't mean it that way. :-()-: We're all tired... :-()-: OK, we sit tight until the old man can give us the lead. But Tom, I want you to stay inside the Mall. You too, Mike, no chances. Tessio, you hold your people in reserve, but have them nosing around the city. The hospital is yours; I want it tight, fool-proof, 24 hours a day.the godfather1972
28840We're going to New Jersey? :-()-: Maybe.the godfather1972
28729Sollozzo knows Mike's a civilian. :-()-: OK, but be careful.the godfather1972
28730I want somebody very good, very safe to plant that gun. I don't want my brother coming out of that toilet with just his dick in his hand. :-()-: The gun will be there. :-()-: You're on, kid...I'll square it with Mom your not seeing her before you left. And I'll get a message to your girl friend when I think the time is right. :-()-: We gotta move...the godfather1972
28774Ever seen anything like that before? :-()-: No.the godfather1972
28724The food is on the table. :-()-: I'm not hungry yet. :-()-: Eat it, it's on the table. :-()-: Ba Fa Goulle. :-()-: BA FA GOULE YOU!the godfather1972
28839I'm glad you came, Mike. I hope we can straighten everything out. All this is terrible, it's not the way I wanted things to happen at all. It should never have happened. :-()-: I want to settle things tonight. I want my father left alone. :-()-: He won't be; I swear to you be my children he won't be. Just keep an open mind when we talk. I hope you're not a hothead like your brother, Sonny. It's impossible to talk business with him.the godfather1972
28827You cannot stay here...I'm sorry. :-()-: You and I are going to move my father right now...to another room on another floor...Can you disconnect those tubes so we can wheel the bed out? :-()-: Absolutely not! We have to get permission from the Doctor. :-()-: You've read about my father in the papers. You've seen that no one's here to guard him. Now I've just gotten word that men are coming to this hospital to kill him. Believe me and help me. :-()-: We don't have to disconnect them, we can wheel the stand with the bed.the godfather1972
28721What's the matter, Carlo? :-()-: Shut up.the godfather1972
28820When will I see you again? :-()-: Goodbye.the godfather1972
28843Do you pledge to guide and protect this child if he is left fatherless? Do you promise to shield him against the wickedness of the world? :-()-: Yes, I promise.the godfather1972
28816Hello. Kay? :-()-: How is your father? :-()-: He'll be OK. :-()-: I love you.the godfather1972
28751When does my daughter leave with her bridegroom? :-()-: They'll cut the cake in a few minutes...leave right after that. Your new son-in-law, do we give him something important? :-()-: No, give him a living. But never let him know the family's business. What else, Tom? :-()-: I've called the hospital; they've notified Consiglere Genco's family to come and wait. He won't last out the night.the godfather1972
28728You take care of Paulie? :-()-: You won't see Paulie anymore. He's sick for good this winter.the godfather1972
28744All right, Hollywood...Now tell me about this Hollywood Pezzonovanta who won't let you work. :-()-: He owns the studio. Just a month ago he bought the movie rights to this book, a best seller. And the main character is a guy just like me. I wouldn't even have to act, just be myself.the godfather1972
28780Is the hospital covered? :-()-: The cops have it locked in and I got my people there visiting Pop all the time. What about the hit list.the godfather1972
28781What about Luca? Sollozzo didn't seem worried about Luca. That worries me. :-()-: If Luca sold out we're in real trouble. :-()-: Has anyone been able to get in touch with him? :-()-: No, and I've been calling all night. Maybe he's shacked up. :-()-: Luca never sleeps over with a broad. He always goes home when he's through. Mike, keep ringing Luca's number.the godfather1972
28785I found out about this Captain McCluskey who broke Mike's jaw. He's definitely on Sollozzo's payroll, and for big money. McCluskey's agreed to be the Turk's bodyguard. What you have to understand is that while Sollozzo is guarded like this, he's invulnerable. Nobody has ever gunned down a New York Police Captain. Never. It would be disastrous. All the five families would come after you Sonny; the Corleone family would be outcasts; even the old man's political protection would run for cover. So just...take that into consideration. :-()-: McCluskey can't stay with the Turk forever. We'll wait.the godfather1972
28761Barzini will move against you first. :-()-: How? :-()-: He will get in touch with you through someone you absolutely trust. That person will arrange a meeting, guarantee your safety...the godfather1972
28738Good for ten men... :-()-: OK, go to Arthur Avenue; I'm suppose to call when I found somethin'.the godfather1972
28803I've never seen anything like it. :-()-: I told you I had a lot of relatives.the godfather1972
28836Go on Mike. :-()-: They want me to go to the conference with Sollozzo. Set up the meeting for two days from now. Sonny, get our informers to find out where the meeting will be held. Insist it has to be a public place: a bar or restaurant at the height of the dinner hour. So I'll feel safe. They'll check me when I meet them so I won't be able to carry a weapon; but Clemenza, figure out a way to have one planted there for me. Then I'll kill them both.the godfather1972
28720Don't be frightened. Do you think I'd make my sister a widow? Do you think I'd make your children fatherless? After all, I'm Godfather to your son. No, your punishment is that you're out of the family business. I'm putting you on a plane to Vegas--and I want you to stay there. I'll send Connie an allowance, that's all. But don't keep saying you're innocent; it insults my intelligence and makes me angry. Who approached you, Tattaglia or Barzini? :-()-: Barzini. :-()-: Good, good. Leave now; there's a car waiting to take you to the airport.the godfather1972
28745You take care of your family? :-()-: Sure.the godfather1972
28753My wife was weeping before she fell asleep, outside my window I saw my caporegimes to the house, and it is midnight. So, Consigliere of mine, I think you should tell your Don what everyone knows. :-()-: I didn't tell Mama anything. I was about to come up and wake you and tell you. Just now. :-()-: But you needed a drink first. :-()-: Yes. :-()-: Now you've had your drink.the godfather1972
28767You tell us about America. :-()-: How do you know I come from America? :-()-: We hear. We were told you were a Pezzonovanta...big shot. :-()-: Only the son of a Pezzonovanta. :-()-: Hey America! Is she as rich as they say? :-()-: Yes. :-()-: Take me to America! You need a good lupara in America? You take me, I'll be the best man you got. "Oh say, can you seeee...By da star early light..."the godfather1972
28797Hello Kay. Your father's inside, doing some business. He's been asking for you. :-()-: Thanks Tom.the godfather1972
28814Would you like me better if I were a nun? :-()-: No. :-()-: Would you like me better if I were Ingrid Bergman?the godfather1972
28817I LOVE YOU. :-()-: Yeah Kay, I'm here. :-()-: Can you say it? :-()-: Huh? :-()-: Tell me you love me.the godfather1972
28845Do you wish to be baptized? :-()-: I do wish to be baptized.the godfather1972
28794Sonny was hot for my deal, right? You know it's the smart thing to do, too. I want you to talk Sonny into it. :-()-: Sonny will come after you with everything he's got.the godfather1972
28758Have you thought about a wife? A family? :-()-: No. :-()-: I understand, Michael. But you must make a family, you know. :-()-: I want children, I want a family. But I don't know when. :-()-: Accept what's happened, Michael. :-()-: I could accept everything that's happened; I could accept it, but that I never had a choice. From the time I was born, you had laid this all out for me. :-()-: No, I wanted other things for you. :-()-: You wanted me to be your son. :-()-: Yes, but sons who would be professors, scientists, musicians...and grandchildren who could be, who knows, a Governor, a President even, nothing's impossible here in America. :-()-: Then why have I become a man like you? :-()-: You are like me, we refuse to be fools, to be puppets dancing on a string pulled by other men. I hoped the time for guns and killing and massacres was over. That was my misfortune. That was your misfortune. I was hunted on the streets of Corleone when I was twelve years old because of who my fat...the godfather1972
28809If he's your brother, why does he have a different name? :-()-: My brother Sonny found him living in the streets when he was a kid, so my father took him in. He's a good lawyer.the godfather1972
28807I never know when you're telling me the truth. :-()-: I told you you wouldn't like him. :-()-: He's coming over here!the godfather1972
28719You fingered Sonny for the Barzini people. That little farce you played out with my sister. Did Barzini kid you that would fool a Corleone? :-()-: I swear I'm innocent. I swear on the head of my children, I'm innocent. Mike, don't do this to me, please Mike, don't do this to me! :-()-: Barzini is dead. So is Philip Tattaglia, so are Strachi, Cuneo and Moe Greene...I want to square all the family accounts tonight. So don't tell me you're innocent; admit what you did.the godfather1972
28798Sure. Anything I can do for you. :-()-: No. I guess I'll see you Christmas. Everyone's going to be out at Long Beach, right? :-()-: Right.the godfather1972
28792Will you give this to him. :-()-: If I accept that letter and you told a Court of Law I accepted it, they would interpret it as my having knowledge of his whereabouts. Just wait Kay, he'll contact you.the godfather1972
28822Your sister wants to ask you something. :-()-: Let HER ask.the godfather1972
28784Was there a definite proposal? :-()-: Sure, he wants us to send Mike to meet him to hear his proposition. The promise is the deal will be so good we can't refuse. :-()-: What about that Tattaglias? What will they do about Bruno? :-()-: Part of the deal: Bruno cancels out what they did to my father. :-()-: We should hear what they have to say. :-()-: No, no Consiglere. Not this time. No more meetings, no more discussions, no more Sollozzo tricks. Give them one message: I WANT SOLLOZZO. If not, it's all out war. We go to the mattresses and we put all the button men out on the street. :-()-: The other families won't sit still for all out war. :-()-: Then THEY hand me Sollozzo. :-()-: Come ON Sonny, your father wouldn't want to hear this. This is not a personal thing, this is Business. :-()-: And when they shot me father... :-()-: Yes, even the shooting of your father was business, not personal... :-()-: No no, no more advice on how to patch it up Tom. You just help me win. Underst...the godfather1972
28846Barzini's people chisel my territory and we do nothing about it. Pretty soon there won't be one place in Brooklyn I can hang my hat. :-()-: Just be patient. :-()-: I'm not asking you for help, Mike. Just take off the handcuffs. :-()-: Be patient.the godfather1972
28786One of Tattaglia's people? :-()-: No. Our informer in McCluskey's precinct. Tonight at 8:00 he signed out for Louis' Restaurant in the Bronx. Anyone know it.the godfather1972
28754When I meet with Tattaglia's people; should I insist that all his drug middle-men be clean? :-()-: Mention it, don't insist. Barzini is a man who will know that without being told. :-()-: You mean Tattaglia. :-()-: Barzini. :-()-: He was the one behind Sollozzo? :-()-: Tattaglia is a pimp. He could never have outfought Santino. But I wasn't sure until this day. No, it was Barzini all along.the godfather1972
28760I see you have your Luca Brasi. :-()-: I'll need him. :-()-: There are men in this world who demand to be killed. They argue in gambling games; they jump out of their cars in a rage if someone so much as scratches their fender. These people wander through the streets calling out "Kill me, kill me." Luca Brasi was like that. And since he wasn't scared of death, and in fact, looked for it...I made him my weapon. Because I was the only person in the world that he truly hoped would not kill him. I think you have done the same with this man.the godfather1972
28802Christ, Tom; I needed more time with him. I really needed him. :-()-: Did he give you his politicians? :-()-: Not all...I needed another four months and I would have had them all. I guess you've figured it all out? :-()-: How will they come at you? :-()-: I know now. I'll make them call me Don. :-()-: Have you agreed on a meeting? :-()-: A week from tonight. In Brooklyn on Tessio's ground, where I'll be safe.the godfather1972
28771It's the real Thunderbolt, then. :-()-: Come Sunday morning: My name is Vitelli and my house is up there on the hill, above the village.the godfather1972
28829All right, Mikey...who do we have to hit, Clemenza or Paulie? :-()-: What? :-()-: One of them fingered the old man.the godfather1972
28795The Don was slipping; in the old days I could never have gotten to him. Now he's dead, nothing can bring him back. Talk to Sonny, talk to the Caporegimes, Clemenza and Tessio...it's good business. :-()-: Even Sonny won't be able to call off Luca Brasi. :-()-: I'll worry about Luca. You take care of Sonny and the other two kids. :-()-: I'll try...It's what the Don would want us to do. :-()-: Good...then you can go... I don't like violence. I'm a businessman, and blood is a big expense.the godfather1972
28715What do you wish me to do? :-()-: I want you to use all your powers, all your skill, as you love me. I do not want his mother to see him as he is.the godfather1972
28757What was this for? :-()-: For bravery. :-()-: And this? :-()-: For killing a man. :-()-: What miracles you do for strangers. :-()-: I fought for my country. It was my choice. :-()-: And now, what do you choose to do? :-()-: I'm going to finish school. :-()-: Good. When you are finished, come and talk to me. I have hopes for you.the godfather1972
28741My compliments. I'll take care of them from my share. :-()-: So. I receive 30 per cent just for finance and legal protection. No worries about operations, is that what you tell me? :-()-: If you think two million dollars in cash is just finance, I congratulate you Don Corleone.the godfather1972
28851The Corleone family wants to buy me out. I buy you out. You don't buy me out. :-()-: Your casino loses money. Maybe we can do better. :-()-: You think I scam? :-()-: You're unlucky. :-()-: You goddamn dagos. I do you a favor and take Freddie in when you're having a bad time, and then you try to push me out. :-()-: You took Freddie in because the Corleone family bankrolled your casino. You and the Corleone family are evened out. This is for business; name your price. :-()-: The Corleone family don't have that kind of muscle anymore. The Godfather is sick. You're getting chased out of New York by Barzini and the other families, and you think you can find easier pickings here. I've talked to Barzini; I can make a deal with him and keep my hotel! :-()-: Is that why you thought you could slap Freddie around in public?the godfather1972
28844Do you renounce Satan. :-()-: I do renounce him. :-()-: And all his works? :-()-: I do renounce them.the godfather1972
28825Michael, it's not true. Please tell me. :-()-: Don't ask me. :-()-: Tell me! :-()-: All right, this one time I'll let you ask about my affairs, one last time. :-()-: Is it true?the godfather1972
28841Most important...I want a sure guarantee that no more attempts will be made on my father's life. :-()-: What guarantees can I give you? I am the hunted one. I've missed my chance. You think too highly of me, my friend...I am not so clever...all I want if a truce...the godfather1972
28789We'll let the old man take it easy for a couple of weeks. I want to get things going good before he gets better. What's the matter with you? :-()-: You start operating, the five families will start their raids again. We're at a stalemate Sonny, your war is costing us a lot of money. :-()-: No more stalemate Tom, we got the soldiers, we'll match them gun for gun if that's how they want it. They know me for what I am, Tom-- and they're scared of me. :-()-: Yes. That's true, you're getting a hell of a reputation. :-()-: Well it's war! We might not be in this shape if we had a real war- time Consiglere, a Sicilian. Pop had Genco, who do I have? Hey Tom, hey...hey. It's Sunday, we're gonna have dinner. Don't be sore.the godfather1972
28812What will your father say? :-()-: As long as I tell him beforehand he won't object. He'll be hurt, but he won't object. :-()-: What time do they expect us? :-()-: For dinner. Unless I call and tell them we're still in New Hampshire. :-()-: Michael. :-()-: Then we can have dinner, see a show, and spend one more night.the godfather1972
28793Will you give this letter to Michael. :-()-: Mama, no.the godfather1972
28726You heard about your father? :-()-: Yeah. :-()-: The word is out in the streets that he's dead. :-()-: Where the hell was Paulie, why wasn't he with the Don? :-()-: Paulie's been a little sick all winter...he was home. :-()-: How many times did he stay home the last couple of months? :-()-: Maybe three, four times. I always asked Freddie if he wanted another bodyguard, but he said no. Things have been so smooth the last ten years... :-()-: Go get Paulie, I don't care how sick he is. Pick him up yourself, and bring him to my father's house. :-()-: That's all? Don't you want me to send some people over here? :-()-: No, just you and Paulie.the godfather1972
28788Pop, they hit us and we hit them back. :-()-: We put out a lot of material through our contacts in the Newspapers...about McCluskey's being tied up with Sollozzo in the Drug Rackets...things are starting to loosen up.the godfather1972
28716Understood. I just wish I was doing more to help out. :-()-: I'll come to you when I need you.the godfather1972
28783Maybe Mike shouldn't get mixed up in this so directly. You know the old man doesn't want that. :-()-: OK forget it, just stay on the phone.the godfather1972
28736Outside. :-()-: Sure.the godfather1972
28773You look wonderful, kid; really wonderful. That doctor did some job on your face. :-()-: You look good, too.the godfather1972
28776Who are those girls? :-()-: That's for you to find out. :-()-: Give them some money and send them home. :-()-: Mike! :-()-: Get rid of them...the godfather1972
28810I didn't know your family knew Johnny Fontane. :-()-: Sure. :-()-: I used to come down to New York whenever he sang at the Capitol and scream my head off. :-()-: He's my father's godson; he owes him his whole career.the godfather1972
28804Michael, what are those men doing? :-()-: They're waiting to see my father. :-()-: They're talking to themselves. :-()-: They're going to talk to my father, which means they're going to ask him for something, which means they better get it right. :-()-: Why do they bother him on a day like this? :-()-: Because they know that no Sicilian will refuse a request on his daughter's wedding day.the godfather1972
28787Jesus, I don't know... :-()-: Can you do it Mike?the godfather1972
28833Sonny...Sonny--Jesus Christ, I'm down at the hospital. I came down late. There's no one here. None of Tessio's people--no detectives, no one. The old man is completely unprotected. :-()-: All right, get him in a different room; lock the door from the inside. I'll have some men there inside of fifteen minutes. Sit tight, and don't panic. :-()-: I won't panic.the godfather1972
28747Is it necessary? :-()-: You understand him better than anyone.the godfather1972
28849Mike, good to see you. Got everything you want? :-()-: Thanks. :-()-: The chef cooked for you special; the dancers will kick your tongue out and you credit is good! Draw chips for all these people so they can play on the house. :-()-: Is my credit good enough to buy you out?the godfather1972
28834Mikey, you look beautiful! :-()-: Cut it out. :-()-: The Turk wants to talk! The nerve of that son of a bitch! After he craps out last night he wants a meet.the godfather1972
28714Be my friend. :-()-: Good. From me you'll get Justice. :-()-: Godfather. :-()-: Some day, and that day may never come, I would like to call upon you to do me a service in return.the godfather1972
28768Hey, beautiful girls! :-()-: Shhhhh.the godfather1972
28821I have to see my father and his people when we get back to the Mall. :-()-: Oh Michael. :-()-: We'll go to the show tomorrow night--we can change the tickets. :-()-: Don't you want dinner first? :-()-: No, you eat...don't wait up for me. :-()-: Wake me up when you come to bed?the godfather1972
28718Godfather! :-()-: You have to answer for Santino.the godfather1972
28750It is Johnny. He came all the way from California to be at the wedding. :-()-: Should I bring him in. :-()-: No. Let the people enjoy him. You see? He is a good godson. :-()-: It's been two years. He's probably in trouble again.the godfather1972
28847Let us fill up our Regimes. :-()-: No. I want things very calm for another six months. :-()-: Forgive me, Godfather, let our years of friendship be my excuse. How can you hope for success there without your strength here to back you up? The two go hand in hand. And with you gone from here the Barzini and the Tattaglias will be too strong for us.the godfather1972
28791What was that? :-()-: An accident. No one was hurt. :-()-: Listen Tom, I let my cab go; can I come in to call another one?the godfather1972
28763...a fine boy from Sicily, captured by the American Army, and sent to New Jersey as a prisoner of war... :-()-: Nazorine, my friend, tell me what I can do. :-()-: Now that the war is over, Enzo, this boy is being repatriated to Italy. And you see, Godfather... He...my daughter...they... :-()-: You want him to stay in this country. :-()-: Godfather, you understand everything. :-()-: Tom, what we need is an Act of Congress to allow Enzo to become a citizen. :-()-: An Act of Congress!the godfather1972
28826I'm Michael Corleone--this is my father. What happened to the detectives who were guarding him? :-()-: Oh your father just had too many visitors. It interfered with the hospital service. The police came and made them all leave just ten minutes ago. But don't worry. I look in on him. :-()-: You just stand here one minute...the godfather1972
28853I have to go back to New York tomorrow. Think of your price. :-()-: You son of a bitch, you think you can brush me off like that? I made my bones when you were going out with cheerleaders.the godfather1972
28831Is it going to be all-out war, like last time? :-()-: Until the old man tells me different. :-()-: Then wait, Sonny. Talk to Pop. :-()-: Sollozzo is a dead man, I don't care what it costs. I don't care if we have to fight all the five families in New York. The Tattaglia family's going to eat dirt. I don't care if we all go down together. :-()-: That's not how Pop would have played it. :-()-: I know I'm not the man he was. But I'll tell you this and he'll tell you too. When it comes to real action, I can operate as good as anybody short range. :-()-: All right, Sonny. All right. :-()-: Christ, if I could only contact Luca. :-()-: Is it like they say? Is he that good?the godfather1972
28737I'll think about it. :-()-: Drive while you thinking; I wanna get to the City this month!the godfather1972
28805No. His name is Luca Brasi. You wouldn't like him. :-()-: Who is he? :-()-: You really want to know? :-()-: Yes. Tell me. :-()-: You like spaghetti? :-()-: You know I love spaghetti. :-()-: Then eat your spaghetti and I'll tell you a Luca Brasi story.the godfather1972
28734You look terrif on the floor! :-()-: What are you, a dance judge? Go do your job; take a walk around the neighborhood... see everything is okay.the godfather1972
28790Kay, we weren't expecting you. You should call... :-()-: I've tried calling and writing. I want to reach Michael. :-()-: Nobody knows where he is. We know he's all right, but that's all.the godfather1972
28811We have something for your mother, for Sonny, we have the tie for Fredo and Tom Hagen gets the Reynolds pen... :-()-: And what do you want for Christmas? :-()-: Just you.the godfather1972
28850Buy me out?... :-()-: The hotel, the casino. The Corleone family wants to buy you out.the godfather1972
28813Michael, what are you doing? :-()-: Shhh, you be the long distance operator. Here. :-()-: Hello...this is Long Distance. I have a call from New Hampshire. Mr. Michael Corleone. One moment please.the godfather1972
28725You filthy guinea spoiled brat. Clean it up or I'll kick your head in. :-()-: Like hell I will.the godfather1972
28710This is Tom Hagen; I'm calling for Don Corleone, at his request. :-()-: Yes, I understand I'm listening. :-()-: You owe the Don a service. He has no doubt that you will repay it.the godfather1972

Showing the first 1000 rows.

Feature extraction and transformation APIs

We will use the convenient feature extraction and transformation APIs of Spark ML (the org.apache.spark.ml.feature package).

Step 3. Text Tokenization

We will use the RegexTokenizer to split each document into tokens. Calling setMinTokenLength() sets a minimum token length, so that all tokens shorter than the minimum are filtered away. See:

import org.apache.spark.ml.feature.RegexTokenizer

// Set params for RegexTokenizer
val tokenizer = new RegexTokenizer()
  .setPattern("[\\W_]+") // split on runs of non-word characters and underscores (whitespace, punctuation)
  .setMinTokenLength(4) // filter away tokens with length < 4
  .setInputCol("corpus") // name of the input column
  .setOutputCol("tokens") // name of the output column

// Tokenize document
val tokenized_df = tokenizer.transform(corpusDF)
import org.apache.spark.ml.feature.RegexTokenizer
tokenizer: org.apache.spark.ml.feature.RegexTokenizer = regexTok_5380a11bc0d5
tokenized_df: org.apache.spark.sql.DataFrame = [id: bigint, corpus: string ... 3 more fields]
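To make the effect of these settings concrete, here is a plain-Scala sketch of what the tokenizer does to a single document under its default options (lower-casing on, and the pattern treated as a delimiter). The object `TokenizerSketch` and the function `tokenize` are hypothetical stand-ins for illustration, not the RegexTokenizer implementation itself.

```scala
// Illustrative sketch only: TokenizerSketch.tokenize is a hypothetical plain-Scala
// approximation of the RegexTokenizer configuration used in this notebook.
object TokenizerSketch {
  def tokenize(text: String, minTokenLength: Int = 4): Seq[String] =
    text.toLowerCase                     // lower-case the document first
      .split("[\\W_]+")                  // split on runs of non-word characters and underscores
      .toSeq
      .filter(_.length >= minTokenLength) // keep tokens with at least minTokenLength characters

  // One of the conversations from the corpus (id 47246, "entrapment").
  val example: Seq[String] =
    tokenize("Well...What are we doing? :-()-: You're being useless. I'm making us rich.")
  // example == Seq("well", "what", "doing", "being", "useless", "making", "rich")
}
```

Note how the separator `:-()-:` vanishes entirely (it contains only non-word characters) and how contractions such as `You're` break into fragments (`you`, `re`) that are then dropped by the length filter.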
display(tokenized_df.sample(false,0.001,1234L)) 
28770You had better bring a few bottles home with you, my friend; you'll need help sleeping tonight. :-()-: This one could seduce the devil. A body! and eyes as big and black as olives. :-()-: I know about what you mean! :-()-: This was a beauty. Right, Calo? :-()-: Beautiful all over, eh? :-()-: And hair. Black and curly, like a doll. And such a mouth.the godfather1972["better","bring","bottles","home","with","friend","need","help","sleeping","tonight","this","could","seduce","devil","body","eyes","black","olives","know","about","what","mean","this","beauty","right","calo","beautiful","over","hair","black","curly","like","doll","such","mouth"]
47246Well...What are we doing? :-()-: You're being useless. I'm making us rich.entrapment1999["well","what","doing","being","useless","making","rich"]
37839I like you. I don't know what it is exactly. :-()-: My tits? :-()-: No, no, it's your energy or your attitude or the way you carry yourself or... :-()-: Christ, you're not a fag are you? Because I don't want to be wasting my time.being john malkovich1999["like","know","what","exactly","tits","your","energy","your","attitude","carry","yourself","christ","because","want","wasting","time"]
26952I'm all right -- stand by to return fire! Mr. Scott, transfer power to the phaser banks -- :-()-: Oh, God, sir, I dinna think so... :-()-: What's wrong? :-()-: They've knocked out the damn automation center. I've got no control over anything!star trek iii: the search for spock1984["right","stand","return","fire","scott","transfer","power","phaser","banks","dinna","think","what","wrong","they","knocked","damn","automation","center","control","over","anything"]
72094Plasma, ma'am... from the planet. :-()-: Bug Batteries... According to Military Intelligence, it'll be random and light. Drop status ?starship troopers1997["plasma","from","planet","batteries","according","military","intelligence","random","light","drop","status"]
2895I've got you, Dr. Evil! :-()-: Well done, Mr. Powers. We're not so different, you and I. It's true, you're British, and I'm Belgian. You have a full head of hair, mine is slightly receding. You're thin, I'm about forty pounds overweight. OK, we are different, I'm not making a very good point. However, isn't it ironic, Mr. Powers, that the very things you stand for-- swinging, free love, parties, distrust of authority- are all now, in the Nineties, considered to be... evil? Maybe we have more in common than you care to admit. :-()-: No, man, what we swingers were rebelling against were uptight squares like you, whose bag was money and world domination. We were innocent, man. If we'd known the consequences of our sexual liberation, we would have done things differently, but the spirit would have remained the same. It's freedom, man. :-()-: Your freedom has cause more pain and suffering in the world than any plan I ever dreamed of. Face it, freedom failed. :-()-: That's why right now is ...austin powers: international man of mystery1997["evil","well","done","powers","different","true","british","belgian","have","full","head","hair","mine","slightly","receding","thin","about","forty","pounds","overweight","different","making","very","good","point","however","ironic","powers","that","very","things","stand","swinging","free","love","parties","distrust","authority","nineties","considered","evil","maybe","have","more","common","than","care","admit","what","swingers","were","rebelling","against","were","uptight","squares","like","whose","money","world","domination","were","innocent","known","consequences","sexual","liberation","would","have","done","things","differently","spirit","would","have","remained","same","freedom","your","freedom","cause","more","pain","suffering","world","than","plan","ever","dreamed","face","freedom","failed","that","right","very","groovy","time","still","have","freedom","also","have","responsibility","really","there","nothing","more","pathetic","than","aging","hipster"]
51986What do you mean "he didn't talk?" You sat there for an hour? :-()-: No, he just sat there and counted the seconds until the session was over. It was pretty impressive, actually. :-()-: Why would he do that? :-()-: To show me he doesn't have to talk to me if he doesn't want to. :-()-: Oh, what is this? Some kind of staring contest between two kids from the "old neighborhood?" :-()-: I won't talk first.good will hunting1997["what","mean","didn","talk","there","hour","just","there","counted","seconds","until","session","over","pretty","impressive","actually","would","that","show","doesn","have","talk","doesn","want","what","this","some","kind","staring","contest","between","kids","from","neighborhood","talk","first"]
47714We shall want a full autopsy ... :-()-: With particular emphasis on the cranial and oral areas. :-()-: Keep him in cold storage till the reports in. Then send him to Taxidermy. He's a museum piece.escape from the planet of the apes1971["shall","want","full","autopsy","with","particular","emphasis","cranial","oral","areas","keep","cold","storage","till","reports","then","send","taxidermy","museum","piece"]
35804Why don't you read something after the break? DOUG What here? Weren't you listening to what I just said? :-()-: You used to read. :-()-: Well not any more, now I'm a serious writer and above this crap.asylum2005["read","something","after","break","doug","what","here","weren","listening","what","just","said","used","read","well","more","serious","writer","above","this","crap"]
35110We need all our fuel anyway. :-()-: Wait -- wait -- don't get up tight -- what I meant was we'd need a whole drum for that -- :-()-: Sit down -- we'll talk about it.apocalypse now1979["need","fuel","anyway","wait","wait","tight","what","meant","need","whole","drum","that","down","talk","about"]
72299There's no way to know. V'ger expected it to be a machine -- some single entity. All of us here may be reduced into patterns... :-()-: That seems to be what it has planned -- to have the Creator physically present here.star trek: the motion picture1979["there","know","expected","machine","some","single","entity","here","reduced","into","patterns","that","seems","what","planned","have","creator","physically","present","here"]
56902It's just the two raptors, right? You're sure the third one's contained? :-()-: Yes, unless they figured out how to open doors.jurassic park1993["just","raptors","right","sure","third","contained","unless","they","figured","open","doors"]
75338You smell that? :-()-: It's coming. Run.the relic1997["smell","that","coming"]
64361You're looking pretty chipper this morning. :-()-: I'm still here, aren't I? I may as well enjoy myself. I'm going to go to school today. Dad, I want to apologize for yesterday. The car is a classic. Use it in the best of health.peggy sue got married1986["looking","pretty","chipper","this","morning","still","here","aren","well","enjoy","myself","going","school","today","want","apologize","yesterday","classic","best","health"]
12637Grandpa, if you think of something hard enough, can you make it happen? :-()-: Apparently so.hope and glory1987["grandpa","think","something","hard","enough","make","happen","apparently"]
29473That porter was Gray and the gentleman of consequence who couldn't swallow the shame of it -- who took my last paltry savings to hire Gray -- :-()-: MacFarlanethe body snatcher1945["that","porter","gray","gentleman","consequence","couldn","swallow","shame","took","last","paltry","savings","hire","gray","macfarlane"]
29545Was the paralysis immediate? :-()-: No, Doctor. She seemed to get better, then about six months later she began to complain of pain in her back -- :-()-: How long after that was the paralysis complete? :-()-: Nearly a year. :-()-: Any attacks of pain since? :-()-: Yes, Doctor. :-()-: Is her pain sporadic or constant? :-()-: It comes at intervals. They used to be months apart -- but they've been growing more frequent -- much more frequent. :-()-: See here, child, when you have this pain in your back, where is it?the body snatcher1945["paralysis","immediate","doctor","seemed","better","then","about","months","later","began","complain","pain","back","long","after","that","paralysis","complete","nearly","year","attacks","pain","since","doctor","pain","sporadic","constant","comes","intervals","they","used","months","apart","they","been","growing","more","frequent","much","more","frequent","here","child","when","have","this","pain","your","back","where"]
60698Susan, listen to me: you handled that insect almost as much as Siri... :-()-: It didn't bite me. :-()-: I know. But if it was carrying something...there's a chance you could have been exposed.mimic1997["susan","listen","handled","that","insect","almost","much","siri","didn","bite","know","carrying","something","there","chance","could","have","been","exposed"]
7358Careful! They're not supposed to hurt you. :-()-: You've got to let me go!erik the viking1989["careful","they","supposed","hurt"]
7341How do we know this is the way? :-()-: We blew the Horn Resounding. :-()-: SHE blew the Horn Resounding.erik the viking1989["know","this","blew","horn","resounding","blew","horn","resounding"]
60403Anything? :-()-: Nothing.miami vice2006["anything","nothing"]
9454Wil... :-()-: Yes, general? :-()-: You've served me loyally, year after year, without complaining. I've thought hard about you this past winter. I want to free you, Wil. I want to give you your freedom, after this battle is fought. :-()-: Yes, general. :-()-: Wil, I'm giving you your freedom. Do you understand? :-()-: No. I guess. :-()-: You'd have money every year, so you wouldn't have to work. You can stay at Mount Vernon as long as you want...george washington2000["general","served","loyally","year","after","year","without","complaining","thought","hard","about","this","past","winter","want","free","want","give","your","freedom","after","this","battle","fought","general","giving","your","freedom","understand","guess","have","money","every","year","wouldn","have","work","stay","mount","vernon","long","want"]
38640...Why? :-()-: You shouldn't have taken the money...blood simple.1984["shouldn","have","taken","money"]
50929Do what I tell you, it's not a game. :-()-: It's all a game, don't bother me.the getaway1972["what","tell","game","game","bother"]
6006You hated being alone. Couldn't stand it. Busy every minute. Always plugged into something. :-()-: I didn't know what really being alone was. No one back here does.cast away2000["hated","being","alone","couldn","stand","busy","every","minute","always","plugged","into","something","didn","know","what","really","being","alone","back","here","does"]
11281So they're trying to kill you and your baby. Don't tell me. Your name also happens to be Rosemary. :-()-: No -- please listen! They're coming ... coming for me and my baby.halloween: the curse of michael myers1995["they","trying","kill","your","baby","tell","your","name","also","happens","rosemary","please","listen","they","coming","coming","baby"]
61727Good golly miss molly you are looking good today ! :-()-: Thank you. :-()-: I meant him."murderland"2009["good","golly","miss","molly","looking","good","today","thank","meant"]
76457What sort of puppy are you looking for? :-()-: What sort have you got? :-()-: Pups. Bitches. From three to twelve months. Trained and untrained ones. White and brown ones. You understand? :-()-: Yeah. :-()-: We also provide 24-hour after-sale service. Were the puppy to fall sick or accidently die, we would unburden you, you understand? :-()-: Yes... Good, good... :-()-: So, what are you looking for? :-()-: What about an untrained pup, white... :-()-: How much of a hurry are you in? :-()-: Tomorrow? :-()-: I'm afraid the only pups currently available at such notice are brown and trained. But they are all very cheerful and have been thoroughly checked for diseases... :-()-: I see. How much? :-()-: Fifteen for a straight delivery. Twenty with the provision of a safe place. Visitors tend to find the second option more convenient. :-()-: ... Fine. I'll go for the safe place. :-()-: Have the money ready by 11am. We'll call you.the lost son1999["what","sort","puppy","looking","what","sort","have","pups","bitches","from","three","twelve","months","trained","untrained","ones","white","brown","ones","understand","yeah","also","provide","hour","after","sale","service","were","puppy","fall","sick","accidently","would","unburden","understand","good","good","what","looking","what","about","untrained","white","much","hurry","tomorrow","afraid","only","pups","currently","available","such","notice","brown","trained","they","very","cheerful","have","been","thoroughly","checked","diseases","much","fifteen","straight","delivery","twenty","with","provision","safe","place","visitors","tend","find","second","option","more","convenient","fine","safe","place","have","money","ready","11am","call"]
29180Delly, shhhhhh... :-()-: No... I can't... I have to... I can't...the majestic2001["delly","shhhhhh","have"]
74145Now who's the dreamer, Superman? Even you can't fly that fast! :-()-: We'll see how fast I can fly.superman1978["dreamer","superman","even","that","fast","fast"]
61223Victory is mine. I thank thee O Lord that in thy ... :-()-: Come on then. :-()-: What?monty python and the holy grail1975["victory","mine","thank","thee","lord","that","come","then","what"]
61567I understand. I saw the dress. I...I'm sorry. Are you all right? :-()-: There was an accident. I came here.mulholland dr.2001["understand","dress","sorry","right","there","accident","came","here"]
1404There goes your ride. :-()-: Let my daughter go or I'll take you out! :-()-: If you put down the gun, I promise not to drop her on the way down.air force one1997["there","goes","your","ride","daughter","take","down","promise","drop","down"]
1410The northern border's gotten a bit hairy. Their MiGs are playing tag with our Tomcats and our boys are just itching to engage. :-()-: Tell our boys to cool their jets. I don't need `em creating policy for me.air force one1997["northern","border","gotten","hairy","their","migs","playing","with","tomcats","boys","just","itching","engage","tell","boys","cool","their","jets","need","creating","policy"]
16791Mr. Dokos says that your father missed his height envelope by six inches. :-()-: He wants the entire roof taken off and lowered.life as a house2001["dokos","says","that","your","father","missed","height","envelope","inches","wants","entire","roof","taken","lowered"]
31919Heather? :-()-: Chase. Hi...new nightmare1994["heather","chase"]
39543Hi. My name is Violet. We sort of met in the elevator -- :-()-: Yeah, sure. I'm Corky. :-()-: I heard you working in here and I just wondered if you'd like a cup of coffee?bound1996["name","violet","sort","elevator","yeah","sure","corky","heard","working","here","just","wondered","like","coffee"]
50111Who is that guy? :-()-: Policy man in Queens. :-()-: What about the last of the big-time spenders. You make him?the french connection1971["that","policy","queens","what","about","last","time","spenders","make"]
11550Listen, Niki. My daughter's been missing five months. I've gone through a lot to find out what's happened to her. I just saw a girl killed. I will not let Tod slip out of my hands. You have to tell me where he is. :-()-: But then you'll forget about me. :-()-: Where is he, Niki?hardcore1979["listen","niki","daughter","been","missing","five","months","gone","through","find","what","happened","just","girl","killed","will","slip","hands","have","tell","where","then","forget","about","where","niki"]
33729Please don't have me arrested, please! I didn't steal anything - you can search me! :-()-: How did you get in here? :-()-: I hid outside in the hall till the maid came to turn down your bed. She must've forgot something and when she went to get it, she left the door open. I sneaked in and hid till she finished. Then I just looked around - and pretty soon I was afraid somebody'd notice the lights were on so I turned them off - and then I guess, I fell asleep. :-()-: You were just looking around... :-()-: That's all. :-()-: What for? :-()-: You probably won't believe me. :-()-: Probably not. :-()-: It was for my report. :-()-: What report? To whom? :-()-: About how you live, what kind of clothes you wear - what kind of perfume and books - things like that. You know the Eve Harrington clubs - that they've got in most of the girls' high schools? :-()-: I've heard of them. :-()-: Ours was one of the first. Erasmus Hall. I'm the president. :-()-: Erasmus Hall. That's in Brooklyn, isn't it? :...all about eve1950["please","have","arrested","please","didn","steal","anything","search","here","outside","hall","till","maid","came","turn","down","your","must","forgot","something","when","went","left","door","open","sneaked","till","finished","then","just","looked","around","pretty","soon","afraid","somebody","notice","lights","were","turned","them","then","guess","fell","asleep","were","just","looking","around","that","what","probably","believe","probably","report","what","report","whom","about","live","what","kind","clothes","wear","what","kind","perfume","books","things","like","that","know","harrington","clubs","that","they","most","girls","high","schools","heard","them","ours","first","erasmus","hall","president","erasmus","hall","that","brooklyn","lots","actresses","come","from","brooklyn","barbara","stanwyck","susan","hayward","course","they","just","movie","stars"]
33735So there you are. It seemed odd, suddenly, your not being there... :-()-: Why should you think I wouldn't be? :-()-: Why should you be? After all, six nights a week - for weeks - of watching even Margo Channing enter and leave a theater- :-()-: I hope you don't mind my speaking to you... :-()-: Not at all. :-()-: I've seen you so often - it took every bit of courage I could raise- :-()-: To speak to just a playwright's wife? I'm the lowest form of celebrity... :-()-: You're Margo Channing's best friend. You and your husband are always with her - and Mr. Sampson... what's he like? :-()-: Bill Sampson? He's - he's a director. :-()-: He's the best. :-()-: He'll agree with you. Tell me, what do you between the time Margo goes in and comes out? Just huddle in that doorway and wait? :-()-: Oh, no. I see the play. :-()-: You see the play? You've seen the play every performance? But, don't you find it - I mean apart from everything else - don't you find it expensive? :-()-: Standing room does...all about eve1950["there","seemed","suddenly","your","being","there","should","think","wouldn","should","after","nights","week","weeks","watching","even","margo","channing","enter","leave","theater","hope","mind","speaking","seen","often","took","every","courage","could","raise","speak","just","playwright","wife","lowest","form","celebrity","margo","channing","best","friend","your","husband","always","with","sampson","what","like","bill","sampson","director","best","agree","with","tell","what","between","time","margo","goes","comes","just","huddle","that","doorway","wait","play","play","seen","play","every","performance","find","mean","apart","from","everything","else","find","expensive","standing","room","doesn","cost","much","manage"]
50408Now you better cool out a minute, boy. You already almost got your head blown to pieces. :-()-: Will you listen, dammit! :-()-: Don't piss me off, junior. Or I will repaint this office with your brains.jason lives: friday the 13th part vi1986["better","cool","minute","already","almost","your","head","blown","pieces","will","listen","dammit","piss","junior","will","repaint","this","office","with","your","brains"]
1482Elaine, I'm going back there. Just hold onto that stick and try to control this hunk of tin as best you can. :-()-: Ted, please be careful.airplane ii: the sequel1982["elaine","going","back","there","just","hold","onto","that","stick","control","this","hunk","best","please","careful"]
58256Sam the Man. :-()-: Hey, Ben. Thanks for coming down.lone star1996["thanks","coming","down"]
63856Could it be some kind of college initiation? :-()-: It's an initiation all right, but not of a college as you and I know them. Nothing alive looks like that! :-()-: Can't we get out of here? :-()-: I'm not sure... :-()-: What do you mean? :-()-: I'm not sure, myself. It's just a feeling I've had since the crash...Like I feel a cold chill all over.. ..Now this!orgy of the dead1965["could","some","kind","college","initiation","initiation","right","college","know","them","nothing","alive","looks","like","that","here","sure","what","mean","sure","myself","just","feeling","since","crash","like","feel","cold","chill","over","this"]
16241I don't know. :-()-: I think it's a different place for each person. :-()-: Did you have a dream?labor of love1998["know","think","different","place","each","person","have","dream"]
57930So what is it, Jack? What brings you up here? :-()-: A French & Indian army out of Fort Carillon's heading south to war against the English. I'm here to raise this county's militia to aid the British defense. :-()-: Folks here goin' to join in that fight? :-()-: We'll see in the morning...last of the mohicans1977["what","jack","what","brings","here","french","indian","army","fort","carillon","heading","south","against","english","here","raise","this","county","militia","british","defense","folks","here","goin","join","that","fight","morning"]
3319With grenadine, right? :-()-: When I was twenty. :-()-: Oooh, very sophisticated. Having fun?backdraft1991["with","grenadine","right","when","twenty","oooh","very","sophisticated","having"]
72563Daddy was washing Rachel. In the shower. What did you think that was about? :-()-: Sex. Of course.stepmom1998["daddy","washing","rachel","shower","what","think","that","about","course"]
52796What's the pumpkin for? :-()-: I brought it for Tommy. I figured making a Jack-O-Lantern would keep him occupied. :-()-: I always said you'd make a fabulous girl scout. :-()-: Thanks. :-()-: For that matter, I might as well be a girl scout tonight. I plan on making popcorn and watching Doctor Dementia. Six straight hours of horror movies. Little Lindsey Wallace won't know what hit her.halloween1978["what","pumpkin","brought","tommy","figured","making","jack","lantern","would","keep","occupied","always","said","make","fabulous","girl","scout","thanks","that","matter","might","well","girl","scout","tonight","plan","making","popcorn","watching","doctor","dementia","straight","hours","horror","movies","little","lindsey","wallace","know","what"]
48399This isn't my real life. It's just a glimpse... :-()-: Where's my real dad? :-()-: I don't know...the family man2000["this","real","life","just","glimpse","where","real","know"]
71496I'm looking for a man. :-()-: What kind of man? :-()-: A bowler.spare me1992["looking","what","kind","bowler"]
54028Donuts here any good? :-()-: I don't eat junk food.hostage2005/I["donuts","here","good","junk","food"]
58534Did she know where Nix was buried? :-()-: No. :-()-: Who else did? Did Valentin? :-()-: Yes. :-()-: Jesus!lord of illusions1995["know","where","buried","else","valentin","jesus"]
79299Oh, Mr. Donowitz - :-()-: Lee, Clarence . Please don't insult me. Call me Lee. :-()-: OK, sorry, Lee. I just wanna tell you "Coming Home in a Body Bag" is one of my favorite movies. After "Apocalypse Now" I think it's the best Vietnam movie ever. :-()-: Thank you very much, Clarence. :-()-: You know, most movies that win a lot of Oscars, I can't stand. "Sophie's Choice", "Ordinary People", "Kramer vs. Kramer", "Gandhi". All that stuff is safe, geriatric, coffee-table dog shit. :-()-: I hear you talkin' Clarence. We park our cars in the same garage. :-()-: Like that Merchant-Ivory clap-trap. All those assholes make are unwatchable movies from unreadable books.true romance1993["donowitz","clarence","please","insult","call","sorry","just","wanna","tell","coming","home","body","favorite","movies","after","apocalypse","think","best","vietnam","movie","ever","thank","very","much","clarence","know","most","movies","that","oscars","stand","sophie","choice","ordinary","people","kramer","kramer","gandhi","that","stuff","safe","geriatric","coffee","table","shit","hear","talkin","clarence","park","cars","same","garage","like","that","merchant","ivory","clap","trap","those","assholes","make","unwatchable","movies","from","unreadable","books"]
83051What? :-()-: I say, it wouldn't be fair to you... or to me. :-()-: Nor to Elizabeth. :-()-: No. Nor to Elizabeth. :-()-: We all have our feelings. I know that I have mine. And... I wouldn't want to hurt yours.young frankenstein1974["what","wouldn","fair","elizabeth","elizabeth","have","feelings","know","that","have","mine","wouldn","want","hurt","yours"]
41452Hey -- :-()-: What is it now? :-()-: You're going to have to do it, aren't you? :-()-: Do what? :-()-: Kill me.the crying game1992["what","going","have","aren","what","kill"]
6363What about... The money? :-()-: What about this situation makes you think I can answer that question right now?confidence2003["what","about","money","what","about","this","situation","makes","think","answer","that","question","right"]
29976How long's it been? :-()-: Little over two hours.the thing1982["long","been","little","over","hours"]
29914I suppose... well, it's possible someone might have lifted it from me. But... :-()-: That key ring of yours is always hooked to your belt. Now how could somebody get to it without you knowing? :-()-: Look, I haven't been near that... that refrigerator.the thing1982["suppose","well","possible","someone","might","have","lifted","from","that","ring","yours","always","hooked","your","belt","could","somebody","without","knowing","look","haven","been","near","that","that","refrigerator"]
10652I want to come home, of course I do, I'd have to be mad not to want that. It's just that Marcus trusts me. :-()-: Let him trust Quintus. :-()-: Quintus is overly idealistic. :-()-: I never knew a more idealistic man than you. :-()-: Me? Well, I believe in Rome... you'd have to after what I've seen, how people outside the empire treat each other. :-()-: I don't even want to imagine the things you've seen... :-()-: What you don't want to imagine is the things I've done.gladiator2000["want","come","home","course","have","want","that","just","that","marcus","trusts","trust","quintus","quintus","overly","idealistic","never","knew","more","idealistic","than","well","believe","rome","have","after","what","seen","people","outside","empire","treat","each","other","even","want","imagine","things","seen","what","want","imagine","things","done"]
82023Can you fill me in here? :-()-: Sure. We have no idea what's going on. :-()-: Thank you. :-()-: Come on, let's at least see if we can find Dr. Pemberton. :-()-: You go ahead. I'll stick with Loveless.wild wild west1999["fill","here","sure","have","idea","what","going","thank","come","least","find","pemberton","ahead","stick","with","loveless"]
40483And Spring Fling. :-()-: Okay.buffy the vampire slayer1992["spring","fling","okay"]
22121My old lady swallowed a bottle of pills one day while I was at school. :-()-: God. :-()-: The thing that really got to me... she didn't leave a note. Nothing. I've always hated her for that. :-()-: Does it still hurt? :-()-: Naw. You're alone in this world no matter what kinda folks or background you had. Nothing hurts, pard, once you got that one down.an officer and a gentleman1982["lady","swallowed","bottle","pills","while","school","thing","that","really","didn","leave","note","nothing","always","hated","that","does","still","hurt","alone","this","world","matter","what","kinda","folks","background","nothing","hurts","pard","once","that","down"]
33374Now what's wrong? :-()-: I've completely lost their signal. :-()-: Can you get them back? :-()-: I'm trying.alien1979["what","wrong","completely","lost","their","signal","them","back","trying"]
22207Cell phone. :-()-: Shit!panic room2002["cell","phone","shit"]
16716The Avatar. I like the sound of it. :-()-: Sigurd the Volsung slew Fafnir with that blade... See the line where Regin welded the break?legend1985["avatar","like","sound","sigurd","volsung","slew","fafnir","with","that","blade","line","where","regin","welded","break"]
62229And I was thinking this could be our last time. Alone. Together. You know? :-()-: Except for the hot affairs we'll have twice a year. :-()-: Except for that.my best friend's wedding1997["thinking","this","could","last","time","alone","together","know","except","affairs","have","twice","year","except","that"]
69639They get hair all over the place. :-()-: They're Yorkies and they don't shed.shampoo1975["they","hair","over","place","they","yorkies","they","shed"]
43891What happened? :-()-: They sort of got away. :-()-: I see. Well, get back out on the street and find them before I "sort of" kill you.crime spree2003["what","happened","they","sort","away","well","back","street","find","them","before","sort","kill"]
63415Big king, too bad... :-()-: Just wait till you hear... :-()-: Hear what? :-()-: McMurphy killed two attendants and escaped... :-()-: When? :-()-: Yesterday... :-()-: Who told you that? :-()-: Gary Blinker...one flew over the cuckoo's nest1975["king","just","wait","till","hear","hear","what","mcmurphy","killed","attendants","escaped","when","yesterday","told","that","gary","blinker"]
56420What are you doing, Travis? :-()-: I been told to take your car in, Sir. :-()-: Why? :-()-: I dunno, Sir. Brought you up a Chevy.jennifer eight1992["what","doing","travis","been","told","take","your","dunno","brought","chevy"]
81054With their berets... :-()-: ...their Leopard Skin Berets....wag the dog1997["with","their","berets","their","leopard","skin","berets"]
34391AND A SONG SOMEONE SINGS ONCE UPON A DECEMBER. :-()-: Who are you?!anastasia1997["song","someone","sings","once","upon","december"]
57048Ruby, come on. You witnessed a brutal triple murder and you're having trouble accepting it. Think about what you're saying. You really expect a jury to believe that Jason has a mystery killer living in his tummy? :-()-: I know how this sounds, but that's what happened.freddy vs. jason2003["ruby","come","witnessed","brutal","triple","murder","having","trouble","accepting","think","about","what","saying","really","expect","jury","believe","that","jason","mystery","killer","living","tummy","know","this","sounds","that","what","happened"]
851They just landed in the desert. :-()-: How much time is left?the fifth element1997["they","just","landed","desert","much","time","left"]
3879What? :-()-: Seven guys. What was it you said? You were "just starting to believe I wasn't the guy people said".basic2003["what","seven","guys","what","said","were","just","starting","believe","wasn","people","said"]
61426No. You are mistaken. Prince Albert, my husband, had typhoid fever. I asked what was wrong with my son. :-()-: The same, your Majesty.mrs brown1997["mistaken","prince","albert","husband","typhoid","fever","asked","what","wrong","with","same","your","majesty"]
7060Ow! :-()-: Would yous boys excuse us a second? Loretta, you too.drop dead gorgeous1999["would","yous","boys","excuse","second","loretta"]
77201What kind of mine? :-()-: I don't know, and I wasn't about to mess with it. :-()-: Should have blown already. Delayed fuse, that's Vietnam stuff. :-()-: Maybe that's all the Iraqis could afford, okay? Maybe they got it on discount. Maybe the fuse is messed up. Or maybe it's going to go off in two seconds, and we won't have to worry about getting Jaeger down off there, all we'll have to worry about is finding the pieces.three kings1999["what","kind","mine","know","wasn","about","mess","with","should","have","blown","already","delayed","fuse","that","vietnam","stuff","maybe","that","iraqis","could","afford","okay","maybe","they","discount","maybe","fuse","messed","maybe","going","seconds","have","worry","about","getting","jaeger","down","there","have","worry","about","finding","pieces"]
38786What time are visiting hours? :-()-: I've made arrangements with Dr. Gynde for 10:30. But Jeffrey, you'll have to walk over; I need the car this morning. :-()-: Well. Okay. :-()-: Jeffrey, when you see your father. :-()-: Yeah? :-()-: He doesn't know you're out of school. He thinks it's a vacation for you. :-()-: What? :-()-: It would be too much for him. So please let him think as he does, that you're home just to see him. :-()-: Thanks a lot, Mom. :-()-: .Jeffrey!. Nobody wanted you to leave school and go to work in the store. maybe going back to school will be an option one day. I hope so.bloodmoon1997["what","time","visiting","hours","made","arrangements","with","gynde","jeffrey","have","walk","over","need","this","morning","well","okay","jeffrey","when","your","father","yeah","doesn","know","school","thinks","vacation","what","would","much","please","think","does","that","home","just","thanks","jeffrey","nobody","wanted","leave","school","work","store","maybe","going","back","school","will","option","hope"]
7787Oh my god OH MY GOD... :-()-: Starck!event horizon1997["starck"]
63916You're just jealous it was me in the trunk with her and not you. :-()-: You're right.out of sight1998["just","jealous","trunk","with","right"]
50692You mean Gandhi? :-()-: Back in South Africa... long time ago. :-()-: What was he like? :-()-: Lots of hair... and a little like a college freshman -- trying to figure everything out. :-()-: Well, he must've found some of the answers...gandhi1982["mean","gandhi","back","south","africa","long","time","what","like","lots","hair","little","like","college","freshman","trying","figure","everything","well","must","found","some","answers"]
48749Another delay... With only forty-two minutes left. :-()-: It'll be close -- but there's still a margin of safety. :-()-: Let's find what the devil's holding them up! Contact the Proteus!fantastic voyage1966["another","delay","with","only","forty","minutes","left","close","there","still","margin","safety","find","what","devil","holding","them","contact","proteus"]
76052Where did you two meet? :-()-: In a lake. :-()-: I might have known. As I was telling you earlier, I'm the world champion free diver. :-()-: Congratulations. :-()-: Some people say it's the most virile sport in the world. One has to admit that when you see those men diving head first in that deep blue sea, all muscles contracted in one super human effort...le grand bleu1988["where","meet","lake","might","have","known","telling","earlier","world","champion","free","diver","congratulations","some","people","most","virile","sport","world","admit","that","when","those","diving","head","first","that","deep","blue","muscles","contracted","super","human","effort"]
76040You're leaving? :-()-: Yes... Could you please give this to Enzo. :-()-: Of course.le grand bleu1988["leaving","could","please","give","this","enzo","course"]
3025Ah ... From Trubshaw's. My shoemaker. :-()-: A kipper. Or a red herring? What were they investigating?the avengers1998["from","trubshaw","shoemaker","kipper","herring","what","were","they","investigating"]
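The rows above pair each conversation with its token list. The tokens look as if they were produced by lowercasing the text, splitting on runs of non-word characters, and keeping only tokens of four or more characters. The following is a minimal plain-Scala sketch of that behaviour, not the notebook's actual tokenizer cell: the object name `TokenizeSketch`, the function `tokenize`, the split pattern `\W+` and the default `minLength = 4` are all assumptions inferred from the output shown here.

```scala
// Hypothetical helper: reproduces the token lists shown in the table above
// under the ASSUMPTION that tokens are lowercased, split on non-word
// characters, and kept only if they have at least `minLength` characters.
object TokenizeSketch {
  def tokenize(text: String, minLength: Int = 4): Seq[String] =
    text.toLowerCase
      .split("\\W+")                 // split on runs of non-word characters
      .filter(_.length >= minLength) // drop short tokens such as "you", "it", "s"
      .toSeq

  def main(args: Array[String]): Unit = {
    // Conversation 75338 from the table above
    val dialog = "You smell that? :-()-: It's coming. Run."
    println(tokenize(dialog).mkString(", ")) // prints: smell, that, coming
  }
}
```

Under these assumptions the sketch returns the same tokens as the table for the short rows (for example `starck` for conversation 7787). In Spark ML the equivalent step would normally be done with `RegexTokenizer`, which offers `setPattern` and `setMinTokenLength`.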
display(tokenized_df.sample(false,0.001,123L).select("tokens"))
["flight","full","sorry","believe","flight","closed","please","check","full","please","could","check"]
["wants","money","give","money","tell","wants","money","from","your","bank","fuckin","give","down","there","understand","wants","money","that","your","bank","eight","hundred","pounds","jesus"]
["kittle","where","been","hiding","grease","fist","been","looking"]
["defendant","tied","deceased","boat","with","that","last","would","those","cleats","have","lined","miyamoto","there","been","hurry","cast","coulda","left","this","line","behind","carl","boat","replaced","later","with","that","your","inference","pretty","darn","clear"]
["malkovich","name","craig","schwartz","explain","operate","little","business","that","simulates","clientele","experience","being","actually","simulates","sure","after","fashion","sure","would","pale","comparison","actual","experience"]
["estimating","genesis","hours","present","speed","hold","speed","scott"]
["this","jack","groppi","place","yeah","here","know","where","follow"]
["shall","make","stay","with","perhaps","knew","knew","would","love","more","than","waking","world","there","more","than","that","perhaps","show","that","could","lavished","affection","there","doubt","about","that","life","very","different","with","madame","claudia","imagine"]
["well","that","expected","they","have","understand","these","start","problems","this","pots","pans","this","precise","business","write","them","letter","they","withholding","payment","well","sure","would","would","wouldn","worry","about","right","these","days"]
["seems","were","last","carlton","alive","last","gregory","alive","what","makes","think","last"]
["that","same","chick","other","jennifer","gave","make","over","looks","like","helluva","more","than","make","over","there","surgery","involved"]
["kiss","booty","plan"]
["where","know","amanda","from","from","around","live","around","here","yeah","gotta","girlfrlend"]
["what","what","this","place","like","scumbag","yard","sale"]
["damn","this","better","smooth","like","takin","candy","from","fuckin","baby"]
["have","boyfriend","really","interesting","seeing","anyone","mean","seriously","maybe","know","really","bateman","opens","cupboard","where","there","very","bateman","opens","cupboard","where","there","neatly","ordered","weapons","rifle","chain","duct","tape","twine","nail","jean","feel","fulfilled","mean","your","life","well","guess","long","time","focused","work","think","really","begun","think","about","changing","myself","know","developing","growing","growing","glad","said","that"]
["right","shit","peggy","going","snap","couldn","help","loved","three","awake"]
["where","friend","coming"]
["being","street","little","different","than","sitting","office","enjoying","yourself","know","this","game","parker","tell"]
["late","didn","much","sleep","well"]
["honey","what","wrong","what","just","have","only","review","orson","welles","when","made","citizen","kane","already","still","young","this","part","your","life","when","supposed","struggling","know","sometimes","scared","this","good","gonna"]
["what","last","name","healy"]
["must","nice","take","days","from","your","work","well","have","more","than","days","sort","editor","anymore","right","first","time","said","loud","they","fired","more","like","leave","yeah","they","fired","seem","upset","delayed","shock","maybe","know","could","talk","back","wanted","another","magazine","someplace","just","sure","want","guess","have","figure","until","home"]
["still","believe","good","better","left","wanted","career","didn","ever","want","part","something","special"]
["where","your","warders","lying","gutter","where","they","belong"]
["assholes","this","happen","twice","year"]
["take","what","people","make","ugly","make","others","believe","what","want","them","should","have","been","found","guilty","shouldn","have","gotten","then","would","have","gotten","your","money","killed","didn","joanne"]
["what","want","know","everything","start","beginning","born","outside","london","only","minister","master","harrow","grandfather","bishop","church","church"]
["other","lancelot","lancelot"]
["wait","minute","could","take","jesus","slower","than","hell","yeah","weighs","better","than","thirty","tons","they","could","stop"]
["this","lady","last","stop","just","with","guys","sorry","business","place","lady"]
["home","after","lewton","they","after","takin","cabin","woods","only","couple","miles","from","house","keep","highways","they","lookin"]
["this","what","consumer","recreation","services","their","building","they"]
["juice","looks","like","been","laid","years","might","able","adapt","shut"]
["marks","bones","wasn","natural","quiet","please","everyone"]
["afaid","there","going","about","denying","permission","land","direct","violation","convention","well","hope","crew","back","safely","fortunately","they","well","glad","about","that"]
["damn","what","hell","matter","other","people","have","birthdays","treating","yours","like","funeral","bones","want","lectured","what","want","damn","there","girl","here","know","this","nothing","with","this","about","flying","goddamn","computer","console","when","wanna","hopping","galaxies","spare","your","notions","poetry","please","have","assigned","duties","bull","hiding","hiding","behind","rules","regulations","hiding","from","from","yourself","admiral"]
["serious","think","what","think","think"]
["everybody","keeps","starin","yeah","know","what","mean","what","know","look","good","dressed","clean","real","nice","sure","down","boulevard","lookin","like","that","ever","anyway","think","here","where","they","just","about","chew","your","food","where"]
["what","time","diner","tonight","eight","clock","kross","bringing","very","bright","watch","yourself","with","this","girl","taking","know"]
["ever","have","problems","have","tell","about","would","leave","apartment","three","times","rainy","night","with","suitcase","come","back","three","times","likes","wife","welcomes","home","that","salesman","wife","didn","work","today","homework","more","interesting","what","interesting","about","butcher","knife","small","wrapped","newspaper","nothing","thank","heaven","hasn","gone","into","wife","bedroom","wouldn","dare","answer","that","lisa","there","something","terribly","wrong"]
["house","call","cruiser"]
["haven","seen","since","back","attitude","that","girl","turned","down","bunch","money","would","been","great","angle","beauty","beast","must","getting","thought","something","going","there"]
["perfect","child","yeah"]
["jack","mean","from","what","tell","operation","already","over","extended","sales","price","advantage","could","offer","would","easily","matched","larger","supplier"]
["like","each","other","heads","kate","wife"]
["like","hang","with","murderers","what","know","roxy","course","knew"]
["thought","wanted","writer","then","write","anywhere","here","while","still","carrie","come","leave","anytime","never","leave","once","start","talking","about","tenure","vacation","parking","privileges","shit","just","california","right","before","late","just","like","that","just","like","that","load","lincoln","point","west","stop","when","fucking","ocean"]
["answer","sidalee","trying","take","hook","just","good","example","press","agent","eats","columnists","dirt","expected","call","manna"]
["hello","steve","glad","could","come","call","would","coffee","there"]
["soft","supple","like","lady","moisturize","regularly"]
["last","thing","need","that","exactly","what","said","before","said","great","sensational","idea"]
["fugasi","corporate","papers","have","legit","gotta","score","clean","talk","suits","gotta","banker"]
["they","just","pushed","schedule","skywire","apps","fast","going","there","second","place","plus","every","time","jammed","gary","inspiration","like","that","with","your","counselor","mine","barely","remembers","take","shower","right","right","does","ever","just","like","hand","code","maybe","once","wrote","anyway","compulsive","more","like","have","little","trouble","trusting","people","that","long","story","that","interesting"]
["fail","little","christianity","could","hurt","anyone","here","anyway","just","showing","this","example","what","available","didn","think","interested"]
["guys","what","sitch","bored","what","think","please"]
["guys","from","movie","hate","guys"]
["louise","where","going","oklahoma","city","jimmy","gonna","wire","some","money","then","talked","tell","didn","tell","that","something","gotta","straight","darryl","been","callin","hornet","makin","kinds","noise","when","talk","cannot","anything","about","this","gotta","make","sure","everything","sounds","normal","called","asshole","morning","wasn","even","home","know","what","about","should","been","tellin","that","last","years","think","darryl","having","affair","think","darryl","mature","enough","conduct","affair","think","fools","around","thelma","going","mexico","think","make","half","days","going","have","haul","this","mean","have","know","this","game","deep","shit","gotta","know","what","gonna","know","know","what","askin","fall","apart","goddamnit","thelma","every","time","trouble","blank","plead","insanity","some","such","shit","this","time","this","time","everything","changed","whatever","want","going","mexico","going","coming","with"]
["what","expect","speeches","mean","expect","anything","minute","hadn","cuite","waked"]
["know","mistake","wasn","intentional","would","want","hurt","them","know"]
["what","want","talking","that","what","said","what","think","said"]
["think","about","chance","caught","motive","like","that","could","divide","jury","years","think","took","mother","took","yours","sympathy","factor","maternal","abandonment","causes","serious","deviant","behavior","certainly","fucked","made","have","with","psychopath","that","right","that","longer","virgin","gotta","those","rules"]
["what","exactly","they","they","said","hundred","thirty","fifth","twelfth","they","didn","address","told","what","they","said","nothing","else","nothing","they","know","were","they","asked","they","said","more","than","address","they","asked","then","told","what","corner","this","bullshit","what","fuck"]
["alright","honey","just","calm","down","take","deep","breath","step","circle","would","psycho","babble","bullshit","there","pictures","internet"]
["after","family","worked","little","piece","land","near","savannah","while","down","there","then","well","they","made","hard","every","they","could","finally","daddy","figured","promised","land","this","direction","that","time","sick","farming","didn","want","touch","another","ever","wouldn","come","with","daddy","took","pretty","hard","little","sister","headed","without","they","little","place","south","silverado","guess","they","done","okay","good","enough","anyway","that","when","wrote","last","time","said","they","needed","help","work","place","that","almost","nine","months","wrote","letter","took","while","find","when","just","right","time","where","were","chicago","working","slaughter","houses"]
["that","that","list","what","matter","know","this","trick"]
["been","back","dallas","going","what","exactly","trying","prove","that","bombing","dallas","have","been","destroy","bodies","those","firemen","their","deaths","reason","them","wouldn","have","explained","those","very","serious","allegations","agent","scully","know"]
["what","choices","about","hundred","miles","nothing","each","direction","where","would","they","going","choices","them","wrong"]
["that","little","weasel","ever","walked","here","wouldn","serve","slap","face","kick","nuts","thought"]
["want","facts","indy","have","none","give","prepared","take","things","faith","call","donovan","marcus","tell","take","that","ticket","venice","tell","take"]
["that","moron","honest","mistake","ridgeway","ridgeroad","ridgeway","road","everyone","some","sleep","leaving","morning"]
["sure","fine","really","okay","sleep","morning","good","night","good","night"]
["well","sure","that","speaks","very","well","your","parents","forced","choose","between","security","country","security","your","which","would","pick","while","hesitate","permit","suggest","that","they","same","your","country","your","doing","brean","that","what","doing","here","what","thought","were","doing"]
["what","doing","looking","phone","think","that","watch","think","that","stanley","watch","stanley","stanley","knew","stanley","knew","deal","when","signed","deal","changed","deal","changed","deal","changed","what","money","money","want","money","money","think","this","money","this","credit","credit","paalll","always","knew","couldn","take","credit","that","thing","gonna","dickheads","from","filmschool","take","nuts","nuts"]
["here","month","increase","advance","needn","speak","about","proud","woman","again","another","months","course","keep","like","hell","hell","give","orders","watch","your","manners","your","sicilian","street","there"]
["never","brought","anyone","down","here","before","honored","there","something","about","what","told","other","night","head","driving","batty","know","could","trust","completely","reliable"]
["have","best","answer","everything","seem","hopeful","always","this","sunny","ever","thought","must","bring","contagious","cause","everyone","agrees","immune","system","down","maybe","catch","something","didn","tell","didn","think","good","idea","come","patient","what","changed"]
["take","your","clothes","mike","first","about","both","same","time"]
["know","location","drop","maybe","half","before","that","location","gets","grape","vined","rest","world","gets","hipped","that","already","happened","hoss","naive","think","otherwise"]
["troops","here","that","rock","roll","detective","told","about","hebedeebuh","hebedeebuh","maybe","explosion"]
["what","proud","american"]
["have","board","bedroom","doors","where","going","sleep","family","room"]
["alright","sometimes","when","feel","weak","have","these","visions","what","mean","things","worst","fears","need","know","something","peter","possessed","once","like","birdson","took","father","lareaux","days","pull"]
["what","betty","have","mice","mean","ralph"]
["needed","break","where","about","hour","away","believe","slept","were","tired","here","twenty","thousand","like","throw","breakfast","what","dream","about","dream","asleep","dream","that","asleep","wake","think","smoke"]
["that","gives","idea","what","against","against","more","than","that","with","that","nutty","slogan","invented","reform","reds","with","rope"]
["thank","eventually","puff","well","thank","because","speaking","sort","thank","with","special","look","look","gives","know","loves","what","enchanting","picture","paint","future","together"]
["crapper","with","those","antenna","phones","sounds","like","taking","dump","size","butte","montana","bullworker","anyway","listen","they","gone","what","gone","kiss","tickets","nimrod","they","just","fuckin","gone","please","tell","have","gone","would","have","kiss","tick","just","check","whatever","were","wearing","last","night"]

Step 4. Remove Stopwords

We can easily remove stopwords using the StopWordsRemover().

If no list of stopwords is provided, the StopWordsRemover() falls back to a default list, shown below.

are,around,as,at,back,be,became,because,become,becomes,becoming,been,before,beforehand,behind,being,below,beside,besides,between,beyond,bill,both,bottom,but,by,call,can,cannot,cant,co,computer,con,could,
couldnt,cry,de,describe,detail,do,done,down,due,during,each,eg,eight,either,eleven,else,elsewhere,empty,enough,etc,even,ever,every,everyone,everything,everywhere,except,few,fifteen,fify,fill,find,fire,first,
five,for,former,formerly,forty,found,four,from,front,full,further,get,give,go,had,has,hasnt,have,he,hence,her,here,hereafter,hereby,herein,hereupon,hers,herself,him,himself,his,how,however,hundred,i,ie,if,
in,inc,indeed,interest,into,is,it,its,itself,keep,last,latter,latterly,least,less,ltd,made,many,may,me,meanwhile,might,mill,mine,more,moreover,most,mostly,move,much,must,my,myself,name,namely,neither,never,
nevertheless,next,nine,no,nobody,none,noone,nor,not,nothing,now,nowhere,of,off,often,on,once,one,only,onto,or,other,others,otherwise,our,ours,ourselves,out,over,own,part,per,perhaps,please,put,rather,re,same,
see,seem,seemed,seeming,seems,serious,several,she,should,show,side,since,sincere,six,sixty,so,some,somehow,someone,something,sometime,sometimes,somewhere,still,such,system,take,ten,than,that,the,their,them,
themselves,then,thence,there,thereafter,thereby,therefore,therein,thereupon,these,they,thick,thin,third,this,those,though,three,through,throughout,thru,thus,to,together,too,top,toward,towards,twelve,twenty,two,
un,under,until,up,upon,us,very,via,was,we,well,were,what,whatever,when,whence,whenever,where,whereafter,whereas,whereby,wherein,whereupon,wherever,whether,which,while,whither,who,whoever,whole,whom,whose,why,will,
with,within,without,would,yet,you,your,yours,yourself,yourselves

You can use getStopWords() to see the list of stopwords that will be used.

In this example, we will specify a list of stopwords for the StopWordsRemover() to use. We do this so that we can add on to the list later on.

display(dbutils.fs.ls("dbfs:/tmp/stopwords")) // check if the file already exists from earlier wget and dbfs-load
dbfs:/tmp/stopwords    stopwords    2237

If the file dbfs:/tmp/stopwords already exists, skip the next two cells; otherwise, download it and load it into DBFS by uncommenting and evaluating the next two cells.

%sh 
wget http://ir.dcs.gla.ac.uk/resources/linguistic_utils/stop_words -O /tmp/stopwords # uncomment '//' at the beginning and repeat only if needed again
--2019-05-31 08:23:58-- http://ir.dcs.gla.ac.uk/resources/linguistic_utils/stop_words
Resolving ir.dcs.gla.ac.uk (ir.dcs.gla.ac.uk)... 130.209.240.253
Connecting to ir.dcs.gla.ac.uk (ir.dcs.gla.ac.uk)|130.209.240.253|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 2237 (2.2K)
Saving to: ‘/tmp/stopwords’

0K .. 100% 320M=0s

2019-05-31 08:23:59 (320 MB/s) - ‘/tmp/stopwords’ saved [2237/2237]
%fs 
cp file:/tmp/stopwords dbfs:/tmp/stopwords 
res41: Boolean = true
// List of stopwords
val stopwords = sc.textFile("/tmp/stopwords").collect()
stopwords: Array[String] = Array(a, about, above, across, after, afterwards, again, against, all, almost, alone, along, already, also, although, always, am, among, amongst, amoungst, amount, an, and, another, any, anyhow, anyone, anything, anyway, anywhere, are, around, as, at, back, be, became, because, become, becomes, becoming, been, before, beforehand, behind, being, below, beside, besides, between, beyond, bill, both, bottom, but, by, call, can, cannot, cant, co, computer, con, could, couldnt, cry, de, describe, detail, do, done, down, due, during, each, eg, eight, either, eleven, else, elsewhere, empty, enough, etc, even, ever, every, everyone, everything, everywhere, except, few, fifteen, fify, fill, find, fire, first, five, for, former, formerly, forty, found, four, from, front, full, further, get, give, go, had, has, hasnt, have, he, hence, her, here, hereafter, hereby, herein, hereupon, hers, herself, him, himself, his, how, however, hundred, i, ie, if, in, inc, indeed, interest, into, is, it, its, itself, keep, last, latter, latterly, least, less, ltd, made, many, may, me, meanwhile, might, mill, mine, more, moreover, most, mostly, move, much, must, my, myself, name, namely, neither, never, nevertheless, next, nine, no, nobody, none, noone, nor, not, nothing, now, nowhere, of, off, often, on, once, one, only, onto, or, other, others, otherwise, our, ours, ourselves, out, over, own, part, per, perhaps, please, put, rather, re, same, see, seem, seemed, seeming, seems, serious, several, she, should, show, side, since, sincere, six, sixty, so, some, somehow, someone, something, sometime, sometimes, somewhere, still, such, system, take, ten, than, that, the, their, them, themselves, then, thence, there, thereafter, thereby, therefore, therein, thereupon, these, they, thick, thin, third, this, those, though, three, through, throughout, thru, thus, to, together, too, top, toward, towards, twelve, twenty, two, un, under, until, up, upon, us, very, via, was, we, 
well, were, what, whatever, when, whence, whenever, where, whereafter, whereas, whereby, wherein, whereupon, wherever, whether, which, while, whither, who, whoever, whole, whom, whose, why, will, with, within, without, would, yet, you, your, yours, yourself, yourselves)
stopwords.length // find the number of stopwords in the scala Array[String]
res35: Int = 319

Finally, we can just remove the stopwords using the StopWordsRemover as follows:

import org.apache.spark.ml.feature.StopWordsRemover

// Set params for StopWordsRemover
val remover = new StopWordsRemover()
  .setStopWords(stopwords) // This parameter is optional
  .setInputCol("tokens")
  .setOutputCol("filtered")

// Create new DF with Stopwords removed
val filtered_df = remover.transform(tokenized_df)
import org.apache.spark.ml.feature.StopWordsRemover remover: org.apache.spark.ml.feature.StopWordsRemover = stopWords_294e3228eba8 filtered_df: org.apache.spark.sql.DataFrame = [id: bigint, corpus: string ... 4 more fields]
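
To make the effect of this step concrete, here is a minimal plain-Scala sketch of stopword filtering on a single token array. It is only an illustration and not how Spark implements StopWordsRemover: the helper removeStopwords and the names sampleStopwords, sampleTokens and sampleFiltered are hypothetical, and the stopword set is a small hand-picked subset of the list loaded above.

```scala
// Plain-Scala illustration only; in the notebook, StopWordsRemover does this on DataFrames.
// A small hand-picked subset of the stopword list loaded above (assumption, for brevity).
val sampleStopwords: Set[String] = Set("from", "what", "were", "they", "with")

// Hypothetical helper: drop every token that appears in the stopword set.
// Tokens are compared in lower case; the tokens in this notebook are already lower-cased.
def removeStopwords(tokens: Seq[String], stop: Set[String]): Seq[String] =
  tokens.filterNot(t => stop.contains(t.toLowerCase))

// Tokens of one conversation, copied from the tokenized sample output shown earlier
val sampleTokens = Seq("from", "trubshaw", "shoemaker", "kipper", "herring",
  "what", "were", "they", "investigating")

val sampleFiltered = removeStopwords(sampleTokens, sampleStopwords)
// sampleFiltered: Seq(trubshaw, shoemaker, kipper, herring, investigating)
```

Only the content-bearing words survive, which is what we want the topic model to see.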

Step 5. Vector of Token Counts

LDA takes in a vector of token counts as input. We can use the CountVectorizer() to easily convert our text documents into vectors of token counts.

The CountVectorizer will return (VocabSize, Array(Indexed Tokens), Array(Token Frequency)).

Two handy parameters to note:

import org.apache.spark.ml.feature.CountVectorizer

// Set params for CountVectorizer
val vectorizer = new CountVectorizer()
  .setInputCol("filtered")
  .setOutputCol("features")
  .setVocabSize(10000)
  .setMinDF(5) // The minimum number of different documents a term must appear in to be included in the vocabulary
  .fit(filtered_df)
import org.apache.spark.ml.feature.CountVectorizer vectorizer: org.apache.spark.ml.feature.CountVectorizerModel = cntVec_48267a85f1b9
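
The interplay of setVocabSize and setMinDF can be sketched in plain Scala. The sketch assumes that the vocabulary keeps the most frequent terms across the corpus, after discarding terms that occur in fewer than minDF documents; buildVocab and the toy corpus toyDocs are hypothetical and are not part of Spark.

```scala
// Hypothetical helper, not Spark's implementation: pick a vocabulary from tokenized documents.
def buildVocab(docs: Seq[Seq[String]], vocabSize: Int, minDF: Int): Array[String] = {
  // Term frequency: total number of occurrences across the corpus
  val termFreq = docs.flatten.groupBy(identity).map { case (t, xs) => (t, xs.size) }
  // Document frequency: number of distinct documents containing the term
  val docFreq = docs.flatMap(_.distinct).groupBy(identity).map { case (t, xs) => (t, xs.size) }
  termFreq.toSeq
    .filter { case (t, _) => docFreq(t) >= minDF } // drop terms seen in too few documents
    .sortBy { case (t, n) => (-n, t) }             // most frequent first, ties broken alphabetically
    .take(vocabSize)                               // cap the vocabulary size
    .map(_._1)
    .toArray
}

// Toy corpus: "kipper" is frequent but occurs in a single document only
val toyDocs = Seq(
  Seq("money", "money", "bank", "kipper", "kipper", "kipper", "kipper", "kipper"),
  Seq("money", "boat"),
  Seq("bank", "boat", "boat", "boat")
)

val toyVocab = buildVocab(toyDocs, vocabSize = 2, minDF = 2)
// toyVocab: Array(boat, money)
```

Here "kipper" has the highest term frequency but is excluded because it appears in only one document, and "bank" is cut by the vocabulary size cap.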
// Create vector of token counts
val countVectors = vectorizer.transform(filtered_df).select("id", "features")
countVectors: org.apache.spark.sql.DataFrame = [id: bigint, features: vector]
// see the first countVectors
countVectors.take(1)
res38: Array[org.apache.spark.sql.Row] = Array([28762,(10000,[7,112,179,308],[1.0,1.0,1.0,1.0])])
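
The row printed above is a sparse vector: the vocabulary size, then the indices of the vocabulary terms present in the document, then their counts. As a rough plain-Scala sketch of that representation (toSparseCounts and the four-word vocabulary miniVocab are hypothetical and only illustrate the format):

```scala
// Hypothetical helper: encode a token list as (vocabSize, sorted term indices, counts).
// Tokens that are not in the vocabulary are silently ignored.
def toSparseCounts(tokens: Seq[String], vocab: Array[String]): (Int, Array[Int], Array[Double]) = {
  val index: Map[String, Int] = vocab.zipWithIndex.toMap
  val counts = tokens
    .collect { case t if index.contains(t) => index(t) } // token -> vocabulary index
    .groupBy(identity)
    .map { case (i, xs) => (i, xs.size.toDouble) }       // index -> number of occurrences
    .toSeq
    .sortBy(_._1)
  (vocab.length, counts.map(_._1).toArray, counts.map(_._2).toArray)
}

// Tiny vocabulary (assumption): the first four terms of the fitted vocabulary
val miniVocab = Array("know", "just", "like", "want")

val (vecSize, vecIndices, vecCounts) =
  toSparseCounts(Seq("just", "like", "just", "trunk"), miniVocab)
// vecSize = 4, vecIndices = Array(1, 2), vecCounts = Array(2.0, 1.0)
```

The out-of-vocabulary token "trunk" contributes nothing, just as terms dropped by the vocabulary size or minDF limits contribute nothing to the real feature vectors.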

To use the LDA algorithm in the MLlib library, we have to convert the DataFrame back into an RDD.

// Convert DF to RDD - ideally we should use ml for everything and not a mix of ml and mllib ; DAN
import org.apache.spark.ml.feature.{CountVectorizer, RegexTokenizer, StopWordsRemover}
import org.apache.spark.ml.linalg.{Vector => MLVector}
import org.apache.spark.mllib.clustering.{LDA, OnlineLDAOptimizer}
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.sql.{Row, SparkSession}

val lda_countVector = countVectors.map { case Row(id: Long, countVector: MLVector) => (id, Vectors.fromML(countVector)) }.rdd

import org.apache.spark.ml.feature.{CountVectorizer, RegexTokenizer, StopWordsRemover} import org.apache.spark.ml.linalg.{Vector=>MLVector} import org.apache.spark.mllib.clustering.{LDA, OnlineLDAOptimizer} import org.apache.spark.mllib.linalg.Vectors import org.apache.spark.sql.{Row, SparkSession} lda_countVector: org.apache.spark.rdd.RDD[(Long, org.apache.spark.mllib.linalg.Vector)] = MapPartitionsRDD[3912] at rdd at command-753740454082286:11
// format: Array(id, (VocabSize, Array(indexedTokens), Array(Token Frequency)))
lda_countVector.take(1)
res42: Array[(Long, org.apache.spark.mllib.linalg.Vector)] = Array((28762,(10000,[7,112,179,308],[1.0,1.0,1.0,1.0])))

Step 6. Create LDA model with Online Variational Bayes

We will now set the parameters for LDA. We will use the OnlineLDAOptimizer() here, which implements Online Variational Bayes.

Choosing the number of topics for your LDA model requires a bit of domain knowledge. As we do not know the number of "topics", we will set numTopics to be 20.

val numTopics = 20
numTopics: Int = 20

We will set the parameters needed to build our LDA model. We can also setMiniBatchFraction for the OnlineLDAOptimizer, which sets the fraction of corpus sampled and used at each iteration. In this example, we will set this to 0.8.

import org.apache.spark.mllib.clustering.{LDA, OnlineLDAOptimizer}

// Set LDA params
val lda = new LDA()
  .setOptimizer(new OnlineLDAOptimizer().setMiniBatchFraction(0.8))
  .setK(numTopics)
  .setMaxIterations(3)
  .setDocConcentration(-1) // use default values
  .setTopicConcentration(-1) // use default values
import org.apache.spark.mllib.clustering.{LDA, OnlineLDAOptimizer} lda: org.apache.spark.mllib.clustering.LDA = org.apache.spark.mllib.clustering.LDA@3c173c8
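
The mini-batch fraction and the number of iterations work together: the Spark documentation recommends choosing them so that maxIterations * miniBatchFraction >= 1, so that the whole corpus is expected to be used at least once. The following plain-Scala sketch checks this for the values above and imitates the sampling on a toy corpus. It is a simplification under stated assumptions: sampleMiniBatch, toyDocIds and the fixed seed are hypothetical, and Spark's own sampling strategy may differ.

```scala
import scala.util.Random

// Values used in this notebook
val miniBatchFraction = 0.8
val maxIterations = 3

// Expected number of passes over the corpus, summed over all iterations (about 2.4 here)
val expectedPasses = maxIterations * miniBatchFraction

// Hypothetical toy corpus of 10 document ids
val toyDocIds = (0 until 10).toVector

// Hypothetical helper: keep each document independently with probability `fraction`
def sampleMiniBatch(ids: Seq[Int], fraction: Double, rng: Random): Seq[Int] =
  ids.filter(_ => rng.nextDouble() < fraction)

// One mini-batch per iteration, drawn with a fixed seed for repeatability
val rng = new Random(123L)
val miniBatches = (1 to maxIterations).map(_ => sampleMiniBatch(toyDocIds, miniBatchFraction, rng))
```

With a fraction of 0.8 each mini-batch contains most of the toy corpus, so three iterations are already expected to cover every document more than twice.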

Create the LDA model with Online Variational Bayes.

val ldaModel = lda.run(lda_countVector)
ldaModel: org.apache.spark.mllib.clustering.LDAModel = org.apache.spark.mllib.clustering.LocalLDAModel@5bf00930

Watch Online Learning for Latent Dirichlet Allocation in NIPS2010 by Matt Hoffman (right click and open in new tab)

Also see the paper on Online Variational Bayes by Matt Hoffman, linked from the above URL, for more details: http://videolectures.net/site/normal_dl/tag=83534/nips2010_1291.pdf

Note that the OnlineLDAOptimizer returns a LocalLDAModel, which stores the inferred topics of your corpus.

Step 7. Review Topics

We can now review the results of our LDA model. We will print out all 20 topics with their corresponding term probabilities.

Note that you will get slightly different results every time you run an LDA model since LDA includes some randomization.

Let us review the results of the LDA model with Online Variational Bayes, step by step.

val topicIndices = ldaModel.describeTopics(maxTermsPerTopic = 5)
topicIndices: Array[(Array[Int], Array[Double])] = Array((Array(1, 2, 4, 49, 0),Array(0.0014102155338741765, 0.0012758924372910556, 0.0011214448310395873, 9.238914780871355E-4, 9.047647243869576E-4)), (Array(1, 6, 2, 0, 4),Array(0.0014443699497685366, 0.0012377629724506722, 0.0011714257476524842, 0.0010861657304183027, 8.604460434628813E-4)), (Array(1, 2, 8, 0, 3),Array(0.0014926060533533697, 0.0013429026076916017, 0.0013067364965238173, 0.0011607492289313303, 0.0011400804862230437)), (Array(5, 6, 4, 1, 7),Array(0.006717314446949222, 0.006002662754297925, 0.004488111770001314, 0.004408679383982238, 0.0042465917238892655)), (Array(0, 19, 3, 8, 6),Array(0.0050059173813691085, 0.0029731088780905225, 0.0022359962463711185, 0.002193246256785973, 0.0019111384839030116)), (Array(3, 0, 10, 1, 15),Array(0.003714410612506209, 0.0017122806517390608, 0.0017073041827440282, 0.0015712232707115927, 0.0012303967042097022)), (Array(0, 1, 6, 10, 2),Array(0.00467483294478972, 0.0038641828467113268, 0.003328578440542597, 0.002867941043688811, 0.002532629878316373)), (Array(0, 2, 9, 1, 13),Array(0.00960017865043255, 0.009308573745541343, 0.005704969701604644, 0.004085042285865179, 0.004031048471919761)), (Array(0, 4, 5, 77, 16),Array(0.004550808496981245, 0.004122146617438838, 0.0019092043643137734, 0.0018255598181846045, 0.001761167250972209)), (Array(6, 2, 5, 1, 0),Array(0.0016782125889211463, 0.0012427279906039904, 0.0012197157251243875, 0.0010635502545983016, 9.50137528050953E-4)), (Array(2, 1, 3, 0, 6),Array(0.003126597598330109, 0.0027451035751362273, 0.00228759303132256, 0.0017239166326848171, 0.0017047784964894794)), (Array(2, 1, 27, 4, 3),Array(0.004734133576359814, 0.004201386287998202, 0.0036983083453854372, 0.0025414887712607768, 0.002091795015523375)), (Array(0, 5, 1, 12, 2),Array(0.0035340054254694784, 0.002387182752907053, 0.0019263993964325303, 0.001843992584617911, 0.0018065489773133325)), (Array(2, 1, 5, 14, 0),Array(0.0016017017354850733, 0.0014834097260266685, 
0.0014300356385979168, 0.001294952229819751, 0.0012788947989035501)), (Array(7, 1, 10, 6, 2),Array(0.002043769246809558, 0.0013757478946969802, 0.0013208455540129331, 0.0012662647575091633, 0.0011549537488969965)), (Array(0, 1, 2, 3, 4),Array(0.022087503347588935, 0.01571524947937798, 0.012895996754133662, 0.01026452087962411, 0.009873743305368164)), (Array(0, 1, 3, 4, 9),Array(0.002204551343207476, 0.0016283414468010306, 0.0014214537687803855, 0.0012768751041210551, 0.0011525954268574248)), (Array(46, 1, 2, 16, 5),Array(0.0022031979750387655, 0.0020637622110226085, 0.0019281346187348387, 0.0015712770524161123, 0.0014183600893726285)), (Array(0, 2, 3, 5, 8),Array(0.0035729889283848504, 0.0024215014894025766, 0.0018740761967851508, 0.001838630576321126, 0.0016262171049684524)), (Array(3, 10, 30, 9, 4),Array(0.0018098267577494882, 0.0015864305565599366, 0.0015861983258874525, 0.001331260635860306, 0.0012793651558771885)))
val vocabList = vectorizer.vocabulary
vocabList: Array[String] = Array(know, just, like, want, think, right, going, good, yeah, tell, come, time, look, didn, mean, make, okay, really, little, sure, gonna, thing, people, said, maybe, need, sorry, love, talk, thought, doing, life, night, things, work, money, better, told, long, help, believe, years, shit, does, away, place, hell, doesn, great, home, feel, fuck, kind, remember, dead, course, wouldn, wait, kill, guess, understand, thank, girl, wrong, leave, listen, talking, real, stop, hear, nice, happened, fine, wanted, father, gotta, mind, fucking, house, wasn, getting, world, stay, mother, left, came, care, thanks, knew, room, trying, guys, went, looking, coming, heard, friend, haven, seen, best, tonight, live, used, matter, killed, pretty, business, idea, couldn, head, miss, says, wife, called, woman, morning, tomorrow, start, stuff, saying, play, hello, baby, hard, probably, minute, days, took, somebody, today, school, meet, gone, crazy, wants, damn, forget, cause, problem, deal, case, friends, point, hope, jesus, afraid, looks, knows, year, worry, exactly, aren, half, thinking, shut, hold, wanna, face, minutes, bring, read, word, doctor, everybody, makes, supposed, story, turn, true, watch, thousand, family, brother, kids, week, happen, fuckin, working, open, happy, lost, john, hurt, town, ready, alright, late, actually, married, gave, beautiful, soon, jack, times, sleep, door, having, hand, drink, easy, gets, chance, young, trouble, different, anybody, shot, rest, hate, death, second, later, asked, phone, wish, check, quite, change, police, walk, couple, question, close, taking, heart, hours, making, comes, anymore, truth, trust, dollars, important, captain, telling, funny, person, honey, goes, eyes, inside, reason, stand, break, means, number, tried, high, white, water, suppose, body, sick, game, excuse, party, women, country, waiting, christ, answer, office, send, pick, alive, sort, blood, black, daddy, line, husband, goddamn, book, fifty, thirty, 
fact, million, hands, died, power, stupid, started, shouldn, months, city, boys, dinner, sense, running, hour, shoot, drive, fight, speak, george, ship, living, figure, dear, street, ahead, lady, seven, free, feeling, scared, frank, able, children, outside, moment, safe, news, president, brought, write, happens, sent, bullshit, lose, light, glad, child, girls, sounds, sister, lives, promise, till, sound, weren, save, poor, cool, asking, shall, plan, bitch, king, daughter, weeks, beat, york, cold, worth, taken, harry, needs, piece, movie, fast, possible, small, goin, straight, human, hair, tired, food, company, lucky, pull, wonderful, touch, state, looked, thinks, picture, leaving, words, control, clear, known, special, buddy, luck, order, follow, expect, mary, catch, mouth, worked, mister, learn, playing, perfect, dream, calling, questions, hospital, takes, ride, coffee, miles, parents, works, secret, explain, hotel, worse, kidding, past, outta, general, unless, felt, drop, throw, interested, hang, certainly, absolutely, earth, loved, wonder, dark, accident, seeing, simple, turned, doin, clock, date, sweet, meeting, clean, sign, feet, handle, army, music, giving, report, cops, fucked, charlie, information, smart, yesterday, fall, fault, class, bank, month, blow, major, caught, swear, paul, road, talked, choice, boss, plane, david, paid, wear, american, worried, clothes, paper, goodbye, lord, ones, strange, terrible, mistake, given, hurry, blue, finish, murder, kept, apartment, sell, middle, nothin, hasn, careful, meant, walter, moving, changed, imagine, fair, difference, quiet, happening, near, quit, personal, marry, future, figured, rose, agent, kinda, michael, building, mama, early, private, trip, watching, busy, record, certain, jimmy, broke, longer, sake, store, stick, finally, boat, born, sitting, evening, bucks, chief, history, ought, lying, kiss, honor, lunch, darling, favor, fool, uncle, respect, rich, liked, killing, land, peter, tough, interesting, brain, 
problems, nick, welcome, completely, dick, honest, wake, radio, cash, dude, dance, james, bout, floor, weird, court, calls, jail, window, involved, drunk, johnny, officer, needed, asshole, books, spend, situation, relax, pain, service, dangerous, grand, security, letter, stopped, realize, table, offer, bastard, message, instead, killer, jake, nervous, deep, pass, somethin, evil, english, bought, short, ring, step, picked, likes, voice, eddie, machine, lived, upset, forgot, carry, afternoon, fear, quick, finished, count, forgive, wrote, named, decided, totally, space, team, doubt, pleasure, lawyer, suit, station, gotten, bother, prove, return, pictures, slow, bunch, strong, wearing, driving, list, join, christmas, tape, attack, church, appreciate, force, hungry, standing, college, dying, present, charge, prison, missing, truck, public, board, calm, staying, gold, ball, hardly, hadn, lead, missed, island, government, horse, cover, reach, french, joke, star, fish, mike, moved, america, surprise, soul, seconds, club, self, movies, putting, dress, cost, listening, lots, price, saved, smell, mark, peace, gives, crime, dreams, entire, single, usually, department, beer, holy, west, wall, stuck, nose, protect, ways, teach, awful, forever, type, grow, train, detective, billy, rock, planet, walking, beginning, dumb, papers, folks, park, attention, hide, card, birthday, reading, test, share, master, lieutenant, starting, field, partner, twice, enjoy, dollar, blame, film, mess, bomb, round, girlfriend, south, loves, plenty, using, gentlemen, especially, records, evidence, experience, silly, admit, normal, fired, talkin, lock, louis, fighting, mission, notice, memory, promised, crap, wedding, orders, ground, guns, glass, marriage, idiot, heaven, impossible, knock, green, wondering, spent, animal, hole, neck, drugs, press, nuts, names, broken, position, asleep, jerry, visit, boyfriend, acting, plans, feels, tells, paris, smoke, wind, sheriff, cross, holding, gimme, mention, 
walked, judge, code, double, brothers, writing, pardon, keeps, fellow, fell, closed, angry, lovely, cute, surprised, percent, charles, correct, agree, bathroom, address, andy, ridiculous, summer, tommy, rules, note, account, group, sleeping, learned, sing, pulled, colonel, proud, laugh, river, area, upstairs, jump, built, difficult, breakfast, bobby, bridge, dirty, betty, amazing, locked, north, definitely, alex, feelings, plus, worst, accept, kick, file, wild, seriously, grace, stories, steal, gettin, nature, advice, relationship, contact, waste, places, spot, beach, stole, apart, favorite, knowing, level, song, faith, risk, loose, patient, foot, eating, played, action, witness, washington, turns, build, obviously, begin, split, games, command, crew, decide, nurse, keeping, tight, bird, form, runs, copy, arrest, complete, scene, consider, jeffrey, insane, taste, teeth, shoes, monster, devil, henry, career, sooner, innocent, hall, showed, gift, weekend, heavy, study, greatest, comin, danger, keys, raise, destroy, track, carl, california, concerned, bruce, program, blind, suddenly, hanging, apologize, seventy, chicken, medical, forward, drinking, sweetheart, willing, guard, legs, admiral, shop, professor, suspect, tree, camp, data, ticket, goodnight, possibly, dunno, burn, paying, television, trick, murdered, losing, senator, credit, extra, dropped, sold, warm, meaning, stone, starts, hiding, lately, cheap, marty, taught, science, lookin, simply, majesty, harold, corner, jeff, queen, following, duty, training, seat, heads, cars, discuss, bear, noticed, enemy, helped, screw, richard, flight)
val topics = topicIndices.map { case (terms, termWeights) =>
  terms.map(vocabList(_)).zip(termWeights)
}
topics: Array[Array[(String, Double)]] = Array(Array((just,0.0014102155338741765), (like,0.0012758924372910556), (think,0.0011214448310395873), (home,9.238914780871355E-4), (know,9.047647243869576E-4)), Array((just,0.0014443699497685366), (going,0.0012377629724506722), (like,0.0011714257476524842), (know,0.0010861657304183027), (think,8.604460434628813E-4)), Array((just,0.0014926060533533697), (like,0.0013429026076916017), (yeah,0.0013067364965238173), (know,0.0011607492289313303), (want,0.0011400804862230437)), Array((right,0.006717314446949222), (going,0.006002662754297925), (think,0.004488111770001314), (just,0.004408679383982238), (good,0.0042465917238892655)), Array((know,0.0050059173813691085), (sure,0.0029731088780905225), (want,0.0022359962463711185), (yeah,0.002193246256785973), (going,0.0019111384839030116)), Array((want,0.003714410612506209), (know,0.0017122806517390608), (come,0.0017073041827440282), (just,0.0015712232707115927), (make,0.0012303967042097022)), Array((know,0.00467483294478972), (just,0.0038641828467113268), (going,0.003328578440542597), (come,0.002867941043688811), (like,0.002532629878316373)), Array((know,0.00960017865043255), (like,0.009308573745541343), (tell,0.005704969701604644), (just,0.004085042285865179), (didn,0.004031048471919761)), Array((know,0.004550808496981245), (think,0.004122146617438838), (right,0.0019092043643137734), (fucking,0.0018255598181846045), (okay,0.001761167250972209)), Array((going,0.0016782125889211463), (like,0.0012427279906039904), (right,0.0012197157251243875), (just,0.0010635502545983016), (know,9.50137528050953E-4)), Array((like,0.003126597598330109), (just,0.0027451035751362273), (want,0.00228759303132256), (know,0.0017239166326848171), (going,0.0017047784964894794)), Array((like,0.004734133576359814), (just,0.004201386287998202), (love,0.0036983083453854372), (think,0.0025414887712607768), (want,0.002091795015523375)), Array((know,0.0035340054254694784), (right,0.002387182752907053), 
(just,0.0019263993964325303), (look,0.001843992584617911), (like,0.0018065489773133325)), Array((like,0.0016017017354850733), (just,0.0014834097260266685), (right,0.0014300356385979168), (mean,0.001294952229819751), (know,0.0012788947989035501)), Array((good,0.002043769246809558), (just,0.0013757478946969802), (come,0.0013208455540129331), (going,0.0012662647575091633), (like,0.0011549537488969965)), Array((know,0.022087503347588935), (just,0.01571524947937798), (like,0.012895996754133662), (want,0.01026452087962411), (think,0.009873743305368164)), Array((know,0.002204551343207476), (just,0.0016283414468010306), (want,0.0014214537687803855), (think,0.0012768751041210551), (tell,0.0011525954268574248)), Array((hell,0.0022031979750387655), (just,0.0020637622110226085), (like,0.0019281346187348387), (okay,0.0015712770524161123), (right,0.0014183600893726285)), Array((know,0.0035729889283848504), (like,0.0024215014894025766), (want,0.0018740761967851508), (right,0.001838630576321126), (yeah,0.0016262171049684524)), Array((want,0.0018098267577494882), (come,0.0015864305565599366), (doing,0.0015861983258874525), (tell,0.001331260635860306), (think,0.0012793651558771885)))

Feel free to take things apart to understand!

topicIndices(0)
res43: (Array[Int], Array[Double]) = (Array(1, 2, 4, 49, 0),Array(0.0014102155338741765, 0.0012758924372910556, 0.0011214448310395873, 9.238914780871355E-4, 9.047647243869576E-4))
topicIndices(0)._1
res44: Array[Int] = Array(1, 2, 4, 49, 0)
topicIndices(0)._1(0)
res45: Int = 1
vocabList(topicIndices(0)._1(0))
res46: String = just
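
To practise taking the pieces apart without a fitted model, here is a small plain-Scala sketch. The vocabulary and the topic entry are toy stand-ins (hypothetical values), shaped like `vocabList` and one element of `topicIndices`.

```scala
// Toy stand-ins (hypothetical): a five-word vocabulary and one topic entry
// shaped like an element of the describeTopics result.
val toyVocab: Array[String] = Array("know", "just", "like", "want", "think")
val toyTopic: (Array[Int], Array[Double]) =
  (Array(1, 2, 4), Array(0.0014, 0.0013, 0.0011))

// Destructure the pair, look up each term index in the vocabulary,
// and pair every word with its weight.
val (toyTermIds, toyWeights) = toyTopic
val toyLabelled: Array[(String, Double)] = toyTermIds.map(toyVocab(_)).zip(toyWeights)

toyLabelled.foreach { case (term, weight) => println(s"$term\t$weight") }
```

This is the same `map` followed by `zip` that the `topics` cell applies to every topic at once.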

Review Results of LDA model with Online Variational Bayes - doing all four of the earlier steps at once.

val topicIndices = ldaModel.describeTopics(maxTermsPerTopic = 5)
val vocabList = vectorizer.vocabulary
val topics = topicIndices.map { case (terms, termWeights) =>
  terms.map(vocabList(_)).zip(termWeights)
}
println(s"$numTopics topics:")
topics.zipWithIndex.foreach { case (topic, i) =>
  println(s"TOPIC $i")
  topic.foreach { case (term, weight) => println(s"$term\t$weight") }
  println(s"==========")
}
20 topics:
TOPIC 0
just 0.0014102155338741765
like 0.0012758924372910556
think 0.0011214448310395873
home 9.238914780871355E-4
know 9.047647243869576E-4
==========
TOPIC 1
just 0.0014443699497685366
going 0.0012377629724506722
like 0.0011714257476524842
know 0.0010861657304183027
think 8.604460434628813E-4
==========
TOPIC 2
just 0.0014926060533533697
like 0.0013429026076916017
yeah 0.0013067364965238173
know 0.0011607492289313303
want 0.0011400804862230437
==========
TOPIC 3
right 0.006717314446949222
going 0.006002662754297925
think 0.004488111770001314
just 0.004408679383982238
good 0.0042465917238892655
==========
TOPIC 4
know 0.0050059173813691085
sure 0.0029731088780905225
want 0.0022359962463711185
yeah 0.002193246256785973
going 0.0019111384839030116
==========
TOPIC 5
want 0.003714410612506209
know 0.0017122806517390608
come 0.0017073041827440282
just 0.0015712232707115927
make 0.0012303967042097022
==========
TOPIC 6
know 0.00467483294478972
just 0.0038641828467113268
going 0.003328578440542597
come 0.002867941043688811
like 0.002532629878316373
==========
TOPIC 7
know 0.00960017865043255
like 0.009308573745541343
tell 0.005704969701604644
just 0.004085042285865179
didn 0.004031048471919761
==========
TOPIC 8
know 0.004550808496981245
think 0.004122146617438838
right 0.0019092043643137734
fucking 0.0018255598181846045
okay 0.001761167250972209
==========
TOPIC 9
going 0.0016782125889211463
like 0.0012427279906039904
right 0.0012197157251243875
just 0.0010635502545983016
know 9.50137528050953E-4
==========
TOPIC 10
like 0.003126597598330109
just 0.0027451035751362273
want 0.00228759303132256
know 0.0017239166326848171
going 0.0017047784964894794
==========
TOPIC 11
like 0.004734133576359814
just 0.004201386287998202
love 0.0036983083453854372
think 0.0025414887712607768
want 0.002091795015523375
==========
TOPIC 12
know 0.0035340054254694784
right 0.002387182752907053
just 0.0019263993964325303
look 0.001843992584617911
like 0.0018065489773133325
==========
TOPIC 13
like 0.0016017017354850733
just 0.0014834097260266685
right 0.0014300356385979168
mean 0.001294952229819751
know 0.0012788947989035501
==========
TOPIC 14
good 0.002043769246809558
just 0.0013757478946969802
come 0.0013208455540129331
going 0.0012662647575091633
like 0.0011549537488969965
==========
TOPIC 15
know 0.022087503347588935
just 0.01571524947937798
like 0.012895996754133662
want 0.01026452087962411
think 0.009873743305368164
==========
TOPIC 16
know 0.002204551343207476
just 0.0016283414468010306
want 0.0014214537687803855
think 0.0012768751041210551
tell 0.0011525954268574248
==========
TOPIC 17
hell 0.0022031979750387655
just 0.0020637622110226085
like 0.0019281346187348387
okay 0.0015712770524161123
right 0.0014183600893726285
==========
TOPIC 18
know 0.0035729889283848504
like 0.0024215014894025766
want 0.0018740761967851508
right 0.001838630576321126
yeah 0.0016262171049684524
==========
TOPIC 19
want 0.0018098267577494882
come 0.0015864305565599366
doing 0.0015861983258874525
tell 0.001331260635860306
think 0.0012793651558771885
==========
topicIndices: Array[(Array[Int], Array[Double])] = Array((Array(1, 2, 4, 49, 0),Array(0.0014102155338741765, 0.0012758924372910556, 0.0011214448310395873, 9.238914780871355E-4, 9.047647243869576E-4)), (Array(1, 6, 2, 0, 4),Array(0.0014443699497685366, 0.0012377629724506722, 0.0011714257476524842, 0.0010861657304183027, 8.604460434628813E-4)), (Array(1, 2, 8, 0, 3),Array(0.0014926060533533697, 0.0013429026076916017, 0.0013067364965238173, 0.0011607492289313303, 0.0011400804862230437)), (Array(5, 6, 4, 1, 7),Array(0.006717314446949222, 0.006002662754297925, 0.004488111770001314, 0.004408679383982238, 0.0042465917238892655)), (Array(0, 19, 3, 8, 6),Array(0.0050059173813691085, 0.0029731088780905225, 0.0022359962463711185, 0.002193246256785973, 0.0019111384839030116)), (Array(3, 0, 10, 1, 15),Array(0.003714410612506209, 0.0017122806517390608, 0.0017073041827440282, 0.0015712232707115927, 0.0012303967042097022)), (Array(0, 1, 6, 10, 2),Array(0.00467483294478972, 0.0038641828467113268, 0.003328578440542597, 0.002867941043688811, 0.002532629878316373)), (Array(0, 2, 9, 1, 13),Array(0.00960017865043255, 0.009308573745541343, 0.005704969701604644, 0.004085042285865179, 0.004031048471919761)), (Array(0, 4, 5, 77, 16),Array(0.004550808496981245, 0.004122146617438838, 0.0019092043643137734, 0.0018255598181846045, 0.001761167250972209)), (Array(6, 2, 5, 1, 0),Array(0.0016782125889211463, 0.0012427279906039904, 0.0012197157251243875, 0.0010635502545983016, 9.50137528050953E-4)), (Array(2, 1, 3, 0, 6),Array(0.003126597598330109, 0.0027451035751362273, 0.00228759303132256, 0.0017239166326848171, 0.0017047784964894794)), (Array(2, 1, 27, 4, 3),Array(0.004734133576359814, 0.004201386287998202, 0.0036983083453854372, 0.0025414887712607768, 0.002091795015523375)), (Array(0, 5, 1, 12, 2),Array(0.0035340054254694784, 0.002387182752907053, 0.0019263993964325303, 0.001843992584617911, 0.0018065489773133325)), (Array(2, 1, 5, 14, 0),Array(0.0016017017354850733, 0.0014834097260266685, 0.0014300356385979168, 0.001294952229819751, 0.0012788947989035501)), (Array(7, 1, 10, 6, 2),Array(0.002043769246809558, 0.0013757478946969802, 0.0013208455540129331, 0.0012662647575091633, 0.0011549537488969965)), (Array(0, 1, 2, 3, 4),Array(0.022087503347588935, 0.01571524947937798, 0.012895996754133662, 0.01026452087962411, 0.009873743305368164)), (Array(0, 1, 3, 4, 9),Array(0.002204551343207476, 0.0016283414468010306, 0.0014214537687803855, 0.0012768751041210551, 0.0011525954268574248)), (Array(46, 1, 2, 16, 5),Array(0.0022031979750387655, 0.0020637622110226085, 0.0019281346187348387, 0.0015712770524161123, 0.0014183600893726285)), (Array(0, 2, 3, 5, 8),Array(0.0035729889283848504, 0.0024215014894025766, 0.0018740761967851508, 0.001838630576321126, 0.0016262171049684524)), (Array(3, 10, 30, 9, 4),Array(0.0018098267577494882, 0.0015864305565599366, 0.0015861983258874525, 0.001331260635860306, 0.0012793651558771885)))

Going through the results, you may notice that some of the topic words returned are actually stopwords that are specific to our dataset (e.g. "know", "just", "like", which show up among the top terms of nearly every topic). Let's try improving our model.
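
One rough way to spot such words is to count how many topics list a given term among their top terms. The sketch below is plain Scala (no Spark needed) and uses the top-5 terms of topics 0 to 2 from the output above. The variable names are illustrative, and "present in every sampled topic" is only an assumed heuristic, not a rule from the LDA literature.

```scala
// Top-5 terms of topics 0, 1 and 2, copied from the printed output above.
val sampleTopics: Seq[Seq[String]] = Seq(
  Seq("just", "like", "think", "home", "know"),
  Seq("just", "going", "like", "know", "think"),
  Seq("just", "like", "yeah", "know", "want")
)

// For each term, count the number of topics whose top terms include it.
val topicCounts: Map[String, Int] =
  sampleTopics.flatMap(_.distinct).groupBy(identity).map { case (t, occ) => (t, occ.size) }

// Terms present in every sampled topic are candidates for the stopword list.
val stopwordCandidates: Seq[String] =
  topicCounts.collect { case (t, n) if n == sampleTopics.size => t }.toSeq.sorted

println(stopwordCandidates.mkString(", ")) // prints: just, know, like
```

Terms that are this evenly spread across topics carry little information for telling the topics apart, which is why they are reasonable candidates to add to `add_stopwords`.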

Step 8. Model Tuning - Refilter Stopwords

We will try to improve the results of our model by identifying some stopwords that are specific to our dataset. We will filter these stopwords out and rerun our LDA model to see if we get better results.

val add_stopwords = Array("whatever") // add more stop-words like the name of your company!
add_stopwords: Array[String] = Array(whatever)
// Combine newly identified stopwords with our existing list of stopwords
val new_stopwords = stopwords.union(add_stopwords)
new_stopwords: Array[String] = Array(a, about, above, across, after, afterwards, again, against, all, almost, alone, along, already, also, although, always, am, among, amongst, amoungst, amount, an, and, another, any, anyhow, anyone, anything, anyway, anywhere, are, around, as, at, back, be, became, because, become, becomes, becoming, been, before, beforehand, behind, being, below, beside, besides, between, beyond, bill, both, bottom, but, by, call, can, cannot, cant, co, computer, con, could, couldnt, cry, de, describe, detail, do, done, down, due, during, each, eg, eight, either, eleven, else, elsewhere, empty, enough, etc, even, ever, every, everyone, everything, everywhere, except, few, fifteen, fify, fill, find, fire, first, five, for, former, formerly, forty, found, four, from, front, full, further, get, give, go, had, has, hasnt, have, he, hence, her, here, hereafter, hereby, herein, hereupon, hers, herself, him, himself, his, how, however, hundred, i, ie, if, in, inc, indeed, interest, into, is, it, its, itself, keep, last, latter, latterly, least, less, ltd, made, many, may, me, meanwhile, might, mill, mine, more, moreover, most, mostly, move, much, must, my, myself, name, namely, neither, never, nevertheless, next, nine, no, nobody, none, noone, nor, not, nothing, now, nowhere, of, off, often, on, once, one, only, onto, or, other, others, otherwise, our, ours, ourselves, out, over, own, part, per, perhaps, please, put, rather, re, same, see, seem, seemed, seeming, seems, serious, several, she, should, show, side, since, sincere, six, sixty, so, some, somehow, someone, something, sometime, sometimes, somewhere, still, such, system, take, ten, than, that, the, their, them, themselves, then, thence, there, thereafter, thereby, therefore, therein, thereupon, these, they, thick, thin, third, this, those, though, three, through, throughout, thru, thus, to, together, too, top, toward, towards, twelve, twenty, two, un, under, until, up, upon, us, very, via, was, 
we, well, were, what, whatever, when, whence, whenever, where, whereafter, whereas, whereby, wherein, whereupon, wherever, whether, which, while, whither, who, whoever, whole, whom, whose, why, will, with, within, without, would, yet, you, your, yours, yourself, yourselves, whatever)
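
Note that "whatever" appears twice in the combined list above: it was already in the original stopword list, and combining the two arrays does not remove duplicates. A duplicate is unlikely to matter for filtering, but the plain-Scala sketch below shows how to avoid it. It uses hypothetical miniature lists, and `++` for concatenation.

```scala
// Hypothetical miniature stopword lists; "whatever" occurs in both.
val baseStopwords: Array[String] = Array("what", "whatever", "when")
val extraStopwords: Array[String] = Array("whatever", "yeah")

// Concatenation keeps every element, so "whatever" ends up in the result twice.
val combinedStopwords: Array[String] = baseStopwords ++ extraStopwords

// distinct keeps the first occurrence of each word and preserves the order.
val dedupedStopwords: Array[String] = combinedStopwords.distinct

println(combinedStopwords.mkString(", ")) // prints: what, whatever, when, whatever, yeah
println(dedupedStopwords.mkString(", "))  // prints: what, whatever, when, yeah
```

To see an actual change in the model, add words that are not already in the list.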
import org.apache.spark.ml.feature.StopWordsRemover

// Set Params for StopWordsRemover with new_stopwords
val remover = new StopWordsRemover()
.setStopWords(new_stopwords)
.setInputCol("tokens")
.setOutputCol("filtered")

// Create new df with new list of stopwords removed
val new_filtered_df = remover.transform(tokenized_df)
import org.apache.spark.ml.feature.StopWordsRemover
remover: org.apache.spark.ml.feature.StopWordsRemover = stopWords_3d7dc1a9b2ef
new_filtered_df: org.apache.spark.sql.DataFrame = [id: bigint, corpus: string ... 4 more fields]
// Set Params for CountVectorizer
val vectorizer = new CountVectorizer()
.setInputCol("filtered")
.setOutputCol("features")
.setVocabSize(10000)
.setMinDF(5)
.fit(new_filtered_df)

// Create new df of countVectors
val new_countVectors = vectorizer.transform(new_filtered_df).select("id", "features")
vectorizer: org.apache.spark.ml.feature.CountVectorizerModel = cntVec_2fcb7a8b0dc8
new_countVectors: org.apache.spark.sql.DataFrame = [id: bigint, features: vector]
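
To build intuition for what `setVocabSize` and `setMinDF` do, here is a simplified plain-Scala imitation of count vectorization on toy data. It is a sketch under assumptions, not Spark's implementation: the documents, the tie-breaking rule and all names are hypothetical. It keeps terms that occur in at least `toyMinDF` documents, ranks them by corpus-wide frequency, and caps the vocabulary at `toyVocabSize`.

```scala
// Hypothetical tokenised documents (stand-ins for the "filtered" column).
val toyDocs: Seq[Seq[String]] = Seq(
  Seq("know", "just", "know"),
  Seq("know", "love"),
  Seq("just", "captain")
)
val toyMinDF = 2     // keep terms that occur in at least this many documents
val toyVocabSize = 2 // upper bound on the vocabulary size

// Document frequency: in how many documents does each term occur?
val docFreq: Map[String, Int] =
  toyDocs.flatMap(_.distinct).groupBy(identity).map { case (t, xs) => (t, xs.size) }

// Corpus-wide term frequency, used to rank the surviving terms.
val termFreq: Map[String, Int] =
  toyDocs.flatten.groupBy(identity).map { case (t, xs) => (t, xs.size) }

// Filter by document frequency, sort by descending term frequency
// (ties broken alphabetically, an arbitrary choice), then truncate.
val toyVocabulary: Seq[String] = termFreq.toSeq
  .filter { case (t, _) => docFreq(t) >= toyMinDF }
  .sortBy { case (t, n) => (-n, t) }
  .take(toyVocabSize)
  .map(_._1)

// Each document becomes a vector of counts indexed by vocabulary position.
val toyCountVectors: Seq[Seq[Int]] =
  toyDocs.map(doc => toyVocabulary.map(term => doc.count(_ == term)))

println(toyVocabulary)   // List(know, just)
println(toyCountVectors) // List(List(2, 1), List(1, 0), List(0, 1))
```

The index of a word in the vocabulary is what `describeTopics` later returns as a term index, which is why `vocabList` is needed to turn topics back into words.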
// Convert DF to RDD
val new_lda_countVector = new_countVectors.map { case Row(id: Long, countVector: MLVector) => (id, Vectors.fromML(countVector)) }.rdd
new_lda_countVector: org.apache.spark.rdd.RDD[(Long, org.apache.spark.mllib.linalg.Vector)] = MapPartitionsRDD[3955] at rdd at command-753740454082314:2

We will also increase MaxIterations to 10 to see if we get better results.

// Set LDA parameters
val new_lda = new LDA()
.setOptimizer(new OnlineLDAOptimizer().setMiniBatchFraction(0.8))
.setK(numTopics)
.setMaxIterations(10)
.setDocConcentration(-1) // use default values
.setTopicConcentration(-1) // use default values
new_lda: org.apache.spark.mllib.clustering.LDA = org.apache.spark.mllib.clustering.LDA@5fca2e4f
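
The Spark scaladoc for `OnlineLDAOptimizer.setMiniBatchFraction` recommends adjusting it together with `setMaxIterations` so that `maxIterations * miniBatchFraction >= 1`, which means the whole corpus is expected to be sampled at least once. The plain-Scala check below applies that rule of thumb to the settings used here. The variable names are illustrative and no Spark is needed.

```scala
// Settings used for new_lda above.
val miniBatchFraction = 0.8
val maxIterations = 10

// Expected number of times the optimizer sees the whole corpus:
// each iteration samples roughly miniBatchFraction of the documents.
val expectedCorpusPasses: Double = maxIterations * miniBatchFraction

// Rule of thumb: the product should be at least 1.
val coversCorpus: Boolean = expectedCorpusPasses >= 1.0

println(s"expected passes over the corpus: $expectedCorpusPasses, covers corpus: $coversCorpus")
```

With these values the corpus is expected to be seen about eight times, so the rule of thumb is comfortably satisfied.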

How do we find out what the default values are?

Dive into the source!!!

  1. Let's find the default value for docConcentration now.
  2. Go to the Apache Spark package root: https://spark.apache.org/docs/latest/api/scala/#package
  3. Search for 'ml' in the search box at the top left ('ml' is for the machine learning library).
  4. Then find LDA by scrolling down on the left to mllib's clustering methods, and click on LDA.
  5. Then click on the source code link, which should take you here:

    /**
     * Concentration parameter (commonly named "alpha") for the prior placed on documents'
     * distributions over topics ("theta").
     *
     * This is the parameter to a Dirichlet distribution, where larger values mean more smoothing
     * (more regularization).
     *
     * If not set by the user, then docConcentration is set automatically. If set to
     * singleton vector [alpha], then alpha is replicated to a vector of length k in fitting.
     * Otherwise, the [[docConcentration]] vector must be length k.
     * (default = automatic)
     *
     * Optimizer-specific parameter settings:
     *  - EM
     *     - Currently only supports symmetric distributions, so all values in the vector should be
     *       the same.
     *     - Values should be > 1.0
     *     - default = uniformly (50 / k) + 1, where 50/k is common in LDA libraries and +1 follows
     *       from Asuncion et al. (2009), who recommend a +1 adjustment for EM.
     *  - Online
     *     - Values should be >= 0
     *     - default = uniformly (1.0 / k), following the implementation from
     *       [[https://github.com/Blei-Lab/onlineldavb]].
     * @group param
     */
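
The two defaults quoted in the comment above can be written as plain functions of k. This is a sketch of the formulas as stated in that comment, not a copy of Spark's code; the function names are made up for illustration, and k = 20 is taken from the "20 topics:" output in this notebook.

```scala
// Default docConcentration (alpha) per optimizer, as stated in the scaladoc above.
def defaultAlphaEM(k: Int): Double = 50.0 / k + 1.0 // EM: (50 / k) + 1
def defaultAlphaOnline(k: Int): Double = 1.0 / k    // Online: 1.0 / k

val k = 20 // number of topics used in this notebook

println(defaultAlphaEM(k))     // prints: 3.5
println(defaultAlphaOnline(k)) // prints: 0.05
```

So passing -1 with the online optimizer and 20 topics should give each topic a document concentration of 0.05.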
    

HOMEWORK: Try to find the default value for TopicConcentration.

// Create LDA model with stopwords refiltered
val new_ldaModel = new_lda.run(new_lda_countVector)
new_ldaModel: org.apache.spark.mllib.clustering.LDAModel = org.apache.spark.mllib.clustering.LocalLDAModel@3f1301a7
val topicIndices = new_ldaModel.describeTopics(maxTermsPerTopic = 5)
val vocabList = vectorizer.vocabulary
val topics = topicIndices.map { case (terms, termWeights) =>
  terms.map(vocabList(_)).zip(termWeights)
}
println(s"$numTopics topics:")
topics.zipWithIndex.foreach { case (topic, i) =>
  println(s"TOPIC $i")
  topic.foreach { case (term, weight) => println(s"$term\t$weight") }
  println(s"==========")
}
20 topics:
TOPIC 0
right 0.002368539995607174
love 0.0019026093436816463
just 0.001739005396343051
okay 0.001493567868602809
know 0.0011919944841106388
==========
TOPIC 1
like 0.012255569993473736
just 0.007532527227834193
come 0.007114873840600518
know 0.006960825682483897
think 0.006460380113586568
==========
TOPIC 2
know 0.017593342399778864
yeah 0.01729763439457538
gonna 0.014297985209693677
just 0.009395640487800467
tell 0.007112117826339655
==========
TOPIC 3
just 0.002310885836348927
know 0.0020049203493508585
better 0.001839601963450054
like 0.0016545385663972387
right 0.001505081787498549
==========
TOPIC 4
know 0.012396058201765845
didn 0.004786910731106122
like 0.004783067030382327
right 0.003733205551673614
just 0.0028592628116592403
==========
TOPIC 5
just 0.0028236500929191208
know 0.0026011344347436015
going 0.0015951009390631876
didn 0.001385667983895007
wait 0.001275555813151892
==========
TOPIC 6
going 0.00275337137203844
right 0.001685679960504387
just 0.0015380845174617235
know 0.0014818062892167352
captain 0.0013896743515293423
==========
TOPIC 7
going 0.011956735401221285
just 0.006541063462593452
know 0.005428932374204778
think 0.004308569608730405
believe 0.003696595226603709
==========
TOPIC 8
think 0.0019959039820595533
sorry 0.00198077299794292
know 0.0016723315231586236
shit 0.0015606901977245095
right 0.0013015271817698212
==========
TOPIC 9
know 0.003615862714921936
said 0.001961114693915351
sorry 0.0018595382287745752
like 0.0017819242854891695
think 0.0016468683030306027
==========
TOPIC 10
time 0.008784671423019166
want 0.00282365356227211
sure 0.0024833597381016476
know 0.0019777615447230884
right 0.0016576304456760946
==========
TOPIC 11
just 0.0021068918389201634
like 0.0020497480766035994
know 0.002022347553873645
want 0.0019500819941038825
said 0.001503771370040063
==========
TOPIC 12
look 0.00433587608823225
think 0.0025833796049907604
know 0.002007970741805987
going 0.0016840410422251017
just 0.0010661551551733228
==========
TOPIC 13
know 0.0020279945673448915
come 0.0019980250335794405
think 0.0012733121858788797
going 0.001192108885417234
okay 0.001186180285931844
==========
TOPIC 14
like 0.004262090436242644
right 0.0021537790725358777
just 0.0013683197398457016
know 0.0010911699327713488
look 0.0010869000557749361
==========
TOPIC 15
come 0.004769396664496132
know 0.0026229974920448534
like 0.0021612642420959253
just 0.0013228057897488347
right 0.001171812635848879
==========
TOPIC 16
know 0.025323543461007635
just 0.018361261941348715
like 0.01574431601713426
want 0.014855701536091734
think 0.011957607420818889
==========
TOPIC 17
like 0.004346004035796333
know 0.0022903208899377127
just 0.002008680613491114
little 0.0019547134832950414
maybe 0.0017287784612649724
==========
TOPIC 18
know 0.003217184151682409
think 0.003063734585623867
just 0.0018328245079520728
want 0.0017709019452594528
like 0.0016903614729120188
==========
TOPIC 19
hello 0.008911727886543675
stop 0.0025143616929346174
just 0.0023958078165974795
like 0.00184251815055585
come 0.0018199130672157007
==========
topicIndices: Array[(Array[Int], Array[Double])] = Array((Array(5, 27, 1, 16, 0),Array(0.002368539995607174, 0.0019026093436816463, 0.001739005396343051, 0.001493567868602809, 0.0011919944841106388)), (Array(2, 1, 10, 0, 4),Array(0.012255569993473736, 0.007532527227834193, 0.007114873840600518, 0.006960825682483897, 0.006460380113586568)), (Array(0, 8, 20, 1, 9),Array(0.017593342399778864, 0.01729763439457538, 0.014297985209693677, 0.009395640487800467, 0.007112117826339655)), (Array(1, 0, 36, 2, 5),Array(0.002310885836348927, 0.0020049203493508585, 0.001839601963450054, 0.0016545385663972387, 0.001505081787498549)), (Array(0, 13, 2, 5, 1),Array(0.012396058201765845, 0.004786910731106122, 0.004783067030382327, 0.003733205551673614, 0.0028592628116592403)), (Array(1, 0, 6, 13, 57),Array(0.0028236500929191208, 0.0026011344347436015, 0.0015951009390631876, 0.001385667983895007, 0.001275555813151892)), (Array(6, 
5, 1, 0, 233),Array(0.00275337137203844, 0.001685679960504387, 0.0015380845174617235, 0.0014818062892167352, 0.0013896743515293423)), (Array(6, 1, 0, 4, 40),Array(0.011956735401221285, 0.006541063462593452, 0.005428932374204778, 0.004308569608730405, 0.003696595226603709)), (Array(4, 26, 0, 42, 5),Array(0.0019959039820595533, 0.00198077299794292, 0.0016723315231586236, 0.0015606901977245095, 0.0013015271817698212)), (Array(0, 23, 26, 2, 4),Array(0.003615862714921936, 0.001961114693915351, 0.0018595382287745752, 0.0017819242854891695, 0.0016468683030306027)), (Array(11, 3, 19, 0, 5),Array(0.008784671423019166, 0.00282365356227211, 0.0024833597381016476, 0.0019777615447230884, 0.0016576304456760946)), (Array(1, 2, 0, 3, 23),Array(0.0021068918389201634, 0.0020497480766035994, 0.002022347553873645, 0.0019500819941038825, 0.001503771370040063)), (Array(12, 4, 0, 6, 1),Array(0.00433587608823225, 0.0025833796049907604, 0.002007970741805987, 0.0016840410422251017, 0.0010661551551733228)), (Array(0, 10, 4, 6, 16),Array(0.0020279945673448915, 0.0019980250335794405, 0.0012733121858788797, 0.001192108885417234, 0.001186180285931844)), (Array(2, 5, 1, 0, 12),Array(0.004262090436242644, 0.0021537790725358777, 0.0013683197398457016, 0.0010911699327713488, 0.0010869000557749361)), (Array(10, 0, 2, 1, 5),Array(0.004769396664496132, 0.0026229974920448534, 0.0021612642420959253, 0.0013228057897488347, 0.001171812635848879)), (Array(0, 1, 2, 3, 4),Array(0.025323543461007635, 0.018361261941348715, 0.01574431601713426, 0.014855701536091734, 0.011957607420818889)), (Array(2, 0, 1, 18, 24),Array(0.004346004035796333, 0.0022903208899377127, 0.002008680613491114, 0.0019547134832950414, 0.0017287784612649724)), (Array(0, 4, 1, 3, 2),Array(0.003217184151682409, 0.003063734585623867, 0.0018328245079520728, 0.0017709019452594528, 0.0016903614729120188)), (Array(121, 68, 1, 2, 10),Array(0.008911727886543675, 0.0025143616929346174, 0.0023958078165974795, 0.00184251815055585, 
0.0018199130672157007))) vocabList: Array[String] = Array(know, just, like, want, think, right, going, good, yeah, tell, come, time, look, didn, mean, make, okay, really, little, sure, gonna, thing, people, said, maybe, need, sorry, love, talk, thought, doing, life, night, things, work, money, better, told, long, help, believe, years, shit, does, away, place, hell, doesn, great, home, feel, fuck, kind, remember, dead, course, wouldn, wait, kill, guess, understand, thank, girl, wrong, leave, listen, talking, real, stop, hear, nice, happened, fine, wanted, father, gotta, mind, fucking, house, wasn, getting, world, stay, mother, left, came, care, thanks, knew, room, trying, guys, went, looking, coming, heard, friend, haven, seen, best, tonight, live, used, matter, killed, pretty, business, idea, couldn, head, miss, says, wife, called, woman, morning, tomorrow, start, stuff, saying, play, hello, baby, hard, probably, minute, days, took, somebody, school, today, meet, gone, crazy, wants, damn, forget, cause, problem, deal, case, friends, point, hope, jesus, afraid, looks, knows, year, worry, exactly, aren, half, thinking, shut, hold, wanna, face, minutes, bring, word, read, doctor, everybody, makes, supposed, story, turn, true, watch, thousand, family, brother, kids, week, happen, fuckin, working, happy, open, lost, john, hurt, town, ready, alright, late, actually, gave, married, beautiful, soon, jack, times, sleep, door, having, drink, hand, easy, gets, chance, young, trouble, different, anybody, shot, rest, hate, death, second, later, asked, phone, wish, check, quite, walk, change, police, couple, question, close, taking, heart, hours, making, comes, anymore, truth, trust, dollars, important, captain, telling, funny, person, honey, goes, eyes, reason, inside, stand, break, number, tried, means, high, white, water, suppose, body, sick, game, excuse, party, women, country, answer, waiting, christ, office, send, pick, alive, sort, blood, black, daddy, line, husband, 
goddamn, book, fifty, thirty, million, fact, hands, died, power, started, stupid, shouldn, months, boys, city, sense, dinner, running, hour, shoot, drive, fight, speak, george, living, ship, figure, dear, street, ahead, lady, seven, scared, free, feeling, frank, able, children, outside, safe, moment, news, president, brought, write, happens, sent, bullshit, lose, light, glad, child, girls, sister, sounds, lives, till, promise, sound, weren, save, poor, cool, asking, shall, plan, king, bitch, daughter, beat, weeks, york, cold, worth, taken, harry, needs, piece, movie, fast, possible, small, goin, straight, human, hair, food, tired, company, lucky, pull, wonderful, touch, looked, state, thinks, picture, words, leaving, control, clear, known, special, buddy, luck, follow, order, expect, mary, catch, mouth, worked, mister, learn, playing, perfect, dream, calling, questions, hospital, coffee, takes, ride, parents, miles, works, secret, hotel, explain, worse, kidding, past, outta, general, unless, felt, drop, throw, hang, interested, certainly, absolutely, earth, loved, wonder, dark, accident, seeing, doin, turned, simple, clock, date, sweet, meeting, clean, sign, feet, handle, army, music, giving, report, cops, fucked, charlie, information, yesterday, smart, fall, fault, class, bank, month, blow, swear, caught, major, paul, road, talked, choice, boss, plane, david, paid, wear, american, worried, clothes, ones, lord, goodbye, paper, terrible, strange, mistake, given, kept, finish, blue, murder, hurry, apartment, sell, middle, nothin, careful, hasn, meant, walter, moving, changed, fair, imagine, difference, quiet, happening, near, quit, personal, marry, figured, rose, future, building, kinda, agent, early, mama, michael, watching, trip, private, busy, record, certain, jimmy, broke, longer, sake, store, finally, boat, stick, born, sitting, evening, bucks, history, chief, lying, ought, honor, kiss, darling, lunch, uncle, fool, favor, respect, rich, land, liked, killing, 
peter, tough, brain, interesting, completely, welcome, nick, problems, wake, radio, dick, honest, cash, dance, dude, james, bout, floor, weird, court, jail, calls, window, involved, drunk, johnny, officer, needed, asshole, spend, situation, books, relax, pain, grand, dangerous, service, letter, stopped, security, realize, offer, table, message, bastard, killer, instead, jake, deep, nervous, pass, somethin, evil, english, bought, short, step, ring, picked, likes, machine, voice, eddie, upset, carry, forgot, lived, afternoon, fear, finished, quick, count, forgive, wrote, named, decided, totally, space, team, lawyer, pleasure, doubt, suit, station, gotten, bother, return, prove, slow, pictures, bunch, strong, list, wearing, driving, join, tape, christmas, force, church, attack, appreciate, college, standing, hungry, present, dying, charge, prison, missing, truck, board, public, staying, calm, gold, ball, hardly, hadn, lead, missed, island, government, cover, horse, reach, joke, french, fish, star, america, moved, soul, surprise, mike, putting, seconds, club, self, movies, dress, cost, lots, price, listening, saved, smell, mark, peace, dreams, crime, gives, entire, department, usually, single, holy, west, beer, nose, wall, stuck, protect, ways, teach, train, grow, awful, type, forever, rock, detective, billy, dumb, papers, walking, beginning, planet, folks, park, attention, card, hide, birthday, master, share, lieutenant, starting, test, reading, field, partner, twice, enjoy, film, bomb, mess, blame, dollar, loves, girlfriend, south, round, records, especially, using, plenty, gentlemen, evidence, silly, admit, experience, fired, normal, talkin, lock, mission, memory, louis, fighting, notice, crap, wedding, promised, ground, idiot, orders, marriage, guns, glass, impossible, heaven, knock, spent, neck, wondering, green, animal, hole, press, drugs, nuts, position, broken, names, asleep, jerry, acting, feels, visit, plans, boyfriend, smoke, paris, wind, tells, gimme, 
holding, cross, sheriff, walked, mention, judge, code, writing, double, brothers, keeps, pardon, fellow, fell, closed, lovely, angry, cute, percent, surprised, charles, agree, bathroom, correct, address, ridiculous, summer, andy, rules, tommy, group, account, note, learned, colonel, pulled, sing, laugh, proud, sleeping, area, built, jump, upstairs, difficult, river, bobby, dirty, breakfast, bridge, betty, locked, amazing, north, alex, definitely, plus, feelings, accept, kick, worst, grace, gettin, wild, stories, steal, seriously, file, relationship, advice, nature, places, waste, contact, spot, apart, knowing, stole, beach, favorite, loose, level, song, faith, risk, played, eating, foot, patient, witness, turns, washington, action, build, obviously, begin, split, crew, command, games, decide, tight, nurse, keeping, bird, form, runs, copy, scene, jeffrey, arrest, complete, taste, consider, insane, teeth, shoes, henry, career, sooner, monster, devil, hall, innocent, showed, study, gift, weekend, heavy, keys, greatest, comin, destroy, danger, track, raise, suddenly, hanging, bruce, carl, california, apologize, concerned, blind, program, medical, chicken, sweetheart, drinking, forward, seventy, willing, shop, guard, legs, suspect, professor, admiral, data, ticket, camp, tree, goodnight, paying, burn, losing, possibly, dunno, television, senator, trick, murdered, dropped, extra, credit, starts, warm, stone, sold, hiding, meaning, taught, marty, cheap, lately, simply, science, lookin, following, harold, queen, majesty, jeff, corner, cars, heads, training, seat, duty, noticed, helped, bear, enemy, discuss, responsible, trial, dave) topics: Array[Array[(String, Double)]] = Array(Array((right,0.002368539995607174), (love,0.0019026093436816463), (just,0.001739005396343051), (okay,0.001493567868602809), (know,0.0011919944841106388)), Array((like,0.012255569993473736), (just,0.007532527227834193), (come,0.007114873840600518), (know,0.006960825682483897), 
(think,0.006460380113586568)), Array((know,0.017593342399778864), (yeah,0.01729763439457538), (gonna,0.014297985209693677), (just,0.009395640487800467), (tell,0.007112117826339655)), Array((just,0.002310885836348927), (know,0.0020049203493508585), (better,0.001839601963450054), (like,0.0016545385663972387), (right,0.001505081787498549)), Array((know,0.012396058201765845), (didn,0.004786910731106122), (like,0.004783067030382327), (right,0.003733205551673614), (just,0.0028592628116592403)), Array((just,0.0028236500929191208), (know,0.0026011344347436015), (going,0.0015951009390631876), (didn,0.001385667983895007), (wait,0.001275555813151892)), Array((going,0.00275337137203844), (right,0.001685679960504387), (just,0.0015380845174617235), (know,0.0014818062892167352), (captain,0.0013896743515293423)), Array((going,0.011956735401221285), (just,0.006541063462593452), (know,0.005428932374204778), (think,0.004308569608730405), (believe,0.003696595226603709)), Array((think,0.0019959039820595533), (sorry,0.00198077299794292), (know,0.0016723315231586236), (shit,0.0015606901977245095), (right,0.0013015271817698212)), Array((know,0.003615862714921936), (said,0.001961114693915351), (sorry,0.0018595382287745752), (like,0.0017819242854891695), (think,0.0016468683030306027)), Array((time,0.008784671423019166), (want,0.00282365356227211), (sure,0.0024833597381016476), (know,0.0019777615447230884), (right,0.0016576304456760946)), Array((just,0.0021068918389201634), (like,0.0020497480766035994), (know,0.002022347553873645), (want,0.0019500819941038825), (said,0.001503771370040063)), Array((look,0.00433587608823225), (think,0.0025833796049907604), (know,0.002007970741805987), (going,0.0016840410422251017), (just,0.0010661551551733228)), Array((know,0.0020279945673448915), (come,0.0019980250335794405), (think,0.0012733121858788797), (going,0.001192108885417234), (okay,0.001186180285931844)), Array((like,0.004262090436242644), (right,0.0021537790725358777), (just,0.0013683197398457016), 
(know,0.0010911699327713488), (look,0.0010869000557749361)), Array((come,0.004769396664496132), (know,0.0026229974920448534), (like,0.0021612642420959253), (just,0.0013228057897488347), (right,0.001171812635848879)), Array((know,0.025323543461007635), (just,0.018361261941348715), (like,0.01574431601713426), (want,0.014855701536091734), (think,0.011957607420818889)), Array((like,0.004346004035796333), (know,0.0022903208899377127), (just,0.002008680613491114), (little,0.0019547134832950414), (maybe,0.0017287784612649724)), Array((know,0.003217184151682409), (think,0.003063734585623867), (just,0.0018328245079520728), (want,0.0017709019452594528), (like,0.0016903614729120188)), Array((hello,0.008911727886543675), (stop,0.0025143616929346174), (just,0.0023958078165974795), (like,0.00184251815055585), (come,0.0018199130672157007)))

Step 9. Create LDA model with Expectation Maximization

Let's try creating an LDA model with Expectation Maximization (EM) on the data that has been refiltered for additional stopwords. We will also increase maxIterations to 100 here to see whether that improves the results. See:

import org.apache.spark.mllib.clustering.EMLDAOptimizer

// Set LDA parameters
val em_lda = new LDA()
  .setOptimizer(new EMLDAOptimizer())
  .setK(numTopics)
  .setMaxIterations(100)
  .setDocConcentration(-1) // use default values
  .setTopicConcentration(-1) // use default values
import org.apache.spark.mllib.clustering.EMLDAOptimizer
em_lda: org.apache.spark.mllib.clustering.LDA = org.apache.spark.mllib.clustering.LDA@7c84d0ae
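Passing -1 to setDocConcentration and setTopicConcentration asks MLlib to choose the values itself. According to the MLlib documentation, with the EM optimizer these defaults are (50 / k) + 1 for the document concentration and 0.1 + 1 for the topic concentration. The sketch below only reproduces that arithmetic in plain Scala; the helper names are hypothetical and not part of MLlib, and k = 20 is assumed to match numTopics in this notebook.

```scala
// Hypothetical helpers (NOT MLlib API): they mirror the EM defaults
// described in the MLlib documentation for LDA.
def emDefaultDocConcentration(k: Int): Double = 50.0 / k + 1.0
def emDefaultTopicConcentration: Double = 0.1 + 1.0

// Assumption: numTopics is 20, as reported elsewhere in this notebook.
val assumedK = 20

println(s"docConcentration (alpha) for k = $assumedK: ${emDefaultDocConcentration(assumedK)}")
println(s"topicConcentration (beta): $emDefaultTopicConcentration")
```

For k = 20 the document concentration comes out as 3.5. Both values are above 1.0, which is what the EM optimizer expects; the online optimizer uses different defaults.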
val em_ldaModel = em_lda.run(new_lda_countVector) // takes a long time: about 22 minutes
em_ldaModel: org.apache.spark.mllib.clustering.LDAModel = org.apache.spark.mllib.clustering.DistributedLDAModel@188f58bf
import org.apache.spark.mllib.clustering.DistributedLDAModel
// The EM optimizer returns a DistributedLDAModel, so this downcast is expected to succeed here
val em_DldaModel = em_ldaModel.asInstanceOf[DistributedLDAModel]
import org.apache.spark.mllib.clustering.DistributedLDAModel
em_DldaModel: org.apache.spark.mllib.clustering.DistributedLDAModel = org.apache.spark.mllib.clustering.DistributedLDAModel@188f58bf
val top10ConversationsPerTopic = em_DldaModel.topDocumentsPerTopic(10)
top10ConversationsPerTopic: Array[(Array[Long], Array[Double])] = Array((Array(39677, 39693, 39680, 39679, 39674, 39682, 39681, 39676, 41932, 41967),Array(0.03185722402229515, 0.03185722402229515, 0.03185722402229515, 0.03185722402229515, 0.03185722402229515, 0.03185722402229515, 0.031196200176884056, 0.020282154018599348, 0.01099645315549, 0.01099645315549)), (Array(39677, 39693, 39680, 39679, 39674, 39682, 39681, 39676, 41932, 41967),Array(0.035359403020952286, 0.035359403020952286, 0.035359403020952286, 0.035359403020952286, 0.035359403020952286, 0.035359403020952286, 0.03471449892991739, 0.022506359024306477, 0.0112575667750105, 0.0112575667750105)), (Array(39677, 39693, 39680, 39679, 39674, 39682, 39681, 39676, 41932, 41967),Array(0.02454380082221751, 0.02454380082221751, 0.02454380082221751, 0.02454380082221751, 0.02454380082221751, 0.02454380082221751, 0.02390214852968437, 0.01563567947724491, 0.01037864738296468, 0.01037864738296468)), (Array(69318, 15221, 15149, 23167, 59606, 51632, 51639, 64470, 67338, 66968),Array(0.9999514001066685, 0.999945172626603, 0.9999406008121946, 0.9999406008121946, 0.9999406008121946, 0.9999406008121946, 0.9999406008121946, 0.9999406008121946, 0.9999406008121946, 0.9999406008121946)), (Array(39679, 39677, 39693, 39680, 39674, 39682, 39681, 39676, 41932, 41967),Array(0.05060524129300216, 0.05060524129300216, 0.05060524129300216, 0.05060524129300216, 0.05060524129300216, 0.05060524129300216, 0.05027652950865874, 0.032180017393406625, 0.011630224445618545, 0.011630224445618545)), (Array(39679, 39677, 39693, 39680, 39674, 39682, 39681, 39676, 41932, 41967),Array(0.043670158470644385, 0.043670158470644385, 0.043670158470644385, 0.043670158470644385, 0.043670158470644385, 0.043670158470644385, 0.04313069897321676, 0.027782187731151566, 0.01176138814023006, 0.01176138814023006)), (Array(39679, 39677, 39693, 39680, 39674, 39682, 39681, 39676, 41932, 41967),Array(0.04718260291563998, 0.04718260291563998, 0.04718260291563998, 
0.04718260291563998, 0.04718260291563998, 0.04718260291563998, 0.046744086701796375, 0.030009654606021424, 0.011639894189919175, 0.011639894189919175)), (Array(39679, 39677, 39693, 39680, 39674, 39682, 39681, 39676, 41932, 41967),Array(0.04675497162050312, 0.04675497162050312, 0.04675497162050312, 0.04675497162050312, 0.04675497162050312, 0.04675497162050312, 0.04630740194891603, 0.02973828460805704, 0.011613267488918038, 0.011613267488918038)), (Array(39677, 39693, 39680, 39679, 39674, 39682, 39681, 39676, 41932, 41967),Array(0.02922347731698308, 0.02922347731698308, 0.02922347731698308, 0.02922347731698308, 0.02922347731698308, 0.02922347731698308, 0.02856137645845995, 0.01860910369309368, 0.010781174428638705, 0.010781174428638705)), (Array(39677, 39693, 39680, 39674, 39682, 39679, 39681, 39676, 41932, 41967),Array(0.05098977390451263, 0.05098977390451263, 0.05098977390451263, 0.05098977390451263, 0.05098977390451263, 0.05098977390451263, 0.05065478074966324, 0.032424728159182487, 0.011804021891259129, 0.011804021891259129)), (Array(39677, 39693, 39680, 39674, 39682, 39679, 39681, 39676, 41932, 41967),Array(0.05458945924257291, 0.05458945924257291, 0.05458945924257291, 0.05458945924257291, 0.05458945924257291, 0.05458945924257291, 0.054382561204063574, 0.03470709039767808, 0.011833044488991173, 0.011833044488991173)), (Array(39677, 39693, 39680, 39679, 39674, 39682, 39681, 39676, 41932, 41967),Array(0.02757805103727522, 0.02757805103727522, 0.02757805103727522, 0.02757805103727522, 0.02757805103727522, 0.02757805103727522, 0.026914567318365282, 0.017564142155031864, 0.010769649833867787, 0.010769649833867787)), (Array(39679, 39677, 39693, 39680, 39674, 39682, 39681, 39676, 41932, 41967),Array(0.04651386476447619, 0.04651386476447619, 0.04651386476447619, 0.04651386476447619, 0.04651386476447619, 0.04651386476447619, 0.046074356990568124, 0.029584640688217628, 0.011466200232752964, 0.011466200232752964)), (Array(39679, 39677, 39693, 39680, 39674, 39682, 39681, 
39676, 41932, 41967),Array(0.05592908452117546, 0.05592908452117546, 0.05592908452117546, 0.05592908452117546, 0.05592908452117546, 0.05592908452117546, 0.055741511067049866, 0.03555776746162745, 0.01211237486139149, 0.01211237486139149)), (Array(39681, 39674, 39677, 39693, 39680, 39682, 39679, 39676, 41932, 41967),Array(0.06048967215526035, 0.060441959490269634, 0.060441959490269634, 0.060441959490269634, 0.060441959490269634, 0.060441959490269634, 0.060441959490269634, 0.03841642394224334, 0.011870824949431247, 0.011870824949431247)), (Array(39681, 39679, 39677, 39693, 39680, 39674, 39682, 39676, 41932, 41967),Array(0.06567035036792095, 0.06540012688944775, 0.06540012688944775, 0.06540012688944775, 0.06540012688944775, 0.06540012688944775, 0.06540012688944775, 0.041559067084071775, 0.012094068526188358, 0.012094068526188358)), (Array(39681, 39674, 39677, 39693, 39680, 39682, 39679, 39676, 41932, 41967),Array(0.07103855727399273, 0.07041257887551527, 0.07041257887551527, 0.07041257887551527, 0.07041257887551527, 0.07041257887551527, 0.07041257887551527, 0.04473147169952923, 0.011812341727214322, 0.011812341727214322)), (Array(39679, 39677, 39693, 39680, 39674, 39682, 39681, 39676, 41932, 41967),Array(0.04379464422302749, 0.04379464422302749, 0.04379464422302749, 0.04379464422302749, 0.04379464422302749, 0.04379464422302749, 0.04327692350658171, 0.027860179926002853, 0.01152922335209424, 0.01152922335209424)), (Array(39677, 39693, 39680, 39679, 39674, 39682, 39681, 39676, 41932, 41967),Array(0.04365837465882906, 0.04365837465882906, 0.04365837465882906, 0.04365837465882906, 0.04365837465882906, 0.04365837465882906, 0.04316378247947061, 0.02777238708460723, 0.011237658307104376, 0.011237658307104376)), (Array(39679, 39677, 39693, 39680, 39674, 39682, 39681, 39676, 41932, 41967),Array(0.04674235176753242, 0.04674235176753242, 0.04674235176753242, 0.04674235176753242, 0.04674235176753242, 0.04674235176753242, 0.0463101731327656, 0.02972951504690979, 
0.011452462524389802, 0.011452462524389802)))
top10ConversationsPerTopic.length // number of topics
res52: Int = 20
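The result of topDocumentsPerTopic is an array with one entry per topic; each entry pairs an array of document IDs with an array of matching weights, sorted by decreasing weight. A minimal sketch of how such a structure could be flattened into rows that are easier to scan is given below. It is plain Scala over made-up data (the names toyTopDocs and toyRows and all values are illustrative assumptions), not output from the model above.

```scala
// Made-up stand-in with the same shape as top10ConversationsPerTopic:
// one (document ids, weights) pair per topic.
val toyTopDocs: Array[(Array[Long], Array[Double])] = Array(
  (Array(11L, 42L, 7L), Array(0.91, 0.85, 0.40)),
  (Array(7L, 3L, 11L), Array(0.55, 0.52, 0.06))
)

// Flatten into (topic, rank, document id, weight) rows.
val toyRows: Array[(Int, Int, Long, Double)] =
  toyTopDocs.zipWithIndex.flatMap { case ((docIds, weights), topic) =>
    docIds.zip(weights).zipWithIndex.map { case ((docId, weight), rank) =>
      (topic, rank + 1, docId, weight)
    }
  }

toyRows.foreach { case (topic, rank, docId, weight) =>
  println(f"topic $topic%2d  rank $rank%2d  doc $docId%6d  weight $weight%.4f")
}
```

Applied to the real top10ConversationsPerTopic, the same flattening would make it easier to see which conversation IDs recur across topics.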
//em_DldaModel.topicDistributions.take(10).foreach(println)

Note that the EMLDAOptimizer produces a DistributedLDAModel, which stores not only the inferred topics but also the full training corpus and topic distributions for each document in the training corpus.
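To make the note above concrete, here is a small sketch on a made-up four-document corpus. It assumes Spark's MLlib is on the classpath; every name starting with toy is hypothetical and the data are invented. It shows the per-document topic mixtures that a DistributedLDAModel keeps, and toLocal, which keeps only the inferred topics.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.mllib.clustering.{DistributedLDAModel, EMLDAOptimizer, LDA}
import org.apache.spark.mllib.linalg.{Vector, Vectors}

// Reuse the notebook's SparkContext if one exists; otherwise start a local one.
val toySc = SparkContext.getOrCreate(
  new SparkConf().setAppName("em-lda-sketch").setMaster("local[2]"))

// Made-up corpus: (document id, term-count vector) over a 4-word vocabulary.
val toyCorpus = toySc.parallelize(Seq[(Long, Vector)](
  (0L, Vectors.dense(4.0, 2.0, 0.0, 0.0)),
  (1L, Vectors.dense(3.0, 3.0, 0.0, 1.0)),
  (2L, Vectors.dense(0.0, 0.0, 5.0, 2.0)),
  (3L, Vectors.dense(0.0, 1.0, 2.0, 4.0))
))

val toyModel = new LDA()
  .setOptimizer(new EMLDAOptimizer())
  .setK(2)
  .setMaxIterations(10)
  .run(toyCorpus)

// Pattern match instead of asInstanceOf, so an unexpected model type
// fails with a readable message.
val toyDistributed: DistributedLDAModel = toyModel match {
  case m: DistributedLDAModel => m
  case other => sys.error(s"Expected DistributedLDAModel, got ${other.getClass.getName}")
}

// Per-document topic mixtures, available because the training corpus is stored.
val toyDocTopics: Array[(Long, Vector)] =
  toyDistributed.topicDistributions.collect().sortBy(_._1)
toyDocTopics.foreach { case (docId, dist) =>
  println(s"doc $docId -> ${dist.toArray.mkString(", ")}")
}

// toLocal keeps only the inferred topics and drops the per-document data.
val toyLocal = toyDistributed.toLocal
println(s"local model: k = ${toyLocal.k}, vocabSize = ${toyLocal.vocabSize}")
```

The local model is lighter to keep around, but the per-document information (topicDistributions, topDocumentsPerTopic) is only available on the distributed one.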

val topicIndices = em_ldaModel.describeTopics(maxTermsPerTopic = 5)
topicIndices: Array[(Array[Int], Array[Double])] = Array((Array(6435, 9153, 2611, 9555, 9235),Array(1.0844350865928232E-5, 1.4037356622456141E-6, 1.0198257636937534E-6, 1.010016392533973E-6, 9.877489659219E-7)), (Array(6435, 9153, 2611, 9555, 9235),Array(1.2201894817101623E-5, 1.4560010186049552E-6, 1.0547580487281058E-6, 1.0446104695648421E-6, 1.0214202904824573E-6)), (Array(6435, 9153, 2611, 9555, 9235),Array(8.080320102037276E-6, 1.2828806625265042E-6, 9.387148884503143E-7, 9.296944883594565E-7, 9.095512260026888E-7)), (Array(0, 1, 2, 3, 4),Array(0.4097048012129488, 0.2966641691130405, 0.28104437242573427, 0.2068481221090779, 0.20178462784115517)), (Array(6435, 9153, 2611, 9555, 9235),Array(1.7791420865642426E-5, 1.5285401934315644E-6, 1.1022151610359566E-6, 1.0916092052333647E-6, 1.0671154286074535E-6)), (Array(6435, 9153, 2611, 9555, 9235),Array(1.5488156532652564E-5, 1.5613578155095174E-6, 1.1250530213722066E-6, 1.1142275765190935E-6, 1.0891766415036671E-6)), (Array(6435, 9153, 2611, 9555, 9235),Array(1.66201985348282E-5, 1.5337088341752489E-6, 1.1062252459821718E-6, 1.0955808549686414E-6, 1.0710096202234095E-6)), (Array(6435, 9153, 2611, 9555, 9235),Array(1.6434785283305463E-5, 1.527062738898831E-6, 1.1015632294086975E-6, 1.0909636379478556E-6, 1.066504587082138E-6)), (Array(6435, 9153, 2611, 9555, 9235),Array(9.82890555944203E-6, 1.360381381982805E-6, 9.90695338703216E-7, 9.811686105969582E-7, 9.596620143926599E-7)), (Array(6435, 9153, 2611, 9555, 9235),Array(1.8098274080649888E-5, 1.5662560052135424E-6, 1.127571968783498E-6, 1.1167221871321394E-6, 1.0915664277968502E-6)), (Array(6435, 9153, 2611, 9555, 9235),Array(1.9443267750173392E-5, 1.5746049595955017E-6, 1.1333735056120856E-6, 1.1224679386855895E-6, 1.0971718558358495E-6)), (Array(6435, 9153, 2611, 9555, 9235),Array(9.292996004992278E-6, 1.3619125930485615E-6, 9.924672219451632E-7, 9.82924355173023E-7, 9.614096002911668E-7)), (Array(6435, 9153, 2611, 9555, 9235),Array(1.619330796598465E-5, 
1.4932367796221485E-6, 1.0785114269963956E-6, 1.06813378362302E-6, 1.0442595139466752E-6)), (Array(6435, 9153, 2611, 9555, 9235),Array(2.0195445781462442E-5, 1.6338598744234947E-6, 1.1726861776132844E-6, 1.1614034519421386E-6, 1.1350541791534873E-6)), (Array(6435, 9153, 2611, 9555, 9235),Array(2.1543159186970775E-5, 1.5791785506830092E-6, 1.1358076217376717E-6, 1.1248786573437884E-6, 1.099486352793341E-6)), (Array(6435, 9153, 2611, 9555, 9235),Array(2.3565018803229148E-5, 1.6252544688003071E-6, 1.16608206417593E-6, 1.1548627950846766E-6, 1.1286452359926982E-6)), (Array(6435, 9153, 2611, 9555, 9235),Array(2.498926755354901E-5, 1.5618937315237142E-6, 1.1234358831022108E-6, 1.1126257210374892E-6, 1.0875181953216021E-6)), (Array(6435, 9153, 2611, 9555, 9235),Array(1.5342892698391062E-5, 1.5117065677915513E-6, 1.0917779017440848E-6, 1.0812727863583168E-6, 1.057095929328646E-6)), (Array(6435, 9153, 2611, 9555, 9235),Array(1.5018034022313325E-5, 1.4466343222454145E-6, 1.04735014561389E-6, 1.0372732437703543E-6, 1.0142213705569144E-6)), (Array(6435, 9153, 2611, 9555, 9235),Array(1.627670929533595E-5, 1.4917359134556584E-6, 1.0776961757775105E-6, 1.067326467379095E-6, 1.043483836251971E-6)))
val vocabList = vectorizer.vocabulary
vocabList: Array[String] = Array(know, just, like, want, think, right, going, good, yeah, tell, come, time, look, didn, mean, make, okay, really, little, sure, gonna, thing, people, said, maybe, need, sorry, love, talk, thought, doing, life, night, things, work, money, better, told, long, help, believe, years, shit, does, away, place, hell, doesn, great, home, feel, fuck, kind, remember, dead, course, wouldn, wait, kill, guess, understand, thank, girl, wrong, leave, listen, talking, real, hear, stop, nice, happened, fine, wanted, father, gotta, mind, fucking, house, wasn, getting, world, stay, mother, left, came, care, thanks, knew, room, trying, guys, went, looking, coming, heard, friend, haven, seen, best, tonight, live, used, matter, killed, pretty, business, idea, couldn, head, miss, says, wife, called, woman, morning, tomorrow, start, stuff, saying, play, hello, baby, hard, probably, minute, days, took, somebody, today, school, meet, gone, crazy, wants, damn, forget, problem, cause, deal, case, friends, point, hope, jesus, afraid, looks, knows, year, worry, exactly, aren, half, thinking, shut, hold, wanna, face, minutes, bring, word, read, doctor, everybody, supposed, makes, story, turn, true, watch, thousand, family, brother, kids, week, happen, fuckin, working, open, happy, lost, john, hurt, town, ready, alright, late, actually, married, gave, beautiful, soon, jack, times, sleep, door, having, drink, hand, easy, gets, chance, young, trouble, different, anybody, shot, rest, hate, death, second, later, asked, phone, wish, check, quite, walk, change, police, couple, question, close, taking, heart, hours, making, comes, anymore, truth, trust, dollars, important, captain, telling, funny, person, honey, goes, eyes, reason, inside, stand, break, means, number, tried, high, white, water, suppose, body, sick, game, excuse, party, women, country, answer, christ, waiting, office, send, pick, alive, sort, blood, black, daddy, line, husband, goddamn, book, fifty, thirty, 
fact, million, died, hands, power, stupid, started, shouldn, months, boys, city, sense, dinner, running, hour, shoot, fight, drive, speak, george, ship, living, figure, dear, street, ahead, lady, seven, scared, free, feeling, frank, able, children, safe, moment, outside, news, president, brought, write, happens, sent, bullshit, lose, light, glad, child, girls, sounds, sister, promise, lives, till, sound, weren, save, poor, cool, shall, asking, plan, king, bitch, daughter, weeks, beat, york, cold, worth, taken, harry, needs, piece, movie, fast, possible, small, goin, straight, human, hair, company, food, tired, lucky, pull, wonderful, touch, looked, thinks, state, picture, leaving, words, control, clear, known, special, buddy, luck, order, follow, expect, mary, catch, mouth, worked, mister, learn, playing, perfect, dream, calling, questions, hospital, takes, ride, coffee, miles, parents, works, secret, hotel, explain, kidding, worse, past, outta, general, felt, drop, unless, throw, interested, hang, certainly, absolutely, earth, loved, dark, wonder, accident, seeing, turned, clock, simple, doin, date, sweet, meeting, clean, sign, feet, handle, music, report, giving, army, fucked, cops, charlie, smart, yesterday, information, fall, fault, bank, class, month, blow, swear, caught, major, paul, road, talked, choice, plane, boss, david, paid, wear, american, worried, lord, paper, goodbye, clothes, ones, terrible, strange, given, mistake, finish, kept, blue, murder, hurry, apartment, sell, middle, nothin, careful, hasn, meant, walter, moving, changed, imagine, fair, difference, quiet, happening, near, quit, personal, marry, figured, future, rose, building, mama, michael, early, agent, kinda, watching, private, trip, record, certain, busy, jimmy, broke, sake, longer, store, boat, stick, finally, born, evening, sitting, bucks, ought, chief, lying, history, kiss, honor, darling, lunch, favor, fool, uncle, respect, rich, land, liked, killing, peter, tough, brain, interesting, 
completely, problems, welcome, nick, wake, honest, radio, dick, cash, dance, dude, james, bout, floor, weird, court, calls, jail, drunk, window, involved, johnny, officer, needed, asshole, situation, spend, books, relax, pain, service, grand, dangerous, letter, security, stopped, offer, realize, table, bastard, message, instead, killer, jake, deep, nervous, somethin, pass, evil, english, bought, short, step, ring, picked, likes, machine, eddie, voice, upset, forgot, carry, lived, afternoon, fear, quick, finished, count, forgive, wrote, named, decided, totally, space, team, pleasure, doubt, lawyer, station, gotten, suit, bother, prove, return, slow, pictures, bunch, strong, list, wearing, driving, join, tape, christmas, attack, appreciate, force, church, college, hungry, standing, present, dying, prison, missing, charge, board, truck, public, calm, gold, staying, ball, hardly, hadn, missed, lead, island, government, horse, cover, french, reach, joke, fish, star, mike, surprise, america, moved, soul, dress, seconds, club, self, putting, movies, lots, cost, listening, price, saved, smell, mark, peace, dreams, entire, crime, gives, usually, single, department, holy, beer, west, protect, stuck, wall, nose, ways, teach, forever, grow, train, type, awful, rock, detective, billy, walking, dumb, papers, beginning, planet, folks, park, attention, birthday, hide, card, master, share, reading, test, starting, lieutenant, field, partner, enjoy, twice, film, dollar, bomb, mess, blame, south, loves, girlfriend, round, records, using, plenty, especially, gentlemen, evidence, silly, experience, admit, fired, normal, talkin, mission, louis, memory, fighting, lock, notice, crap, wedding, promised, marriage, ground, guns, glass, idiot, orders, impossible, heaven, knock, hole, neck, animal, spent, green, wondering, nuts, press, drugs, broken, position, names, asleep, jerry, visit, boyfriend, acting, feels, plans, paris, smoke, tells, wind, cross, holding, sheriff, gimme, walked, 
mention, writing, double, brothers, code, judge, pardon, keeps, fellow, fell, closed, lovely, angry, cute, charles, surprised, percent, correct, bathroom, agree, address, andy, ridiculous, summer, tommy, rules, group, account, note, pulled, sleeping, sing, learned, proud, laugh, colonel, upstairs, river, difficult, built, jump, area, dirty, betty, bridge, breakfast, bobby, locked, amazing, north, feelings, alex, plus, definitely, worst, accept, kick, seriously, grace, steal, wild, stories, file, gettin, relationship, advice, nature, contact, spot, places, waste, knowing, beach, stole, apart, favorite, faith, level, loose, risk, song, eating, foot, played, patient, washington, turns, witness, action, build, obviously, begin, split, crew, command, games, tight, decide, nurse, keeping, runs, form, bird, copy, insane, complete, arrest, consider, taste, scene, jeffrey, teeth, shoes, career, henry, sooner, devil, monster, showed, weekend, gift, innocent, study, heavy, hall, comin, danger, greatest, track, keys, raise, destroy, concerned, program, carl, blind, apologize, suddenly, hanging, bruce, california, chicken, seventy, forward, drinking, sweetheart, medical, suspect, admiral, guard, shop, professor, legs, willing, camp, data, ticket, tree, goodnight, television, losing, senator, murdered, burn, dunno, paying, possibly, trick, dropped, credit, extra, starts, warm, hiding, meaning, sold, stone, taught, marty, lately, cheap, lookin, science, simply, jeff, corner, harold, following, majesty, queen, duty, cars, training, heads, seat, discuss, bear, enemy, helped, noticed, common, screw, dave)
vocabList.size
res32: Int = 10000
val topics = topicIndices.map { case (terms, termWeights) =>
  terms.map(vocabList(_)).zip(termWeights)
}
topics: Array[Array[(String, Double)]] = Array(Array((just,0.030515134931284552), (like,0.02463563559747823), (want,0.022529385381465025), (damn,0.02094828832824297), (going,0.0203407289886203)), Array((yeah,0.10787301090151602), (look,0.0756831002291994), (know,0.04815746564274915), (wait,0.03897182014529944), (night,0.0341458394828345)), Array((gonna,0.08118584492034046), (money,0.051736711600637544), (shit,0.04620430294274594), (fuck,0.0399843125556081), (kill,0.03672740843080258)), Array((people,0.020091372023286612), (know,0.018613400462887356), (work,0.016775643603287843), (does,0.015522555458447744), (think,0.012161168331925723)), Array((know,0.031956573561538214), (just,0.030674598809934856), (want,0.027663491240851962), (tell,0.025727217382788027), (right,0.02300853167338119)), Array((love,0.05932570200934131), (father,0.030080735900045442), (life,0.01769248067468245), (true,0.016281752071881345), (young,0.014927950883812253)), Array((remember,0.03998401809663685), (went,0.01737965538107633), (lost,0.016916065536574213), (called,0.016443441316683228), (story,0.014849882671062261)), Array((house,0.028911209424810257), (miss,0.025669944694943093), (right,0.02091105252727788), (family,0.017862939987512365), (important,0.013959164390834044)), Array((saying,0.022939827090645636), (know,0.021335083902970984), (idea,0.017628999871937747), (business,0.017302568063786224), (police,0.012284217866942303)), Array((know,0.051876601466269136), (like,0.03828159069993671), (maybe,0.03754385940676905), (just,0.031938551661426284), (want,0.02876693222824349)), Array((years,0.032537676027398765), (going,0.030596831997667568), (case,0.02049555392502822), (doctor,0.018671171294737107), (working,0.017672067172167016)), Array((stuff,0.02236582778896705), (school,0.020057798194969816), (john,0.017134198006217606), (week,0.017075852415410653), (thousand,0.017013413435021035)), Array((little,0.08663446368316245), (girl,0.035120377589734936), (like,0.02992080326340266), 
(woman,0.0240813719635157), (baby,0.022471517953608963)), Array((know,0.0283115823590395), (leave,0.02744935904744228), (time,0.02050833156294194), (want,0.020124145131863225), (just,0.019466336438890477)), Array((didn,0.08220031921979461), (like,0.05062323326717784), (real,0.03087838046777391), (guess,0.02452989702353384), (says,0.022815035397008333)), Array((minutes,0.018541518543996716), (time,0.014737962244588431), (captain,0.012594614743931537), (thirty,0.01193707771669708), (ship,0.011260576815409516)), Array((okay,0.08153575328080886), (just,0.050004142902999975), (right,0.03438984898476042), (know,0.02821327795933634), (home,0.023397063860326372)), Array((country,0.011270500385627474), (power,0.010428408353623762), (president,0.009392162067926028), (fight,0.00799742811584178), (possible,0.007597974486019279)), Array((know,0.09541058020800194), (think,0.0698707939786508), (really,0.06881812755565207), (mean,0.02909700228968688), (just,0.028699687473471538)), Array((dead,0.03833642117149438), (like,0.017873711992106994), (hand,0.015280854355409379), (white,0.013718491413582671), (blood,0.012699265888344448)))
vocabList(47) // look up the term at vocabulary index 47 ('doesn' in this run); the term indices that appear in the topics may change between runs due to randomness in the algorithm
res33: String = doesn
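
To make the index-to-term lookup concrete, here is a minimal sketch in plain Scala collections (no Spark needed). It assumes only the first ten entries of vocabList and the indices and weights reported for topic 4 in this run; the names vocabPrefix, topic4 and topic4Terms are made up for illustration and are not part of the notebook.

```scala
// First ten entries of vocabList from this run (indices 0 to 9).
val vocabPrefix: Array[String] =
  Array("know", "just", "like", "want", "think", "right", "going", "good", "yeah", "tell")

// Topic 4 as reported by describeTopics in this run: (term indices, term weights).
val topic4: (Array[Int], Array[Double]) = (
  Array(0, 1, 3, 9, 5),
  Array(0.031956573561538214, 0.030674598809934856, 0.027663491240851962,
        0.025727217382788027, 0.02300853167338119)
)

// Look up each index in the vocabulary, then pair each word with its weight.
val (termIndices, termWeights) = topic4
val topic4Terms: Array[(String, Double)] =
  termIndices.map(vocabPrefix(_)).zip(termWeights)

// Print one "term<TAB>weight" line per term, as the notebook does.
topic4Terms.foreach { case (term, weight) => println(s"$term\t$weight") }
```

The words come out as know, just, want, tell, right, which matches the fifth entry of the topics array above.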

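The next cell does all of this at once: it describes the topics, maps the term indices to words, and prints each topic with its term weights.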

val topicIndices = em_ldaModel.describeTopics(maxTermsPerTopic = 5)
val vocabList = vectorizer.vocabulary
val topics = topicIndices.map { case (terms, termWeights) =>
  terms.map(vocabList(_)).zip(termWeights)
}
println(s"$numTopics topics:")
topics.zipWithIndex.foreach { case (topic, i) =>
  println(s"TOPIC $i")
  topic.foreach { case (term, weight) => println(s"$term\t$weight") }
  println(s"==========")
}
20 topics:
TOPIC 0
just	0.030515134931284552
like	0.02463563559747823
want	0.022529385381465025
damn	0.02094828832824297
going	0.0203407289886203
==========
TOPIC 1
yeah	0.10787301090151602
look	0.0756831002291994
know	0.04815746564274915
wait	0.03897182014529944
night	0.0341458394828345
==========
TOPIC 2
gonna	0.08118584492034046
money	0.051736711600637544
shit	0.04620430294274594
fuck	0.0399843125556081
kill	0.03672740843080258
==========
TOPIC 3
people	0.020091372023286612
know	0.018613400462887356
work	0.016775643603287843
does	0.015522555458447744
think	0.012161168331925723
==========
TOPIC 4
know	0.031956573561538214
just	0.030674598809934856
want	0.027663491240851962
tell	0.025727217382788027
right	0.02300853167338119
==========
TOPIC 5
love	0.05932570200934131
father	0.030080735900045442
life	0.01769248067468245
true	0.016281752071881345
young	0.014927950883812253
==========
TOPIC 6
remember	0.03998401809663685
went	0.01737965538107633
lost	0.016916065536574213
called	0.016443441316683228
story	0.014849882671062261
==========
TOPIC 7
house	0.028911209424810257
miss	0.025669944694943093
right	0.02091105252727788
family	0.017862939987512365
important	0.013959164390834044
==========
TOPIC 8
saying	0.022939827090645636
know	0.021335083902970984
idea	0.017628999871937747
business	0.017302568063786224
police	0.012284217866942303
==========
TOPIC 9
know	0.051876601466269136
like	0.03828159069993671
maybe	0.03754385940676905
just	0.031938551661426284
want	0.02876693222824349
==========
TOPIC 10
years	0.032537676027398765
going	0.030596831997667568
case	0.02049555392502822
doctor	0.018671171294737107
working	0.017672067172167016
==========
TOPIC 11
stuff	0.02236582778896705
school	0.020057798194969816
john	0.017134198006217606
week	0.017075852415410653
thousand	0.017013413435021035
==========
TOPIC 12
little	0.08663446368316245
girl	0.035120377589734936
like	0.02992080326340266
woman	0.0240813719635157
baby	0.022471517953608963
==========
TOPIC 13
know	0.0283115823590395
leave	0.02744935904744228
time	0.02050833156294194
want	0.020124145131863225
just	0.019466336438890477
==========
TOPIC 14
didn	0.08220031921979461
like	0.05062323326717784
real	0.03087838046777391
guess	0.02452989702353384
says	0.022815035397008333
==========
TOPIC 15
minutes	0.018541518543996716
time	0.014737962244588431
captain	0.012594614743931537
thirty	0.01193707771669708
ship	0.011260576815409516
==========
TOPIC 16
okay	0.08153575328080886
just	0.050004142902999975
right	0.03438984898476042
know	0.02821327795933634
home	0.023397063860326372
==========
TOPIC 17
country	0.011270500385627474
power	0.010428408353623762
president	0.009392162067926028
fight	0.00799742811584178
possible	0.007597974486019279
==========
TOPIC 18
know	0.09541058020800194
think	0.0698707939786508
really	0.06881812755565207
mean	0.02909700228968688
just	0.028699687473471538
==========
TOPIC 19
dead	0.03833642117149438
like	0.017873711992106994
hand	0.015280854355409379
white	0.013718491413582671
blood	0.012699265888344448
==========
topicIndices: Array[(Array[Int], Array[Double])] = Array((Array(1, 2, 3, 135, 6),Array(0.030515134931284552, 0.02463563559747823, 0.022529385381465025, 0.02094828832824297, 0.0203407289886203)), (Array(8, 12, 0, 57, 32),Array(0.10787301090151602, 0.0756831002291994, 0.04815746564274915, 0.03897182014529944, 0.0341458394828345)), (Array(20, 35, 42, 51, 58),Array(0.08118584492034046, 0.051736711600637544, 0.04620430294274594, 0.0399843125556081, 0.03672740843080258)), (Array(22, 0, 34, 43, 4),Array(0.020091372023286612, 0.018613400462887356, 0.016775643603287843, 0.015522555458447744, 0.012161168331925723)), (Array(0, 1, 3, 9, 5),Array(0.031956573561538214, 0.030674598809934856, 0.027663491240851962, 0.025727217382788027, 0.02300853167338119)), (Array(27, 74, 31, 168, 202),Array(0.05932570200934131, 0.030080735900045442, 0.01769248067468245, 0.016281752071881345, 0.014927950883812253)), (Array(53, 92, 180, 113, 
166),Array(0.03998401809663685, 0.01737965538107633, 0.016916065536574213, 0.016443441316683228, 0.014849882671062261)), (Array(78, 110, 5, 171, 232),Array(0.028911209424810257, 0.025669944694943093, 0.02091105252727788, 0.017862939987512365, 0.013959164390834044)), (Array(119, 0, 107, 106, 219),Array(0.022939827090645636, 0.021335083902970984, 0.017628999871937747, 0.017302568063786224, 0.012284217866942303)), (Array(0, 2, 24, 1, 3),Array(0.051876601466269136, 0.03828159069993671, 0.03754385940676905, 0.031938551661426284, 0.02876693222824349)), (Array(41, 6, 140, 162, 177),Array(0.032537676027398765, 0.030596831997667568, 0.02049555392502822, 0.018671171294737107, 0.017672067172167016)), (Array(118, 130, 181, 174, 170),Array(0.02236582778896705, 0.020057798194969816, 0.017134198006217606, 0.017075852415410653, 0.017013413435021035)), (Array(18, 62, 2, 114, 122),Array(0.08663446368316245, 0.035120377589734936, 0.02992080326340266, 0.0240813719635157, 0.022471517953608963)), (Array(0, 64, 11, 3, 1),Array(0.0283115823590395, 0.02744935904744228, 0.02050833156294194, 0.020124145131863225, 0.019466336438890477)), (Array(13, 2, 67, 59, 111),Array(0.08220031921979461, 0.05062323326717784, 0.03087838046777391, 0.02452989702353384, 0.022815035397008333)), (Array(158, 11, 233, 274, 295),Array(0.018541518543996716, 0.014737962244588431, 0.012594614743931537, 0.01193707771669708, 0.011260576815409516)), (Array(16, 1, 5, 0, 49),Array(0.08153575328080886, 0.050004142902999975, 0.03438984898476042, 0.02821327795933634, 0.023397063860326372)), (Array(257, 279, 313, 291, 351),Array(0.011270500385627474, 0.010428408353623762, 0.009392162067926028, 0.00799742811584178, 0.007597974486019279)), (Array(0, 4, 17, 14, 1),Array(0.09541058020800194, 0.0698707939786508, 0.06881812755565207, 0.02909700228968688, 0.028699687473471538)), (Array(54, 2, 198, 248, 266),Array(0.03833642117149438, 0.017873711992106994, 0.015280854355409379, 0.013718491413582671, 0.012699265888344448))) vocabList: 
Array[String] = Array(know, just, like, want, think, right, going, good, yeah, tell, come, time, look, didn, mean, make, okay, really, little, sure, gonna, thing, people, said, maybe, need, sorry, love, talk, thought, doing, life, night, things, work, money, better, told, long, help, believe, years, shit, does, away, place, hell, doesn, great, home, feel, fuck, kind, remember, dead, course, wouldn, wait, kill, guess, understand, thank, girl, wrong, leave, listen, talking, real, hear, stop, nice, happened, fine, wanted, father, gotta, mind, fucking, house, wasn, getting, world, stay, mother, left, came, care, thanks, knew, room, trying, guys, went, looking, coming, heard, friend, haven, seen, best, tonight, live, used, matter, killed, pretty, business, idea, couldn, head, miss, says, wife, called, woman, morning, tomorrow, start, stuff, saying, play, hello, baby, hard, probably, minute, days, took, somebody, today, school, meet, gone, crazy, wants, damn, forget, problem, cause, deal, case, friends, point, hope, jesus, afraid, looks, knows, year, worry, exactly, aren, half, thinking, shut, hold, wanna, face, minutes, bring, word, read, doctor, everybody, supposed, makes, story, turn, true, watch, thousand, family, brother, kids, week, happen, fuckin, working, open, happy, lost, john, hurt, town, ready, alright, late, actually, married, gave, beautiful, soon, jack, times, sleep, door, having, drink, hand, easy, gets, chance, young, trouble, different, anybody, shot, rest, hate, death, second, later, asked, phone, wish, check, quite, walk, change, police, couple, question, close, taking, heart, hours, making, comes, anymore, truth, trust, dollars, important, captain, telling, funny, person, honey, goes, eyes, reason, inside, stand, break, means, number, tried, high, white, water, suppose, body, sick, game, excuse, party, women, country, answer, christ, waiting, office, send, pick, alive, sort, blood, black, daddy, line, husband, goddamn, book, fifty, thirty, fact, 
million, died, hands, power, stupid, started, shouldn, months, boys, city, sense, dinner, running, hour, shoot, fight, drive, speak, george, ship, living, figure, dear, street, ahead, lady, seven, scared, free, feeling, frank, able, children, safe, moment, outside, news, president, brought, write, happens, sent, bullshit, lose, light, glad, child, girls, sounds, sister, promise, lives, till, sound, weren, save, poor, cool, shall, asking, plan, king, bitch, daughter, weeks, beat, york, cold, worth, taken, harry, needs, piece, movie, fast, possible, small, goin, straight, human, hair, company, food, tired, lucky, pull, wonderful, touch, looked, thinks, state, picture, leaving, words, control, clear, known, special, buddy, luck, order, follow, expect, mary, catch, mouth, worked, mister, learn, playing, perfect, dream, calling, questions, hospital, takes, ride, coffee, miles, parents, works, secret, hotel, explain, kidding, worse, past, outta, general, felt, drop, unless, throw, interested, hang, certainly, absolutely, earth, loved, dark, wonder, accident, seeing, turned, clock, simple, doin, date, sweet, meeting, clean, sign, feet, handle, music, report, giving, army, fucked, cops, charlie, smart, yesterday, information, fall, fault, bank, class, month, blow, swear, caught, major, paul, road, talked, choice, plane, boss, david, paid, wear, american, worried, lord, paper, goodbye, clothes, ones, terrible, strange, given, mistake, finish, kept, blue, murder, hurry, apartment, sell, middle, nothin, careful, hasn, meant, walter, moving, changed, imagine, fair, difference, quiet, happening, near, quit, personal, marry, figured, future, rose, building, mama, michael, early, agent, kinda, watching, private, trip, record, certain, busy, jimmy, broke, sake, longer, store, boat, stick, finally, born, evening, sitting, bucks, ought, chief, lying, history, kiss, honor, darling, lunch, favor, fool, uncle, respect, rich, land, liked, killing, peter, tough, brain, interesting, 
completely, problems, welcome, nick, wake, honest, radio, dick, cash, dance, dude, james, bout, floor, weird, court, calls, jail, drunk, window, involved, johnny, officer, needed, asshole, situation, spend, books, relax, pain, service, grand, dangerous, letter, security, stopped, offer, realize, table, bastard, message, instead, killer, jake, deep, nervous, somethin, pass, evil, english, bought, short, step, ring, picked, likes, machine, eddie, voice, upset, forgot, carry, lived, afternoon, fear, quick, finished, count, forgive, wrote, named, decided, totally, space, team, pleasure, doubt, lawyer, station, gotten, suit, bother, prove, return, slow, pictures, bunch, strong, list, wearing, driving, join, tape, christmas, attack, appreciate, force, church, college, hungry, standing, present, dying, prison, missing, charge, board, truck, public, calm, gold, staying, ball, hardly, hadn, missed, lead, island, government, horse, cover, french, reach, joke, fish, star, mike, surprise, america, moved, soul, dress, seconds, club, self, putting, movies, lots, cost, listening, price, saved, smell, mark, peace, dreams, entire, crime, gives, usually, single, department, holy, beer, west, protect, stuck, wall, nose, ways, teach, forever, grow, train, type, awful, rock, detective, billy, walking, dumb, papers, beginning, planet, folks, park, attention, birthday, hide, card, master, share, reading, test, starting, lieutenant, field, partner, enjoy, twice, film, dollar, bomb, mess, blame, south, loves, girlfriend, round, records, using, plenty, especially, gentlemen, evidence, silly, experience, admit, fired, normal, talkin, mission, louis, memory, fighting, lock, notice, crap, wedding, promised, marriage, ground, guns, glass, idiot, orders, impossible, heaven, knock, hole, neck, animal, spent, green, wondering, nuts, press, drugs, broken, position, names, asleep, jerry, visit, boyfriend, acting, feels, plans, paris, smoke, tells, wind, cross, holding, sheriff, gimme, walked, 
mention, writing, double, brothers, code, judge, pardon, keeps, fellow, fell, closed, lovely, angry, cute, charles, surprised, percent, correct, bathroom, agree, address, andy, ridiculous, summer, tommy, rules, group, account, note, pulled, sleeping, sing, learned, proud, laugh, colonel, upstairs, river, difficult, built, jump, area, dirty, betty, bridge, breakfast, bobby, locked, amazing, north, feelings, alex, plus, definitely, worst, accept, kick, seriously, grace, steal, wild, stories, file, gettin, relationship, advice, nature, contact, spot, places, waste, knowing, beach, stole, apart, favorite, faith, level, loose, risk, song, eating, foot, played, patient, washington, turns, witness, action, build, obviously, begin, split, crew, command, games, tight, decide, nurse, keeping, runs, form, bird, copy, insane, complete, arrest, consider, taste, scene, jeffrey, teeth, shoes, career, henry, sooner, devil, monster, showed, weekend, gift, innocent, study, heavy, hall, comin, danger, greatest, track, keys, raise, destroy, concerned, program, carl, blind, apologize, suddenly, hanging, bruce, california, chicken, seventy, forward, drinking, sweetheart, medical, suspect, admiral, guard, shop, professor, legs, willing, camp, data, ticket, tree, goodnight, television, losing, senator, murdered, burn, dunno, paying, possibly, trick, dropped, credit, extra, starts, warm, hiding, meaning, sold, stone, taught, marty, lately, cheap, lookin, science, simply, jeff, corner, harold, following, majesty, queen, duty, cars, training, heads, seat, discuss, bear, enemy, helped, noticed, common, screw, dave) topics: Array[Array[(String, Double)]] = Array(Array((just,0.030515134931284552), (like,0.02463563559747823), (want,0.022529385381465025), (damn,0.02094828832824297), (going,0.0203407289886203)), Array((yeah,0.10787301090151602), (look,0.0756831002291994), (know,0.04815746564274915), (wait,0.03897182014529944), (night,0.0341458394828345)), Array((gonna,0.08118584492034046), 
(money,0.051736711600637544), (shit,0.04620430294274594), (fuck,0.0399843125556081), (kill,0.03672740843080258)), Array((people,0.020091372023286612), (know,0.018613400462887356), (work,0.016775643603287843), (does,0.015522555458447744), (think,0.012161168331925723)), Array((know,0.031956573561538214), (just,0.030674598809934856), (want,0.027663491240851962), (tell,0.025727217382788027), (right,0.02300853167338119)), Array((love,0.05932570200934131), (father,0.030080735900045442), (life,0.01769248067468245), (true,0.016281752071881345), (young,0.014927950883812253)), Array((remember,0.03998401809663685), (went,0.01737965538107633), (lost,0.016916065536574213), (called,0.016443441316683228), (story,0.014849882671062261)), Array((house,0.028911209424810257), (miss,0.025669944694943093), (right,0.02091105252727788), (family,0.017862939987512365), (important,0.013959164390834044)), Array((saying,0.022939827090645636), (know,0.021335083902970984), (idea,0.017628999871937747), (business,0.017302568063786224), (police,0.012284217866942303)), Array((know,0.051876601466269136), (like,0.03828159069993671), (maybe,0.03754385940676905), (just,0.031938551661426284), (want,0.02876693222824349)), Array((years,0.032537676027398765), (going,0.030596831997667568), (case,0.02049555392502822), (doctor,0.018671171294737107), (working,0.017672067172167016)), Array((stuff,0.02236582778896705), (school,0.020057798194969816), (john,0.017134198006217606), (week,0.017075852415410653), (thousand,0.017013413435021035)), Array((little,0.08663446368316245), (girl,0.035120377589734936), (like,0.02992080326340266), (woman,0.0240813719635157), (baby,0.022471517953608963)), Array((know,0.0283115823590395), (leave,0.02744935904744228), (time,0.02050833156294194), (want,0.020124145131863225), (just,0.019466336438890477)), Array((didn,0.08220031921979461), (like,0.05062323326717784), (real,0.03087838046777391), (guess,0.02452989702353384), (says,0.022815035397008333)), 
Array((minutes,0.018541518543996716), (time,0.014737962244588431), (captain,0.012594614743931537), (thirty,0.01193707771669708), (ship,0.011260576815409516)), Array((okay,0.08153575328080886), (just,0.050004142902999975), (right,0.03438984898476042), (know,0.02821327795933634), (home,0.023397063860326372)), Array((country,0.011270500385627474), (power,0.010428408353623762), (president,0.009392162067926028), (fight,0.00799742811584178), (possible,0.007597974486019279)), Array((know,0.09541058020800194), (think,0.0698707939786508), (really,0.06881812755565207), (mean,0.02909700228968688), (just,0.028699687473471538)), Array((dead,0.03833642117149438), (like,0.017873711992106994), (hand,0.015280854355409379), (white,0.013718491413582671), (blood,0.012699265888344448)))
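
Several high-frequency words recur across the topics listed above. As a rough check that may help when deciding which words to refilter as stopwords, the following sketch counts, for each term, how many of the 20 topics list it among their top 5. It uses plain Scala collections with the top terms copied from this run; the names topTerms, topicsPerTerm and shared are hypothetical, and the threshold of three topics is an arbitrary choice for illustration.

```scala
// Top-5 terms of each of the 20 topics, copied from the output of this run.
val topTerms: Seq[Seq[String]] = Seq(
  Seq("just", "like", "want", "damn", "going"),
  Seq("yeah", "look", "know", "wait", "night"),
  Seq("gonna", "money", "shit", "fuck", "kill"),
  Seq("people", "know", "work", "does", "think"),
  Seq("know", "just", "want", "tell", "right"),
  Seq("love", "father", "life", "true", "young"),
  Seq("remember", "went", "lost", "called", "story"),
  Seq("house", "miss", "right", "family", "important"),
  Seq("saying", "know", "idea", "business", "police"),
  Seq("know", "like", "maybe", "just", "want"),
  Seq("years", "going", "case", "doctor", "working"),
  Seq("stuff", "school", "john", "week", "thousand"),
  Seq("little", "girl", "like", "woman", "baby"),
  Seq("know", "leave", "time", "want", "just"),
  Seq("didn", "like", "real", "guess", "says"),
  Seq("minutes", "time", "captain", "thirty", "ship"),
  Seq("okay", "just", "right", "know", "home"),
  Seq("country", "power", "president", "fight", "possible"),
  Seq("know", "think", "really", "mean", "just"),
  Seq("dead", "like", "hand", "white", "blood")
)

// For every term, count the number of topics whose top-5 list contains it.
val topicsPerTerm: Map[String, Int] =
  topTerms.flatMap(_.distinct).groupBy(identity).map { case (term, hits) => term -> hits.size }

// Terms shared by three or more topics, most widely shared first (ties broken alphabetically).
val shared: Seq[(String, Int)] =
  topicsPerTerm.toSeq.filter(_._2 >= 3).sortBy { case (term, n) => (-n, term) }

shared.foreach { case (term, n) => println(s"$term appears in the top 5 of $n topics") }
```

In this run "know" is among the top 5 terms of 8 topics, "just" of 6 and "like" of 5, so words of this kind are natural candidates for an extended stopword list.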
top10ConversationsPerTopic(2)
res54: (Array[Long], Array[Double]) = (Array(22243, 39967, 18136, 18149, 59043, 61513, 34087, 75874, 66270, 68876),Array(0.9986758340945384, 0.99866200816902, 0.9982983538060165, 0.9982983538060165, 0.9982983538060165, 0.9982983538060165, 0.9982983538060165, 0.9982983538060165, 0.9982983538060165, 0.9982983538060165))
top10ConversationsPerTopic(2)._1
res55: Array[Long] = Array(22243, 39967, 18136, 18149, 59043, 61513, 34087, 75874, 66270, 68876)
val scenesForTopic2 = sc.parallelize(top10ConversationsPerTopic(2)._1).toDF("id")
scenesForTopic2: org.apache.spark.sql.DataFrame = [id: bigint]
display(scenesForTopic2.join(corpusDF,"id"))
id | corpus | movieTitle | movieYear
22243 | Fuck him. :-()-: Don't. :-()-: Fuck her too. | panic room | 2002
59043 | Are you ok? :-()-: Fuck no. | magnolia | 1999
66270 | Hey now... what the fuck... ? :-()-: Again. | red white black & blue | 2006
75874 | What about Moliere? :-()-: Fuck off. | the beach | 2000/I
68876 | What the fuck is that? :-()-: A switchblade. | seven | 1979
34087 | Fuck me! Yes! :-()-: Uh... | american pie | 1999
61513 | What the fuck is that?! :-()-: Screamer. | arcade | 1993
18136 | What the fuck was that about? :-()-: She was jonesing for me. | made | 2001
18149 | C'mon... :-()-: Fuck... | made | 2001
39967 | Shit, shit, shit... :-()-: You're almost there, you can do it -- can do -- can do. | broadcast news | 1987
sc.parallelize(top10ConversationsPerTopic(2)._1).toDF("id").join(corpusDF,"id").show(10,false)
+-----+----------------------------------------------------------------------------------+----------------------+---------+
|id   |corpus                                                                            |movieTitle            |movieYear|
+-----+----------------------------------------------------------------------------------+----------------------+---------+
|22243|Fuck him. :-()-: Don't. :-()-: Fuck her too.                                      |panic room            |2002     |
|59043|Are you ok? :-()-: Fuck no.                                                       |magnolia              |1999     |
|66270|Hey now... what the fuck... ? :-()-: Again.                                       |red white black & blue|2006     |
|75874|What about Moliere? :-()-: Fuck off.                                              |the beach             |2000/I   |
|68876|What the fuck is that? :-()-: A switchblade.                                      |seven                 |1979     |
|34087|Fuck me! Yes! :-()-: Uh...                                                        |american pie          |1999     |
|61513|What the fuck is that?! :-()-: Screamer.                                          |arcade                |1993     |
|18136|What the fuck was that about? :-()-: She was jonesing for me.                     |made                  |2001     |
|18149|C'mon... :-()-: Fuck...                                                           |made                  |2001     |
|39967|Shit, shit, shit... :-()-: You're almost there, you can do it -- can do -- can do.|broadcast news        |1987     |
+-----+----------------------------------------------------------------------------------+----------------------+---------+
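
Note that the joined rows above are not listed in the order of the ids returned by top10ConversationsPerTopic(2), so the ranking by topic weight is not visible in the joined output. One way to keep it is to carry the weight along with each id and sort on it after the lookup. The sketch below shows the idea in plain Scala collections for the three highest-weighted conversations of topic 2 in this run; Scene, scenesById and ranked are hypothetical names, and the Map lookup stands in for the join with corpusDF.

```scala
// Hypothetical stand-in for the columns of corpusDF needed here.
final case class Scene(id: Long, movieTitle: String, movieYear: String)

// Ids and topic weights of the three top-ranked conversations for topic 2 in this run.
val ids: Array[Long] = Array(22243L, 39967L, 18136L)
val weights: Array[Double] = Array(0.9986758340945384, 0.99866200816902, 0.9982983538060165)

// Rows keyed by id; a Map has no meaningful order, much like the result of a join.
val scenesById: Map[Long, Scene] = Seq(
  Scene(18136L, "made", "2001"),
  Scene(22243L, "panic room", "2002"),
  Scene(39967L, "broadcast news", "1987")
).map(s => s.id -> s).toMap

// Pair each id with its weight, look up the row, then sort by descending weight.
val ranked: Seq[(Scene, Double)] =
  ids.zip(weights).toSeq
    .flatMap { case (id, w) => scenesById.get(id).map(scene => (scene, w)) }
    .sortBy { case (_, w) => -w }

ranked.foreach { case (scene, w) => println(s"${scene.id}\t${scene.movieTitle}\t$w") }
```

In the notebook itself the same idea would presumably amount to building a two-column DataFrame of id and weight before the join and ordering by the weight column afterwards; that variant is not run here.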
sc.parallelize(top10ConversationsPerTopic(5)._1).toDF("id").join(corpusDF,"id").show(10,false)
+-----+---------------------------------------------------------+-----------------+---------+
|id   |corpus                                                   |movieTitle       |movieYear|
+-----+---------------------------------------------------------+-----------------+---------+
|68250|I love you man :-()-: I love you too.                    |say anything...  |1989     |
|31256|I love you. :-()-: I love you.                           |total recall     |1990     |
|868  |I love you. :-()-: I love you.                           |8mm              |1999     |
|17285|Do me. :-()-: I love you. :-()-: I love you.             |little nicky     |2000     |
|56529|Why do you love me? :-()-: Why do you love me?           |jerry maguire    |1996     |
|67529|I love you, too. :-()-: I love you. I love you.          |runaway bride    |1999     |
|82132|Why did you say that? :-()-: Say what? :-()-: I love you.|willow           |1988     |
|50163|I love you, Bud. :-()-: I love you more.                 |frequency        |2000     |
|39173|I love you. :-()-: I love you too, Dad.                  |body of evidence |1993     |
|57385|Yes? :-()-: I love you...                                |kramer vs. kramer|1979     |
+-----+---------------------------------------------------------+-----------------+---------+
corpusDF.show(5)
+-----+--------------------+------------+---------+
|   id|              corpus|  movieTitle|movieYear|
+-----+--------------------+------------+---------+
|17668|This would be fun...|lost horizon|     1937|
|17598|Cave, eh? Where? ...|lost horizon|     1937|
|17663|Something grand a...|lost horizon|     1937|
|17593|You see? You get ...|lost horizon|     1937|
|17658|Let me up! Let me...|lost horizon|     1937|
+-----+--------------------+------------+---------+
only showing top 5 rows