035_LDA_CornellMovieDialogs(Scala)

Archived YouTube video of this live unedited lab-lecture:


Topic Modeling of Movie Dialogs with Latent Dirichlet Allocation

Let us cluster the conversations from different movies!

This notebook will provide a brief algorithm summary, links for further reading, and an example of how to use LDA for Topic Modeling.

Note: this notebook has not yet been tested in Spark 2.2 (see notebook 034 for syntactic issues, if any).

Algorithm Summary

Readings for LDA

Also read the methodological and more formal papers cited in the above links if you want to know more.

Let's get a bird's eye view of LDA from https://www.cs.princeton.edu/~blei/papers/Blei2012.pdf next.

  • See pictures (hopefully you read the paper last night!)
  • Algorithm of the generative model (this is unsupervised clustering)
  • For a careful introduction to the topic see Sections 27.3 and 27.4 (pages 950-970) of Murphy's Machine Learning: A Probabilistic Perspective, MIT Press, 2012.
  • We will be quite application-focused here!
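
As a compact reminder of the generative model mentioned above, here is a sketch in the notation of the Blei (2012) review. The symbols are assumptions for illustration and do not appear elsewhere in this notebook: K topics, D documents (here, one document per conversation), N_d words in document d, and Dirichlet hyperparameters \eta and \alpha.

```latex
% Generative process (notation assumed, following Blei 2012)
\begin{align*}
\beta_k &\sim \mathrm{Dirichlet}(\eta), & k &= 1,\dots,K \\
\theta_d &\sim \mathrm{Dirichlet}(\alpha), & d &= 1,\dots,D \\
z_{d,n} \mid \theta_d &\sim \mathrm{Multinomial}(\theta_d), & n &= 1,\dots,N_d \\
w_{d,n} \mid z_{d,n}, \beta_{1:K} &\sim \mathrm{Multinomial}(\beta_{z_{d,n}})
\end{align*}

% Joint distribution of hidden and observed variables
\[
p(\beta_{1:K}, \theta_{1:D}, z_{1:D}, w_{1:D})
  = \prod_{k=1}^{K} p(\beta_k)
    \prod_{d=1}^{D} p(\theta_d)
    \prod_{n=1}^{N_d} p(z_{d,n} \mid \theta_d)\, p(w_{d,n} \mid \beta_{1:K}, z_{d,n})
\]
```

Only the words w are observed; the topics \beta, the per-document topic proportions \theta and the per-word topic assignments z are inferred, which is why this is unsupervised.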

Probabilistic Topic Modeling Example

This is an outline of our Topic Modeling workflow. Feel free to jump to any subtopic to find out more.

  • Step 0. Dataset Review
  • Step 1. Downloading and Loading Data into DBFS
    • (Step 1. only needs to be done once per shard - see details at the end of the notebook for Step 1.)
  • Step 2. Loading the Data and Data Cleaning
  • Step 3. Text Tokenization
  • Step 4. Remove Stopwords
  • Step 5. Vector of Token Counts
  • Step 6. Create LDA model with Online Variational Bayes
  • Step 7. Review Topics
  • Step 8. Model Tuning - Refilter Stopwords
  • Step 9. Create LDA model with Expectation Maximization
  • Step 10. Visualize Results

Step 0. Dataset Review

In this example, we will use the Cornell Movie Dialogs Corpus.

Here is the README.txt:



Cornell Movie-Dialogs Corpus

Distributed together with:

"Chameleons in imagined conversations: A new approach to understanding coordination of linguistic style in dialogs" Cristian Danescu-Niculescu-Mizil and Lillian Lee Proceedings of the Workshop on Cognitive Modeling and Computational Linguistics, ACL 2011.

(this paper is included in this zip file)

NOTE: If you have results to report on these corpora, please send email to cristian@cs.cornell.edu or llee@cs.cornell.edu so we can add you to our list of people using this data. Thanks!

Contents of this README:

    A) Brief description
    B) Files description
    C) Details on the collection procedure
    D) Contact

A) Brief description:

This corpus contains a metadata-rich collection of fictional conversations extracted from raw movie scripts:

  • 220,579 conversational exchanges between 10,292 pairs of movie characters
  • involves 9,035 characters from 617 movies
  • in total 304,713 utterances
  • movie metadata included:
      - genres
      - release year
      - IMDB rating
      - number of IMDB votes
      - IMDB rating
    
  • character metadata included:
      - gender (for 3,774 characters)
      - position on movie credits (3,321 characters)
    

B) Files description:

In all files the field separator is " +++$+++ "

  • movie_titles_metadata.txt

      - contains information about each movie title
      - fields:
              - movieID,
              - movie title,
              - movie year,
              - IMDB rating,
              - no. IMDB votes,
              - genres in the format ['genre1','genre2',...,'genreN']
    
  • movie_characters_metadata.txt

      - contains information about each movie character
      - fields:
              - characterID
              - character name
              - movieID
              - movie title
              - gender ("?" for unlabeled cases)
              - position in credits ("?" for unlabeled cases)
    
  • movie_lines.txt

      - contains the actual text of each utterance
      - fields:
              - lineID
              - characterID (who uttered this phrase)
              - movieID
              - character name
              - text of the utterance
    
  • movie_conversations.txt

      - the structure of the conversations
      - fields
              - characterID of the first character involved in the conversation
              - characterID of the second character involved in the conversation
              - movieID of the movie in which the conversation occurred
              - list of the utterances that make the conversation, in chronological
                      order: ['lineID1','lineID2',...,'lineIDN']
                      has to be matched with movie_lines.txt to reconstruct the actual content
    
  • raw_script_urls.txt

      - the urls from which the raw sources were retrieved
    

C) Details on the collection procedure:

We started from raw publicly available movie scripts (sources acknowledged in raw_script_urls.txt). In order to collect the metadata necessary for this study and to distinguish between two script versions of the same movie, we automatically matched each script with an entry in movie database provided by IMDB (The Internet Movie Database; data interfaces available at http://www.imdb.com/interfaces). Some amount of manual correction was also involved. When more than one movie with the same title was found in IMBD, the match was made with the most popular title (the one that received most IMDB votes)

After discarding all movies that could not be matched or that had less than 5 IMDB votes, we were left with 617 unique titles with metadata including genre, release year, IMDB rating and no. of IMDB votes and cast distribution. We then identified the pairs of characters that interact and separated their conversations automatically using simple data processing heuristics. After discarding all pairs that exchanged less than 5 conversational exchanges there were 10,292 left, exchanging 220,579 conversational exchanges (304,713 utterances). After automatically matching the names of the 9,035 involved characters to the list of cast distribution, we used the gender of each interpreting actor to infer the fictional gender of a subset of 3,321 movie characters (we raised the number of gendered 3,774 characters through manual annotation). Similarly, we collected the end credit position of a subset of 3,321 characters as a proxy for their status.

D) Contact:

Please email any questions to: cristian@cs.cornell.edu (Cristian Danescu-Niculescu-Mizil)



Step 2. Loading the Data and Data Cleaning

We have already used the wget command to download the file, and put it in our distributed file system (this process takes about 1 minute). To repeat these steps or to download data from another source follow the steps at the bottom of this worksheet on Step 1. Downloading and Loading Data into DBFS.

Let's make sure these files are in dbfs now:

// this is where the data resides in dbfs (see below to download it first, if you go to a new shard!)
display(dbutils.fs.ls("dbfs:/datasets/sds/nlp/cornell_movie_dialogs_corpus/")) 
path | name | size
dbfs:/datasets/sds/nlp/cornell_movie_dialogs_corpus/README.txt | README.txt | 4181
dbfs:/datasets/sds/nlp/cornell_movie_dialogs_corpus/movie_characters_metadata.txt | movie_characters_metadata.txt | 705695
dbfs:/datasets/sds/nlp/cornell_movie_dialogs_corpus/movie_conversations.txt | movie_conversations.txt | 6760930
dbfs:/datasets/sds/nlp/cornell_movie_dialogs_corpus/movie_lines.txt | movie_lines.txt | 34641919
dbfs:/datasets/sds/nlp/cornell_movie_dialogs_corpus/movie_titles_metadata.txt | movie_titles_metadata.txt | 67289
dbfs:/datasets/sds/nlp/cornell_movie_dialogs_corpus/raw_script_urls.txt | raw_script_urls.txt | 56177

Conversations Data

// Load the text file and pair each line with its index (its position in the file)
val conversationsRaw = sc.textFile("dbfs:/datasets/sds/nlp/cornell_movie_dialogs_corpus/movie_conversations.txt").zipWithIndex()
conversationsRaw: org.apache.spark.rdd.RDD[(String, Long)] = ZippedWithIndexRDD[50] at zipWithIndex at <console>:33

Review five lines to get a sense of the data format. Note that top(5) returns the five largest elements under the default ordering rather than the first five lines of the file; take(5) would return the first five.

conversationsRaw.top(5).foreach(println) // the five largest (String, Long) pairs in the RDD
(u999 +++$+++ u1006 +++$+++ m65 +++$+++ ['L227588', 'L227589', 'L227590', 'L227591', 'L227592', 'L227593', 'L227594', 'L227595', 'L227596'],8954)
(u998 +++$+++ u1005 +++$+++ m65 +++$+++ ['L228159', 'L228160'],8952)
(u998 +++$+++ u1005 +++$+++ m65 +++$+++ ['L228157', 'L228158'],8951)
(u998 +++$+++ u1005 +++$+++ m65 +++$+++ ['L228130', 'L228131'],8950)
(u998 +++$+++ u1005 +++$+++ m65 +++$+++ ['L228127', 'L228128', 'L228129'],8949)
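
The difference between taking the first elements and taking the largest elements can be sketched with plain Scala collections standing in for the RDD. The three rows below are toy data in the corpus format, and sorting in descending order is only an emulation of what top does, not Spark's implementation.

```scala
// Toy (line, index) pairs in the corpus format; NOT real corpus rows.
val sampleLines: Seq[(String, Long)] = Seq(
  ("u0 +++$+++ u2 +++$+++ m0 +++$+++ ['L194', 'L195']", 0L),
  ("u10 +++$+++ u12 +++$+++ m1 +++$+++ ['L300', 'L301']", 1L),
  ("u999 +++$+++ u1006 +++$+++ m65 +++$+++ ['L400', 'L401']", 2L)
)

// Like take(2): the first two elements in their existing order.
val firstTwo = sampleLines.take(2)

// Emulates top(2): the two largest elements under the default tuple ordering,
// which compares the strings lexicographically first ("u999" > "u10" > "u0").
val largestTwo = sampleLines.sorted(Ordering[(String, Long)].reverse).take(2)

firstTwo.foreach(println)
largestTwo.foreach(println)
```

This is why the output above starts at u999 rather than at u0.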
conversationsRaw.count // there are over 83,000 conversations in total
res0: Long = 83097
import scala.util.{Failure, Success}

val regexConversation = """\s*(\w+)\s+(\+{3}\$\+{3})\s*(\w+)\s+(\2)\s*(\w+)\s+(\2)\s*(\[.*\]\s*$)""".r

case class conversationLine(a: String, b: String, c: String, d: String)

val conversationsRaw = sc.textFile("dbfs:/datasets/sds/nlp/cornell_movie_dialogs_corpus/movie_conversations.txt")
 .zipWithIndex()
  .map(x => 
          {
            val id:Long = x._2
            val line = x._1
            val pLine = regexConversation.findFirstMatchIn(line)
                               .map(m => conversationLine(m.group(1), m.group(3), m.group(5), m.group(7))) 
                                  match {
                                    case Some(l) => Success(l)
                                    case None => Failure(new Exception(s"Non matching input: $line"))
                                  }
              (id,pLine)
           }
  )
import scala.util.{Failure, Success}
regexConversation: scala.util.matching.Regex = \s*(\w+)\s+(\+{3}\$\+{3})\s*(\w+)\s+(\2)\s*(\w+)\s+(\2)\s*(\[.*\]\s*$)
defined class conversationLine
conversationsRaw: org.apache.spark.rdd.RDD[(Long, Product with Serializable with scala.util.Try[conversationLine])] = MapPartitionsRDD[57] at map at <console>:40
conversationsRaw.filter(x => x._2.isSuccess).count()
res1: Long = 83097
conversationsRaw.filter(x => x._2.isFailure).count()
res66: Long = 0
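
To see what the regular expression extracts, here is a minimal Spark-free sketch that applies the same pattern to one line taken from the output shown in this notebook. The helper name parseConversation and the tuple result are assumptions made for illustration; the notebook itself uses the conversationLine case class.

```scala
import scala.util.{Failure, Success, Try}

// Same pattern as regexConversation; the backreference \2 re-matches the +++$+++ separator.
val sepRegex = """\s*(\w+)\s+(\+{3}\$\+{3})\s*(\w+)\s+(\2)\s*(\w+)\s+(\2)\s*(\[.*\]\s*$)""".r

// Hypothetical helper: returns (first characterID, second characterID, movieID, list of lineIDs).
// Annotating the result as Try[...] keeps the inferred type simple.
def parseConversation(line: String): Try[(String, String, String, String)] =
  sepRegex.findFirstMatchIn(line) match {
    case Some(m) => Success((m.group(1), m.group(3), m.group(5), m.group(7)))
    case None    => Failure(new Exception(s"Non matching input: $line"))
  }

val parsed = parseConversation("u0 +++$+++ u2 +++$+++ m0 +++$+++ ['L194', 'L195', 'L196', 'L197']")
val bad    = parseConversation("not a conversation line")

println(parsed)
println(bad.isFailure)
```

Groups 2, 4 and 6 capture the separator itself, which is why only the odd-numbered groups are kept. Without the explicit Try annotation, Scala infers the common supertype of Success and Failure, which is the Product with Serializable with scala.util.Try[...] type visible in the cell output above.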

Each record in conversationsRaw pairs a conversation number with the parsed conversation, whose last field lists the line IDs of that conversation.

conversationsRaw.filter(x => x._2.isSuccess).take(5).foreach(println)
(0,Success(conversationLine(u0,u2,m0,['L194', 'L195', 'L196', 'L197'])))
(1,Success(conversationLine(u0,u2,m0,['L198', 'L199'])))
(2,Success(conversationLine(u0,u2,m0,['L200', 'L201', 'L202', 'L203'])))
(3,Success(conversationLine(u0,u2,m0,['L204', 'L205', 'L206'])))
(4,Success(conversationLine(u0,u2,m0,['L207', 'L208'])))

Let's create conversations that have just the conversation ID, the position of each line within its conversation, and the line ID.

val conversations 
    = conversationsRaw
      .filter(x => x._2.isSuccess)
      .flatMap { 
        case (id,Success(l))  
                  => { val conv = l.d.replace("[","").replace("]","").replace("'","").replace(" ","")
                       val convLinesIndexed = conv.split(",").zipWithIndex
                       convLinesIndexed.map( cLI => (id, cLI._2, cLI._1))
                      }
       }.toDF("conversationID","intraConversationID","lineID")
conversations: org.apache.spark.sql.DataFrame = [conversationID: bigint, intraConversationID: int, lineID: string]
conversations.show(15)
+--------------+-------------------+------+
|conversationID|intraConversationID|lineID|
+--------------+-------------------+------+
|             0|                  0|  L194|
|             0|                  1|  L195|
|             0|                  2|  L196|
|             0|                  3|  L197|
|             1|                  0|  L198|
|             1|                  1|  L199|
|             2|                  0|  L200|
|             2|                  1|  L201|
|             2|                  2|  L202|
|             2|                  3|  L203|
|             3|                  0|  L204|
|             3|                  1|  L205|
|             3|                  2|  L206|
|             4|                  0|  L207|
|             4|                  1|  L208|
+--------------+-------------------+------+
only showing top 15 rows
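
The flattening step inside the flatMap above can be tried on a single record without Spark. This is a minimal sketch using one list of line IDs from the output shown earlier; the value names are chosen here for illustration.

```scala
// One conversation: its ID and the raw list of line IDs as it appears in the file.
val conversationID = 0L
val rawLineIDs = "['L194', 'L195', 'L196', 'L197']"

// Strip brackets, quotes and spaces, then split on commas.
val cleaned = rawLineIDs.replace("[", "").replace("]", "").replace("'", "").replace(" ", "")

// zipWithIndex records the position of each line within the conversation.
val rows: Array[(Long, Int, String)] =
  cleaned.split(",").zipWithIndex.map { case (lineID, position) => (conversationID, position, lineID) }

rows.foreach(println)
```

Each tuple corresponds to one row of the conversations DataFrame: (conversationID, intraConversationID, lineID).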

Movie Titles

val moviesMetaDataRaw = sc.textFile("dbfs:/datasets/sds/nlp/cornell_movie_dialogs_corpus/movie_titles_metadata.txt")
moviesMetaDataRaw.top(5).foreach(println)
m99 +++$+++ indiana jones and the temple of doom +++$+++ 1984 +++$+++ 7.50 +++$+++ 112054 +++$+++ ['action', 'adventure']
m98 +++$+++ indiana jones and the last crusade +++$+++ 1989 +++$+++ 8.30 +++$+++ 174947 +++$+++ ['action', 'adventure', 'thriller', 'action', 'adventure', 'fantasy']
m97 +++$+++ independence day +++$+++ 1996 +++$+++ 6.60 +++$+++ 151698 +++$+++ ['action', 'adventure', 'sci-fi', 'thriller']
m96 +++$+++ invaders from mars +++$+++ 1953 +++$+++ 6.40 +++$+++ 2115 +++$+++ ['horror', 'sci-fi']
m95 +++$+++ i am legend +++$+++ 2007 +++$+++ 7.10 +++$+++ 156084 +++$+++ ['drama', 'sci-fi', 'thriller']
moviesMetaDataRaw: org.apache.spark.rdd.RDD[String] = dbfs:/datasets/sds/nlp/cornell_movie_dialogs_corpus/movie_titles_metadata.txt MapPartitionsRDD[73] at textFile at <console>:33
moviesMetaDataRaw.count() // number of movies
res4: Long = 617
import scala.util.{Failure, Success}

/*  - contains information about each movie title
  - fields:
          - movieID,
          - movie title,
          - movie year,
          - IMDB rating,
          - no. IMDB votes,
          - genres in the format ['genre1','genre2',...,'genreN']
          */
val regexMovieMetaData = """\s*(\w+)\s+(\+{3}\$\+{3})\s*(.+)\s+(\2)\s+(.+)\s+(\2)\s+(.+)\s+(\2)\s+(.+)\s+(\2)\s+(\[.*\]\s*$)""".r

case class lineInMovieMetaData(movieID: String, movieTitle: String, movieYear: String, IMDBRating: String, NumIMDBVotes: String, genres: String)

val moviesMetaDataRaw = sc.textFile("dbfs:/datasets/sds/nlp/cornell_movie_dialogs_corpus/movie_titles_metadata.txt")
  .map(line => 
          {
            val pLine = regexMovieMetaData.findFirstMatchIn(line)
                               .map(m => lineInMovieMetaData(m.group(1), m.group(3), m.group(5), m.group(7), m.group(9), m.group(11))) 
                                  match {
                                    case Some(l) => Success(l)
                                    case None => Failure(new Exception(s"Non matching input: $line"))
                                  }
              pLine
           }
  )
import scala.util.{Failure, Success}
regexMovieMetaData: scala.util.matching.Regex = \s*(\w+)\s+(\+{3}\$\+{3})\s*(.+)\s+(\2)\s+(.+)\s+(\2)\s+(.+)\s+(\2)\s+(.+)\s+(\2)\s+(\[.*\]\s*$)
defined class lineInMovieMetaData
moviesMetaDataRaw: org.apache.spark.rdd.RDD[Product with Serializable with scala.util.Try[lineInMovieMetaData]] = MapPartitionsRDD[79] at map at <console>:49
moviesMetaDataRaw.count
res5: Long = 617
moviesMetaDataRaw.filter(x => x.isSuccess).count()
res6: Long = 617
moviesMetaDataRaw.filter(x => x.isSuccess).take(10).foreach(println)
Success(lineInMovieMetaData(m0,10 things i hate about you,1999,6.90,62847,['comedy', 'romance']))
Success(lineInMovieMetaData(m1,1492: conquest of paradise,1992,6.20,10421,['adventure', 'biography', 'drama', 'history']))
Success(lineInMovieMetaData(m2,15 minutes,2001,6.10,25854,['action', 'crime', 'drama', 'thriller']))
Success(lineInMovieMetaData(m3,2001: a space odyssey,1968,8.40,163227,['adventure', 'mystery', 'sci-fi']))
Success(lineInMovieMetaData(m4,48 hrs.,1982,6.90,22289,['action', 'comedy', 'crime', 'drama', 'thriller']))
Success(lineInMovieMetaData(m5,the fifth element,1997,7.50,133756,['action', 'adventure', 'romance', 'sci-fi', 'thriller']))
Success(lineInMovieMetaData(m6,8mm,1999,6.30,48212,['crime', 'mystery', 'thriller']))
Success(lineInMovieMetaData(m7,a nightmare on elm street 4: the dream master,1988,5.20,13590,['fantasy', 'horror', 'thriller']))
Success(lineInMovieMetaData(m8,a nightmare on elm street: the dream child,1989,4.70,11092,['fantasy', 'horror', 'thriller']))
Success(lineInMovieMetaData(m9,the atomic submarine,1959,4.90,513,['sci-fi', 'thriller']))
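
Movie titles contain spaces and punctuation, so the metadata pattern uses (.+) for the middle fields and relies on the separator backreference to find the field boundaries. Here is a minimal Spark-free sketch on one line reconstructed from the parsed output above; the names MovieMeta and parseMovieMeta are hypothetical and used only for illustration.

```scala
// Hypothetical case class mirroring the six fields of movie_titles_metadata.txt.
case class MovieMeta(movieID: String, title: String, year: String,
                     rating: String, votes: String, genres: String)

// Same pattern as regexMovieMetaData.
val metaRegex =
  """\s*(\w+)\s+(\+{3}\$\+{3})\s*(.+)\s+(\2)\s+(.+)\s+(\2)\s+(.+)\s+(\2)\s+(.+)\s+(\2)\s+(\[.*\]\s*$)""".r

// Hypothetical helper: None when the line does not match.
def parseMovieMeta(line: String): Option[MovieMeta] =
  metaRegex.findFirstMatchIn(line).map { m =>
    MovieMeta(m.group(1), m.group(3), m.group(5), m.group(7), m.group(9), m.group(11))
  }

val metaSample =
  "m3 +++$+++ 2001: a space odyssey +++$+++ 1968 +++$+++ 8.40 +++$+++ 163227 +++$+++ ['adventure', 'mystery', 'sci-fi']"
val meta = parseMovieMeta(metaSample)

println(meta)
```

Because the line holds exactly five separators and the pattern requires all of them, the greedy (.+) groups have to backtrack to the unique split in which each group holds one field.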
//moviesMetaDataRaw.filter(x => x.isFailure).take(10).foreach(println) // to regex refine for casting
val moviesMetaData 
    = moviesMetaDataRaw
      .filter(x => x.isSuccess)
      .map { case Success(l) => l }
      .toDF().select("movieID","movieTitle","movieYear")
moviesMetaData: org.apache.spark.sql.DataFrame = [movieID: string, movieTitle: string, movieYear: string]
moviesMetaData.show(10,false)
+-------+---------------------------------------------+---------+
|movieID|movieTitle                                   |movieYear|
+-------+---------------------------------------------+---------+
|m0     |10 things i hate about you                   |1999     |
|m1     |1492: conquest of paradise                   |1992     |
|m2     |15 minutes                                   |2001     |
|m3     |2001: a space odyssey                        |1968     |
|m4     |48 hrs.                                      |1982     |
|m5     |the fifth element                            |1997     |
|m6     |8mm                                          |1999     |
|m7     |a nightmare on elm street 4: the dream master|1988     |
|m8     |a nightmare on elm street: the dream child   |1989     |
|m9     |the atomic submarine                         |1959     |
+-------+---------------------------------------------+---------+
only showing top 10 rows

Lines Data

val linesRaw = sc.textFile("dbfs:/datasets/sds/nlp/cornell_movie_dialogs_corpus/movie_lines.txt")
linesRaw: org.apache.spark.rdd.RDD[String] = dbfs:/datasets/sds/nlp/cornell_movie_dialogs_corpus/movie_lines.txt MapPartitionsRDD[94] at textFile at <console>:35
linesRaw.count() // number of lines making up the conversations
res9: Long = 304713

Review five lines to get a sense of the data format. As before, top(5) returns the five largest lines under the default string ordering, not the first five lines of the file.

linesRaw.top(5).foreach(println)
L99999 +++$+++ u4166 +++$+++ m278 +++$+++ DULANEY +++$+++ You didn't know about it before that?
L99998 +++$+++ u4168 +++$+++ m278 +++$+++ JOANNE +++$+++ To show you this. It's a letter from that lawyer, Koehler. He wrote it to me the day after I saw him. He's the one who told me I could get the money if Miss Lawson went to jail.
L99997 +++$+++ u4166 +++$+++ m278 +++$+++ DULANEY +++$+++ Why'd you come here?
L99996 +++$+++ u4168 +++$+++ m278 +++$+++ JOANNE +++$+++ I'm gonna go to jail. I know they're gonna make it look like I did it. They gotta put it on someone.
L99995 +++$+++ u4168 +++$+++ m278 +++$+++ JOANNE +++$+++ What do you think I've got? A gun? Maybe I'm gonna kill you too. Maybe I'll blow your head off right now.

To see 5 random lines from movie_lines.txt, evaluate the following cell.

linesRaw.takeSample(false, 5).foreach(println)
L144853 +++$+++ u4635 +++$+++ m306 +++$+++ QUINN +++$+++ M.J., I'm going to have to borrow Ruben. The alien-smuggling thing in Chinatown is going down tomorrow night and Jack's kid got hit by a car. I gotta give Ruben to Nikko.
L597838 +++$+++ u8315 +++$+++ m565 +++$+++ AULON +++$+++ Yes, my lord.
L107915 +++$+++ u613 +++$+++ m39 +++$+++ BOURNE +++$+++ -- it's always bad and it's never anything but bits and pieces anyway! You ever think that maybe it's just making it worse? You don't wonder that?
L65662 +++$+++ u3864 +++$+++ m256 +++$+++ AUDREY +++$+++ ...Who is this?
L124159 +++$+++ u4395 +++$+++ m291 +++$+++ MALE VOICE +++$+++ She's yours. What are we waiting on?
import scala.util.{Failure, Success}

/*  fields in movie_lines.txt are:
          - lineID
          - characterID (who uttered this phrase)
          - movieID
          - character name
          - text of the utterance
          */
val regexLine = """\s*(\w+)\s+(\+{3}\$\+{3})\s*(\w+)\s+(\2)\s*(\w+)\s+(\2)\s*(.+)\s+(\2)\s*(.*$)""".r

case class lineInMovie(lineID: String, characterID: String, movieID: String, characterName: String, text: String)

val linesRaw = sc.textFile("dbfs:/datasets/sds/nlp/cornell_movie_dialogs_corpus/movie_lines.txt")
  .map(line => 
          {
            val pLine = regexLine.findFirstMatchIn(line)
                               .map(m => lineInMovie(m.group(1), m.group(3), m.group(5), m.group(7), m.group(9))) 
                                  match {
                                    case Some(l) => Success(l)
                                    case None => Failure(new Exception(s"Non matching input: $line"))
                                  }
              pLine
           }
  )
import scala.util.{Failure, Success}
regexLine: scala.util.matching.Regex = \s*(\w+)\s+(\+{3}\$\+{3})\s*(\w+)\s+(\2)\s*(\w+)\s+(\2)\s*(.+)\s+(\2)\s*(.*$)
defined class lineInMovie
linesRaw: org.apache.spark.rdd.RDD[Product with Serializable with scala.util.Try[lineInMovie]] = MapPartitionsRDD[101] at map at <console>:49
linesRaw.filter(x => x.isSuccess).count()
res11: Long = 304713
linesRaw.filter(x => x.isFailure).count()
res12: Long = 0
linesRaw.filter(x => x.isSuccess).take(5).foreach(println)
Success(lineInMovie(L1045,u0,m0,BIANCA,They do not!))
Success(lineInMovie(L1044,u2,m0,CAMERON,They do to!))
Success(lineInMovie(L985,u0,m0,BIANCA,I hope so.))
Success(lineInMovie(L984,u2,m0,CAMERON,She okay?))
Success(lineInMovie(L925,u0,m0,BIANCA,Let's go.))
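
As an alternative to the regular expression used in this notebook, the fields could also be separated by splitting on the literal field separator given in the README. This is a sketch of a different technique, not what the notebook does; the helper name splitLine is hypothetical, and the second input is a made-up row with an empty utterance.

```scala
import java.util.regex.Pattern

// Field separator as stated in the README, including the surrounding spaces.
val fieldSeparator = " +++$+++ "

// Hypothetical helper: Pattern.quote treats '+' and '$' literally, and the limit of 5
// keeps the utterance in one piece and preserves an empty trailing field.
def splitLine(line: String): Array[String] =
  line.split(Pattern.quote(fieldSeparator), 5)

val fields = splitLine("L1045 +++$+++ u0 +++$+++ m0 +++$+++ BIANCA +++$+++ They do not!")
val emptyUtterance = splitLine("L1 +++$+++ u0 +++$+++ m0 +++$+++ BIANCA +++$+++ ") // made-up row

println(fields.mkString(" | "))
println(emptyUtterance.length)
```

Unlike the regular expression, this approach does not tolerate variations in the whitespace around the separator, so checking the number of fields afterwards would still be advisable.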

Let's make a DataFrame out of the successfully parsed lines.

val lines 
    = linesRaw
      .filter(x => x.isSuccess)
      .map { case Success(l) => l }
      .toDF()
      .join(moviesMetaData, "movieID") // and join it to get movie meta data
lines: org.apache.spark.sql.DataFrame = [movieID: string, lineID: string, characterID: string, characterName: string, text: string, movieTitle: string, movieYear: string]
lines.show(5)
+-------+-------+-----------+-------------------+--------------------+------------+---------+
|movieID| lineID|characterID|      characterName|                text|  movieTitle|movieYear|
+-------+-------+-----------+-------------------+--------------------+------------+---------+
|   m124|L357776|      u1889|ASSISTANT SECRETARY| Let me have it. ...|lost horizon|     1937|
|   m124|L357775|      u1892|              CLERK|Conway's gone aga...|lost horizon|     1937|
|   m124|L357774|      u1889|ASSISTANT SECRETARY|I'll dispatch a c...|lost horizon|     1937|
|   m124|L357773|      u1892|              CLERK|           Yes, sir.|lost horizon|     1937|
|   m124|L357772|      u1889|ASSISTANT SECRETARY|Yes. Might as wel...|lost horizon|     1937|
+-------+-------+-----------+-------------------+--------------------+------------+---------+
only showing top 5 rows

Dialogs with Lines

Let's join the two DataFrames on lineID next.

val convLines = conversations.join(lines, "lineID").sort($"conversationID", $"intraConversationID")
convLines: org.apache.spark.sql.DataFrame = [lineID: string, conversationID: bigint, intraConversationID: int, movieID: string, characterID: string, characterName: string, text: string, movieTitle: string, movieYear: string]
convLines.count
res15: Long = 304713
conversations.count
res16: Long = 304713
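
The two counts above agree, which suggests that every lineID in conversations found a matching row in lines. An inner join silently drops unmatched keys, as this plain-Scala sketch illustrates. The data is a toy sample, a Map lookup stands in for the DataFrame join, and L999 is a made-up ID with no matching line.

```scala
// (conversationID, intraConversationID, lineID); L999 is a made-up ID with no matching line.
val conversationsSample: Seq[(Long, Int, String)] =
  Seq((0L, 1, "L195"), (0L, 0, "L194"), (1L, 0, "L999"))

// lineID -> text, standing in for the lines DataFrame.
val linesSample: Map[String, String] = Map(
  "L194" -> "Can we make this quick?",
  "L195" -> "Well, I thought we'd start with pronunciation, if that's okay with you."
)

// Inner join on lineID, then sort by conversation and position within the conversation.
val joined: Seq[(String, Long, Int, String)] =
  conversationsSample
    .flatMap { case (cid, pos, lineID) =>
      linesSample.get(lineID).map(text => (lineID, cid, pos, text)).toList
    }
    .sortBy { case (_, cid, pos, _) => (cid, pos) }

joined.foreach(println)
```

Here the joined result has two rows although the conversations sample has three, so comparing the counts as done above is a quick check that nothing was lost.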
display(convLines)
lineID | conversationID | intraConversationID | movieID | characterID | characterName | text | movieTitle | movieYear
L194 | 0 | 0 | m0 | u0 | BIANCA | Can we make this quick? Roxanne Korrine and Andrew Barrett are having an incredibly horrendous public break- up on the quad. Again. | 10 things i hate about you | 1999
L195 | 0 | 1 | m0 | u2 | CAMERON | Well, I thought we'd start with pronunciation, if that's okay with you. | 10 things i hate about you | 1999
L196 | 0 | 2 | m0 | u0 | BIANCA | Not the hacking and gagging and spitting part. Please. | 10 things i hate about you | 1999
L197 | 0 | 3 | m0 | u2 | CAMERON | Okay... then how 'bout we try out some French cuisine. Saturday? Night? | 10 things i hate about you | 1999
L198 | 1 | 0 | m0 | u0 | BIANCA | You're asking me out. That's so cute. What's your name again? | 10 things i hate about you | 1999
L199 | 1 | 1 | m0 | u2 | CAMERON | Forget it. | 10 things i hate about you | 1999
L200 | 2 | 0 | m0 | u0 | BIANCA | No, no, it's my fault -- we didn't have a proper introduction --- | 10 things i hate about you | 1999
L201 | 2 | 1 | m0 | u2 | CAMERON | Cameron. | 10 things i hate about you | 1999
L202 | 2 | 2 | m0 | u0 | BIANCA | The thing is, Cameron -- I'm at the mercy of a particularly hideous breed of loser. My sister. I can't date until she does. | 10 things i hate about you | 1999
L203 | 2 | 3 | m0 | u2 | CAMERON | Seems like she could get a date easy enough... | 10 things i hate about you | 1999
L204 | 3 | 0 | m0 | u2 | CAMERON | Why? | 10 things i hate about you | 1999
L205 | 3 | 1 | m0 | u0 | BIANCA | Unsolved mystery. She used to be really popular when she started high school, then it was just like she got sick of it or something. | 10 things i hate about you | 1999
L206 | 3 | 2 | m0 | u2 | CAMERON | That's a shame. | 10 things i hate about you | 1999
L207 | 4 | 0 | m0 | u0 | BIANCA | Gosh, if only we could find Kat a boyfriend... | 10 things i hate about you | 1999
L208 | 4 | 1 | m0 | u2 | CAMERON | Let me see what I can do. | 10 things i hate about you | 1999
L271 | 5 | 0 | m0 | u0 | BIANCA | C'esc ma tete. This is my head | 10 things i hate about you | 1999
L272 | 5 | 1 | m0 | u2 | CAMERON | Right. See? You're ready for the quiz. | 10 things i hate about you | 1999
L273 | 5 | 2 | m0 | u0 | BIANCA | I don't want to know how to say that though. I want to know useful things. Like where the good stores are. How much does champagne cost? Stuff like Chat. I have never in my life had to point out my head to someone. | 10 things i hate about you | 1999
L274 | 5 | 3 | m0 | u2 | CAMERON | That's because it's such a nice one. | 10 things i hate about you | 1999
L275 | 5 | 4 | m0 | u0 | BIANCA | Forget French. | 10 things i hate about you | 1999
L276 | 6 | 0 | m0 | u0 | BIANCA | How is our little Find the Wench A Date plan progressing? | 10 things i hate about you | 1999
L277 | 6 | 1 | m0 | u2 | CAMERON | Well, there's someone I think might be -- | 10 things i hate about you | 1999
L280 | 7 | 0 | m0 | u2 | CAMERON | There. | 10 things i hate about you | 1999
L281 | 7 | 1 | m0 | u0 | BIANCA | Where? | 10 things i hate about you | 1999
L363 | 8 | 0 | m0 | u2 | CAMERON | You got something on your mind? | 10 things i hate about you | 1999
L364 | 8 | 1 | m0 | u0 | BIANCA | I counted on you to help my cause. You and that thug are obviously failing. Aren't we ever going on our date? | 10 things i hate about you | 1999
L365 | 9 | 0 | m0 | u2 | CAMERON | You have my word. As a gentleman | 10 things i hate about you | 1999
L366 | 9 | 1 | m0 | u0 | BIANCA | You're sweet. | 10 things i hate about you | 1999
L367 | 10 | 0 | m0 | u2 | CAMERON | How do you get your hair to look like that? | 10 things i hate about you | 1999
L368 | 10 | 1 | m0 | u0 | BIANCA | Eber's Deep Conditioner every two days. And I never, ever use a blowdryer without the diffuser attachment. | 10 things i hate about you | 1999
L401 | 11 | 0 | m0 | u2 | CAMERON | Sure have. | 10 things i hate about you | 1999
L402 | 11 | 1 | m0 | u0 | BIANCA | I really, really, really wanna go, but I can't. Not unless my sister goes. | 10 things i hate about you | 1999
L403 | 11 | 2 | m0 | u2 | CAMERON | I'm workin' on it. But she doesn't seem to be goin' for him. | 10 things i hate about you | 1999
L404 | 12 | 0 | m0 | u2 | CAMERON | She's not a... | 10 things i hate about you | 1999
L405 | 12 | 1 | m0 | u0 | BIANCA | Lesbian? No. I found a picture of Jared Leto in one of her drawers, so I'm pretty sure she's not harboring same-sex tendencies. | 10 things i hate about you | 1999
L406 | 12 | 2 | m0 | u2 | CAMERON | So that's the kind of guy she likes? Pretty ones? | 10 things i hate about you | 1999
L407 | 12 | 3 | m0 | u0 | BIANCA | Who knows? All I've ever heard her say is that she'd dip before dating a guy that smokes. | 10 things i hate about you | 1999
L575 | 13 | 0 | m0 | u0 | BIANCA | Hi. | 10 things i hate about you | 1999
L576 | 13 | 1 | m0 | u2 | CAMERON | Looks like things worked out tonight, huh? | 10 things i hate about you | 1999
L577 | 14 | 0 | m0 | u0 | BIANCA | You know Chastity? | 10 things i hate about you | 1999
L578 | 14 | 1 | m0 | u2 | CAMERON | I believe we share an art instructor | 10 things i hate about you | 1999
L662 | 15 | 0 | m0 | u2 | CAMERON | Have fun tonight? | 10 things i hate about you | 1999
L663 | 15 | 1 | m0 | u0 | BIANCA | Tons | 10 things i hate about you | 1999
L693 | 16 | 0 | m0 | u2 | CAMERON | I looked for you back at the party, but you always seemed to be "occupied". | 10 things i hate about you | 1999
L694 | 16 | 1 | m0 | u0 | BIANCA | I was? | 10 things i hate about you | 1999
L695 | 16 | 2 | m0 | u2 | CAMERON | You never wanted to go out with 'me, did you? | 10 things i hate about you | 1999
L696 | 17 | 0 | m0 | u0 | BIANCA | Well, no... | 10 things i hate about you | 1999
L697 | 17 | 1 | m0 | u2 | CAMERON | Then that's all you had to say. | 10 things i hate about you | 1999
L698 | 17 | 2 | m0 | u0 | BIANCA | But | 10 things i hate about you | 1999
L699 | 17 | 3 | m0 | u2 | CAMERON | You always been this selfish? | 10 things i hate about you | 1999
L860 | 18 | 0 | m0 | u0 | BIANCA | Then Guillermo says, "If you go any lighter, you're gonna look like an extra on 90210." | 10 things i hate about you | 1999
L861 | 18 | 1 | m0 | u2 | CAMERON | No... | 10 things i hate about you | 1999
L862 | 19 | 0 | m0 | u0 | BIANCA | do you listen to this crap? | 10 things i hate about you | 1999
L863 | 19 | 1 | m0 | u2 | CAMERON | What crap? | 10 things i hate about you | 1999
L864 | 19 | 2 | m0 | u0 | BIANCA | Me. This endless ...blonde babble. I'm like, boring myself. | 10 things i hate about you | 1999
L865 | 19 | 3 | m0 | u2 | CAMERON | Thank God! If I had to hear one more story about your coiffure... | 10 things i hate about you | 1999
L866 | 20 | 0 | m0 | u2 | CAMERON | I figured you'd get to the good stuff eventually. | 10 things i hate about you | 1999
L867 | 20 | 1 | m0 | u0 | BIANCA | What good stuff? | 10 things i hate about you | 1999
L868 | 20 | 2 | m0 | u2 | CAMERON | The "real you". | 10 things i hate about you | 1999
L869 | 20 | 3 | m0 | u0 | BIANCA | Like my fear of wearing pastels? | 10 things i hate about you | 1999
L870 | 21 | 0 | m0 | u0 | BIANCA | I'm kidding. You know how sometimes you just become this "persona"? And you don't know how to quit? | 10 things i hate about you | 1999
L871 | 21 | 1 | m0 | u2 | CAMERON | No | 10 things i hate about you | 1999
L872 | 21 | 2 | m0 | u0 | BIANCA | Okay -- you're gonna need to learn how to lie. | 10 things i hate about you | 1999
L924 | 22 | 0 | m0 | u2 | CAMERON | Wow | 10 things i hate about you | 1999
L925 | 22 | 1 | m0 | u0 | BIANCA | Let's go. | 10 things i hate about you | 1999
L984 | 23 | 0 | m0 | u2 | CAMERON | She okay? | 10 things i hate about you | 1999
L985 | 23 | 1 | m0 | u0 | BIANCA | I hope so. | 10 things i hate about you | 1999
L1044 | 24 | 0 | m0 | u2 | CAMERON | They do to! | 10 things i hate about you | 1999
L1045 | 24 | 1 | m0 | u0 | BIANCA | They do not! | 10 things i hate about you | 1999
L49 | 25 | 0 | m0 | u0 | BIANCA | Did you change your hair? | 10 things i hate about you | 1999
L50 | 25 | 1 | m0 | u3 | CHASTITY | No. | 10 things i hate about you | 1999
L51 | 25 | 2 | m0 | u0 | BIANCA | You might wanna think about it | 10 things i hate about you | 1999
L571 | 26 | 0 | m0 | u0 | BIANCA | Where did he go? He was just here. | 10 things i hate about you | 1999
L572 | 26 | 1 | m0 | u3 | CHASTITY | Who? | 10 things i hate about you | 1999
L573 | 26 | 2 | m0 | u0 | BIANCA | Joey. | 10 things i hate about you | 1999
L579 | 27 | 0 | m0 | u3 | CHASTITY | Great | 10 things i hate about you | 1999
L580 | 27 | 1 | m0 | u0 | BIANCA | Would you mind getting me a drink, Cameron? | 10 things i hate about you | 1999
L595 | 28 | 0 | m0 | u0 | BIANCA | He practically proposed when he found out we had the same dermatologist. I mean. Dr. Bonchowski is great an all, but he's not exactly relevant party conversation. | 10 things i hate about you | 1999
L596 | 28 | 1 | m0 | u3 | CHASTITY | Is he oily or dry? | 10 things i hate about you | 1999
L597 | 28 | 2 | m0 | u0 | BIANCA | Combination. I don't know -- I thought he'd be different. More of a gentleman... | 10 things i hate about you | 1999
L598 | 29 | 0 | m0 | u3 | CHASTITY | Bianca, I don't think the highlights of dating Joey Dorsey are going to include door-opening and coat-holding. | 10 things i hate about you | 1999
L599 | 29 | 1 | m0 | u0 | BIANCA | Sometimes I wonder if the guys we're supposed to want to go out with are the ones we actually want to go out with, you know? | 10 things i hate about you | 1999
L600 | 29 | 2 | m0 | u3 | CHASTITY | All I know is -- I'd give up my private line to go out with a guy like Joey. | 10 things i hate about you | 1999
L659 | 30 | 0 | m0 | u0 | BIANCA | I have to be home in twenty minutes. | 10 things i hate about you | 1999
L660 | 30 | 1 | m0 | u3 | CHASTITY | I don't have to be home 'til two. | 10 things i hate about you | 1999
L952 | 31 | 0 | m0 | u3 | CHASTITY | You think you ' re the only sophomore at the prom? | 10 things i hate about you | 1999
L953 | 31 | 1 | m0 | u0 | BIANCA | I did. | 10 things i hate about you | 1999
L394 | 32 | 0 | m0 | u4 | JOEY | It's more | 10 things i hate about you | 1999
L395 | 32 | 1 | m0 | u0 | BIANCA | Expensive? | 10 things i hate about you | 1999
L396 | 33 | 0 | m0 | u4 | JOEY | Exactly So, you going to Bogey Lowenbrau's thing on Saturday? | 10 things i hate about you | 1999
L397 | 33 | 1 | m0 | u0 | BIANCA | Hopefully. | 10 things i hate about you | 1999
L589 | 34 | 0 | m0 | u4 | JOEY | So yeah, I've got the Sears catalog thing going -- and the tube sock gig " that's gonna be huge. And then I'm up for an ad for Queen Harry next week. | 10 things i hate about you | 1999
L590 | 34 | 1 | m0 | u0 | BIANCA | Queen Harry? | 10 things i hate about you | 1999
L591 | 34 | 2 | m0 | u4 | JOEY | It's a gay cruise line, but I'll be, like, wearing a uniform and stuff. | 10 things i hate about you | 1999
L592 | 35 | 0 | m0 | u0 | BIANCA | Neat... | 10 things i hate about you | 1999
L593 | 35 | 1 | m0 | u4 | JOEY | My agent says I've got a good shot at being the Prada guy next year. | 10 things i hate about you | 1999
L756 | 36 | 0 | m0 | u4 | JOEY | Hey, sweet cheeks. | 10 things i hate about you | 1999
L757 | 36 | 1 | m0 | u0 | BIANCA | Hi, Joey. | 10 things i hate about you | 1999
L758 | 36 | 2 | m0 | u4 | JOEY | You're concentrating awfully hard considering it's gym class. | 10 things i hate about you | 1999
L759 | 37 | 0 | m0 | u4 | JOEY | Listen, I want to talk to you about the prom. | 10 things i hate about you | 1999

Showing the first 1000 rows.

Let's amalgamate the texts uttered in the same conversation.

By doing this we lose all the information about the order of the utterances.

But this is fine as we are going to do LDA with just the first-order information of words uttered in each conversation by anyone involved in the dialogue.
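
To make the point about order concrete, here is a small plain-Scala sketch showing that two amalgamations of the same utterances in different order give identical word counts. The tokenizer is a crude assumption used only for illustration, not the tokenization applied later in this notebook.

```scala
// Crude tokenizer (an assumption for illustration): lowercase, split on non-word characters.
def tokenCounts(text: String): Map[String, Int] =
  text.toLowerCase
    .split("""\W+""")
    .filter(_.nonEmpty)
    .groupBy(identity)
    .map { case (word, occurrences) => word -> occurrences.length }

// The same two utterances joined in opposite orders.
val forward  = tokenCounts("You're wrong, George. :-()-: I'm not wrong.")
val reversed = tokenCounts("I'm not wrong. :-()-: You're wrong, George.")

println(forward)
println(forward == reversed)
```

Since a bag-of-words model sees only such counts per document, the order in which the utterances are concatenated does not affect it.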

import org.apache.spark.sql.functions.{collect_list, udf, lit, concat_ws}

val corpusDF = convLines.groupBy($"conversationID",$"movieID")
  .agg(concat_ws(" :-()-: ",collect_list($"text")).alias("corpus"))
  .join(moviesMetaData, "movieID") // and join it to get movie meta data
  .select($"conversationID".as("id"),$"corpus",$"movieTitle",$"movieYear")
  .cache()
import org.apache.spark.sql.functions.{collect_list, udf, lit, concat_ws}
corpusDF: org.apache.spark.sql.DataFrame = [id: bigint, corpus: string, movieTitle: string, movieYear: string]
corpusDF.count()
res18: Long = 83097
corpusDF.take(5).foreach(println)
[17668,This would be funny - if it wasn't so pathetic. Why, she isn't a day over twenty! :-()-: You're wrong, George. :-()-: I'm not wrong. She told me so. Besides, she wouldn't have to tell me. I'd know anyway. I found out a lot of things last night. I'm not ashamed of it either. It's probably one of the few decent things that's ever happened in this hellish place.,lost horizon,1937]
[17598,Cave, eh? Where? :-()-: Over by that hill.,lost horizon,1937]
[17663,Something grand and beautiful, George. Something I've been searching for all my life. The answer to the confusion and bewilderment of a lifetime. I've found it, George, and I can't leave it. You mustn't either. :-()-: I don't know what you're talking about. You're carrying around a secret that seems to be eating you up. If you'll only tell me about it. :-()-: I will, George. I want to tell you. I'll burst with it if I don't. It's weird and fantastical and sometimes unbelievable, but so beautiful! Well, as you know, we were kidnapped and brought here . . .,lost horizon,1937]
[17593,You see? You get the idea? From this reservoir here I can pipe in the whole works. Oh, I'm going to get a great kick out of this. Of course it's just to keep my hand in, but with the equipment we have here, I can put a plumbing system in for the whole village down there. Can rig it up in no time. Do you realize those poor people are still going to the well for water? :-()-: It's unbelievable. :-()-: Think of it! In times like these. :-()-: Say, what about that gold deal? :-()-: Huh? :-()-: Gold. You were going to� :-()-: Oh - that! That can wait. Nobody's going to run off with it. Say, I've got to get busy. I want to show this whole layout to Chang. So long. Don't you take any wooden nickels. :-()-: All right.,lost horizon,1937]
[17658,Let me up! Let me up! :-()-: All right. Sorry, George.,lost horizon,1937]
display(corpusDF)
17668This would be funny - if it wasn't so pathetic. Why, she isn't a day over twenty! :-()-: You're wrong, George. :-()-: I'm not wrong. She told me so. Besides, she wouldn't have to tell me. I'd know anyway. I found out a lot of things last night. I'm not ashamed of it either. It's probably one of the few decent things that's ever happened in this hellish place.lost horizon1937
17598Cave, eh? Where? :-()-: Over by that hill.lost horizon1937
17663Something grand and beautiful, George. Something I've been searching for all my life. The answer to the confusion and bewilderment of a lifetime. I've found it, George, and I can't leave it. You mustn't either. :-()-: I don't know what you're talking about. You're carrying around a secret that seems to be eating you up. If you'll only tell me about it. :-()-: I will, George. I want to tell you. I'll burst with it if I don't. It's weird and fantastical and sometimes unbelievable, but so beautiful! Well, as you know, we were kidnapped and brought here . . .lost horizon1937
17593You see? You get the idea? From this reservoir here I can pipe in the whole works. Oh, I'm going to get a great kick out of this. Of course it's just to keep my hand in, but with the equipment we have here, I can put a plumbing system in for the whole village down there. Can rig it up in no time. Do you realize those poor people are still going to the well for water? :-()-: It's unbelievable. :-()-: Think of it! In times like these. :-()-: Say, what about that gold deal? :-()-: Huh? :-()-: Gold. You were going to� :-()-: Oh - that! That can wait. Nobody's going to run off with it. Say, I've got to get busy. I want to show this whole layout to Chang. So long. Don't you take any wooden nickels. :-()-: All right.lost horizon1937
17658Let me up! Let me up! :-()-: All right. Sorry, George.lost horizon1937
17610That would suit me perfectly. I'm always broke. How did you pay for them? :-()-: Our Valley is very rich in a metal called gold, which fortunately for us is valued very highly in the outside world. So we merely . . . :-()-: �buy and sell? :-()-: Buy and - sell? No, no, pardon me, exchange . :-()-: I see. Gold for ideas. You know Mr. Chang, there's something so simple and naive about all of this that I suspect there has been a shrewd, guiding intelligence somewhere. Whose idea was it? How did it all start? :-()-: That, my dear Conway, is the story of a remarkable man. :-()-: Who? :-()-: A Belgian priest by the name of Father Perrault, the first European to find this place, and a very great man indeed. He is responsible for everything you see here. He built Shangri-La, taught our natives, and began our collection of art. In fact, Shangri-La is Father Perrault. :-()-: When was all this? :-()-: Oh, let me see - way back in 1713, I think it was, that Father Perrault stumbled into the Valley...lost horizon1937
17648What are these people? :-()-: I don't know. I can't get the dialect.lost horizon1937
17562I didn't care for 'sister' last night, and I don't like 'Lovey' this morning. My name is Lovett - Alexander, P. :-()-: I see. :-()-: I see. :-()-: Well, it's a good morning, anyway. :-()-: I'm never conversational before I coffee.lost horizon1937
17681Huh? I give it up. But this not knowing where you're going is exciting anyway. :-()-: Well, Mr. Conway, for a man who is supposed to be a leader, your do- nothing attitude is very disappointing.lost horizon1937
17573How about you Lovey? Come on. Let's you and I play a game of honeymoon bridge. :-()-: I'm thinking. :-()-: Thinking? What about some double solitaire? :-()-: As a matter of fact, I'm very good at double solitaire. :-()-: No kidding? :-()-: Yes. :-()-: Then I'm your man. Come on, Toots.lost horizon1937
17638The power house - they've blown it up! The planes can't land without lights. :-()-: Come on! We'll burn the hangar. That will make light for them!lost horizon1937
17622In that event, we better make arrangements to get some porters immediately. Some means to get us back to civilization. :-()-: Are you so certain you are away from it? :-()-: As far away as I ever want to be. :-()-: Oh, dear.lost horizon1937
17660For heaven's sake, Bob, what's the matter with you? You went out there for the purpose of� :-()-: George. George - do you mind? I'm sorry, but I can't talk about it tonight.lost horizon1937
17633That Conway seemed to belong here. In fact, it was suggested that someone be sent to bring him here. :-()-: That I be brought here? Who had that brilliant idea? :-()-: Sondra Bizet. :-()-: Oh, the girl at the piano? :-()-: Yes. She has read your books and has a profound admiration for you, as have we all. :-()-: Of course I have suspected that our being here is no accident. Furthermore, I have a feeling that we're never supposed to leave. But that, for the moment, doesn't concern me greatly. I'll meet that when it comes. What particularly interests me at present is, why was I brought here? What possible use can I be to an already thriving community? :-()-: We need men like you here, to be sure that our community will continue to thrive. In return for which, Shangri-La has much to give you. You are still, by the world's standards, a youngish man. Yet in the normal course of existence, you can expect twenty or thirty years of gradually diminishing activity. Here, however, in Shangri- La,...lost horizon1937
17628Are you taking me? :-()-: Yes, of course. Certainly. Come on!lost horizon1937
17601And mine's Conway. :-()-: How do you do? :-()-: You've no idea, sir, how unexpected and very welcome you are. My friends and I - and the lady in the plane - left Baskul night before last for Shanghai, but we suddenly found ourselves traveling in the opposite direction�lost horizon1937
17650What is it? Has he fainted? :-()-: It looks like it. Smell those fumes?lost horizon1937
17634Yes, of course, your brother is a problem. It was to be expected. :-()-: I knew you'd understand. That's why I came to you for help. :-()-: You must not look to me for help. Your brother is no longer my problem. He is now your problem, Conway. :-()-: Mine? :-()-: Because, my son, I am placing in your hands the future and destiny of Shangri-La. For I am going to die.lost horizon1937
17580Hey Lovey, come here! Lovey, I asked for a glass of wine and look what I got. Come on, sit down. :-()-: So that's where you are. I might of known it. No wonder you couldn't hear me. :-()-: You were asked to have a glass of wine. Sit down! :-()-: And be poisoned out here in the open? :-()-: Certainly not!lost horizon1937
17699Why, he's speaking English. :-()-: English!lost horizon1937
17645Don't worry, George. Nothing's going to happen. I'll fall right into line. I'll be the good little boy that everybody wants me to be. I'll be the best little Foreign Secretary we ever had, just because I haven't the nerve to be anything else. :-()-: Do try to sleep, Bob. :-()-: Huh? Oh, sure, Freshie. Good thing, sleep.lost horizon1937
17564He might have lost his way. :-()-: Of course. That's what I told them last night. You can't expect a man to sail around in the dark.[5] During this George has been looking around - he rises.lost horizon1937
17570Yeah? If this be execution, lead me to it. :-()-: That's what they do with cattle just before the slaughter. Fatten them. :-()-: Uh-huh. You're a scream, Lovey. :-()-: Please don't call me Lovey.lost horizon1937
17646Oh, stop it! :-()-: The bloke up there looks a Chinese, or a Mongolian, or something.lost horizon1937
17587It's better than freezing to death down below, isn't it? :-()-: I'll say.lost horizon1937
17652What is it? :-()-: See that spot? :-()-: Yes. :-()-: That's where we were this morning. He had it marked. Right on the border of Tibet. Here's where civilization ends. We must be a thousand miles beyond it - just a blank on the map. :-()-: What's it mean? :-()-: It means we're in unexplored country - country nobody ever reached.lost horizon1937
17582�then the bears came right into the bedroom and the little baby bear said, "Oh, somebody's been sleeping in my bed." And then the mama bear said, "Oh dear, somebody's been sleeping in my bed!" And then the big papa bear, he roared, "And somebody's been sleeping in my bed!" Well, you have to admit the poor little bears were in a quandary! :-()-: I'm going to sleep in my bed. Come on, Lovey! :-()-: They were in a quandary, and� :-()-: Come on, Lovey. :-()-: Why? Why 'come on' all the time? What's the matter? Are you going to be a fuss budget all your life? Here, drink it up! Aren't you having any fun? Where was I? :-()-: In a quandary.lost horizon1937
17647George, what are you going to do? :-()-: I'm going to drag him out and force him to tell us what his game is.lost horizon1937
17577Yes. :-()-: Sounds like a stall to me.lost horizon1937
17642Just what I needed too. :-()-: You? :-()-: Just this once, Bob. I feel like celebrating. Just think of it, Bob - a cruiser sent to Shanghai just to take you back to England. You know what it means. Here you are. Don't bother about those cables now. I want you to drink with me. Gentlemen, I give you Robert Conway - England's new Foreign Secretary.lost horizon1937
17605It's three thousand feet, practically straight down to the floor of the valley. The Valley of the Blue Moon, as we call it. There are over two thousand people in the Valley besides those here in Shangri-La. :-()-: Who and what is Shangri-La? You? :-()-: Goodness, no! :-()-: So there are others? :-()-: Oh, yes. :-()-: Who, for instance? :-()-: In time you will meet them all.lost horizon1937
17551Conway's gone again! Run out! Listen to this! From Gainsford. :-()-: Let me have it. "Aboard the S.S. Manchuria. Last night Conway seemed to recover his memory. Kept talking about Shangri- La, telling a fantastic story about a place in Tibet. Insisted upon returning there at once. Locked him in room but he escaped us and jumped ship during night at Singapore. Am leaving ship myself to overtake him, as fearful of his condition. Wrote down details of Conway's story about Shangri-La which I am forwarding. Lord Gainsford."lost horizon1937
17697What do you want him to do? :-()-: I don't know. I'm a paleontologist, not a Foreign Secretary.lost horizon1937
17670She was kidnapped and brought here two years ago just as we were, Bob. :-()-: I don't believe it! I can't believe it. She's lying. You're lying. You're lying! Every word you've been saying is a lie! Come on, say it!lost horizon1937
17643Hurray! :-()-: How I'm going to bask in reflected glory! People are going to point to me and say, "There goes George Conway - brother of the Foreign Secretary." :-()-: Don't talk nonsense. Give me the bottle.lost horizon1937
17616Please calm yourself. You'll soon be well if you do. :-()-: I don't need any advice from you! Get me a doctor! :-()-: I'm sorry, but we have no doctors here. :-()-: No doctors? That's fine. That's just fine. :-()-: Please let me help you. :-()-: Sure, you can help me! You can help me jump over that cliff! I've been looking and looking at the bottom of that mountain, but I haven't got the nerve to jump! :-()-: You shouldn't be looking at the bottom of the mountain. Why don't you try looking up at the top sometimes? :-()-: Don't preach that cheap, second- hand stuff to me! Go on, beat it. Beat it!lost horizon1937
17589Hey, hurry up, you slow-pokes - I'm starved! :-()-: Please! Please! Do not wait for me! I eat so very little.lost horizon1937
17600There you are! Barnard, you'd better get your things together. We're leaving. :-()-: Leaving? :-()-: Yes. I've just been talking with the porters. They're going to take us. We've got clothing, food, everything. Come on! :-()-: When are you going to start? :-()-: Right this very minute! The porters are waiting for us on the plateau. And that Chinaman thought he could stop me. Come along. :-()-: I think I'll stick around. I'll leave with the porters on their next trip. :-()-: You mean you don't want to go? :-()-: Well - I'm� :-()-: I see. You're afraid of going to jail, eh? :-()-: Well, no. You see, I got this plumbing business� :-()-: All right! If you insist on being an idiot, I'm not going to waste time coaxing you. How about you? :-()-: Oh, no - you don't want to go yet, honey. She'll stick around too. Is that right?lost horizon1937
17687I really only brought you here to show you my pigeons! :-()-: Don't worry about the pigeons. From now on, you can put flutes on my tail and bells on my feet!lost horizon1937
17655Well - I heard that if you want a man's wife, she's yours, if he's got any manners. :-()-: Nothing about the porters yet? :-()-: Porters? :-()-: Good heavens, Bob, we've been here two weeks and we haven't found out a thing. :-()-: Well, we haven't been murdered in our beds yet, George, have we? :-()-: I'm afraid the porters are just a myth. I guess we never will know why we're here, or how long we're going to be held prisoners. :-()-: Shhh!lost horizon1937
17693Would you like to wring my little neck? :-()-: I'd love it! :-()-: Why?lost horizon1937
17614You must prevail upon him not to attempt the journey. He could never get through that country alive. :-()-: I can't let him go alone. It's suicide!lost horizon1937
17679I guess we're in for it. :-()-: In for what? :-()-: I don't know. He must have had some purpose in taking the plane away from Fenner. When he lands, we'll find out. :-()-: You mean to tell me you're not going to do anything until we land? :-()-: What do you suggest? :-()-: Why, you - you� Look here - he may dash us to pieces! :-()-: It might afford you a great deal of relief. Now gentlemen, I'm going back to sleep. Oh, and I was having such a peaceful dream. As soon as he lands, let me know.lost horizon1937
17609But Mr. Chang, all these things - books, instruments, sculpture - do you mean to say they were all brought in over those mountains by porters? :-()-: They were. :-()-: Well, it must have taken� :-()-: Centuries. :-()-: Centuries! Where did you get the money to pay for all those treasures? :-()-: Of course we have no money as you know it. We do not buy or sell or seek personal fortunes because, well, because there is no uncertain future here for which to accumulate it.lost horizon1937
17674Target practice again! One of these days they're going to hit us. :-()-: As long as they keep on aiming at us, we're safe. Come now, child.lost horizon1937
17604And the wine - excellent. :-()-: I'm glad you like it. It's made right here in the valley.lost horizon1937
17669So everyone is serenely happy in Shangri-La? Nobody would ever think of leaving? It's all just so much rot! She's pleaded with me ever since I came here to take her away from this awful place. She's cried in my arms for hours, for fear I'd leave her behind. And what's more, she's made two trips to the plateau to bribe the porters - for me! :-()-: I don't believe it! I don't believe a word of it! :-()-: All right. I'll prove it to you! You believe everything they've told you - without proof! I'll prove my story!lost horizon1937
17588That's what I say. What do you say to a rubber of bridge? I saw some cards in the other room. :-()-: Not for me, thanks. No, I'm too weary.lost horizon1937
17653Hello, George. Cigarette? :-()-: Thanks. I suppose all this comes under the heading of adventure. :-()-: We've had plenty of it the last few days. :-()-: It's far from over, from what I can see. This place gives me the creeps, hidden away like this - no contact with civilization. Bob, you don't seem concerned at all. :-()-: Oh, I'm feeling far too peaceful to be concerned about anything. I think I'm going to like it here. :-()-: You talk as though you intend on staying. :-()-: Something happened to me, when we arrived here, George, that - well - did you ever go to a totally strange place, and feel certain that you've been there before? :-()-: What are you talking about? :-()-: I don't know. :-()-: You're a strange bird. No wonder Gainsford calls you the man who always wanted to see what was on the other side of the hill.lost horizon1937
17556Two years!? :-()-: Yes.lost horizon1937
17702How about you? Do you want to go? :-()-: Go? Where? :-()-: Home. Away from here. I've got porters to take us back. :-()-: Oh, my dear boy, I'm sorry. That's impossible. Why, I have my classes all started. :-()-: I don't care what you've got started. Do you want to go? :-()-: Well - no - I think I'd better wait. Yes, yes. I will. I'll wait. :-()-: You'll wait till you rot! :-()-: Yes. Barney!lost horizon1937
17675Where did you come from? :-()-: I'm Alexander P. Lovett, sir. :-()-: Why aren't you registered through our office?lost horizon1937
17621We were just going to bury him when you came along. :-()-: Pardon me�lost horizon1937
17686As a matter of fact, all I saw was a little boy whistling in the dark. :-()-: A little boy whistling in the dark!? Do you realize that there is a British cruiser waiting at Shanghai, smoke pouring out of its funnels, tugging at its moorings, waiting to take Mr. Conway back to London? Do you know that at this minute there are headlines shrieking all over the world the news that Conway is missing? Does that look like a man whose life is empty? :-()-: Yes. :-()-: You're absolutely right. And I had to come all the way to a pigeon house in Shangri-La to find the only other person in the world who knew it. May I congratulate you?lost horizon1937
17665Look here, Bob, Ever since I can remember, you've looked after me. Now I think you're the one that needs looking after. I'm your brother, Bob. If there's something wrong with you, let me help you. :-()-: Oh, George . . . :-()-: Besides, I - I don't feel like making that trip alone, Bob. :-()-: George, you couldn't possibly stay here, could you? :-()-: I'd go mad! :-()-: George, I may be wrong, I may be a maniac. But I believe in this, and I'm not going to lose it. You know how much I want to help you, but this is bigger, stronger if you like than brotherly love. I'm sorry, George. I'm staying. :-()-: Well, I can't think of anything more to say. Goodbye, Bob.lost horizon1937
17595Hey, what's happened to you? :-()-: Nothing. Why? :-()-: Why, you look beautiful.lost horizon1937
17606For a man who talks a great deal, it's amazing how unenlightening you can be. :-()-: There are some things, my dear Conway, I deeply regret I may not discuss. :-()-: You know, that's the fourth time you've said that today. You should have a record made of it. :-()-: Shall we go inside? I should so like to show you some of our rare treasures.lost horizon1937
17579Yeah. You know. :-()-: Horns? What kind of horns?lost horizon1937
17698What is it? :-()-: Mountain grass. It's good, too. Here, have some. I've read of people lasting thirty days on this stuff.lost horizon1937
17671You say the porters are waiting for us? :-()-: Yes. :-()-: The clothes? :-()-: Yes, everything! :-()-: What about the others? :-()-: I've already asked them. They're afraid to make the trip. We'll have to send an expedition back after them. :-()-: Come on! We're wasting time!lost horizon1937
17644Hello, Freshie. Did you make that report out yet? :-()-: Yes, Bob. :-()-: Did you say we saved ninety white people? :-()-: Yes. :-()-: Hurray for us. Did you say that we left ten thousand natives down there to be annihilated? No, you wouldn't say that. They don't count. :-()-: You'd better try to get some sleep, Bob. :-()-: Just you wait until I'm Foreign Secretary. Can't you just see me, Freshie, with all those other shrewd, little Foreign Secretaries? You see, the trick is to see who can out-talk the other. Everybody wants something for nothing, and if you can't get it with smooth talk, you send an army in. I'm going to fool them, Freshie. I'm not going to have an army. I'm going to disband mine. I'm going to sink my battleships - I'm going to destroy every piece of warcraft. Then when the enemy approaches we'll say, "Come in, gentlemen - what can we do for you?" So then the poor enemy soldiers will stop and think. And what will they think, Freshie? They'll think to themselves - "S...lost horizon1937
17590Yes. Unbosom yourself, Mr. Hyde.[11] :-()-: All right, I will! I'll let my hair down! Why not? It can't make any real difference now. Hey Lovey, were you ever chased by the police?lost horizon1937
17666George, are you sure of the porters? About their taking care of you, I mean? :-()-: Oh yes. It's all set. Maria made the arrangements. :-()-: Maria? :-()-: Yes, the little Russian girl. :-()-: What's she got to do with it? :-()-: She's going with me.lost horizon1937
17704You promised to come for tea yesterday. I waited for so long. :-()-: I'm sorry. I haven't even got any cigarettes left! :-()-: I'll make some for you! You will come today? :-()-: Perhaps. :-()-: Please say you will. The days are so very long and lonely without you. Please . . . :-()-: All right, I'll be there. :-()-: Thank you. :-()-: You'll tell me some of the things I want to know, won't you? You'll tell me who runs this place. And why we were kidnapped. And what they're going to do with us!lost horizon1937
17569Something tells me this means food. Come on! :-()-: I just feel as though I'm being made ready for the executioner.lost horizon1937
17607By the way, what religion do you follow here? :-()-: We follow many.lost horizon1937
17591Chalmers Bryant! :-()-: Bryant's Utilities - that's me.lost horizon1937
17683Oh, please. I hope you're not going to run away this time. :-()-: My name's Sondra. :-()-: hope you'll forgive me for�lost horizon1937
17651He's dead. :-()-: Dead?lost horizon1937
17619You know, it's very, very strange, but when you saw me in the corridor, I was actually on my way to you. I bring the most amazing news. The High Lama wishes to see you, Mr. Conway. :-()-: The High Lama! Who in blazes is he?!lost horizon1937
17684You know, each time I see you, I hear that music. What is it? :-()-: Oh, you mean my pigeons.lost horizon1937
17549Cable from Gainsford. :-()-: Oh, read it! :-()-: "Leaving today for London with Conway aboard S.S. Manchuria. Conway can tell nothing of his experiences. Is suffering from complete loss of memory. Signed, Gainsford."lost horizon1937
17558I have here a discovery that will startle the world. It's the vertebrae from the lumbar of a Megatherium,[4] found in Asia. :-()-: Well, what do you know about that! :-()-: Found in Asia! :-()-: Uh-huh. :-()-: When I get home I shall probably be knighted for it. :-()-: Knighted! You don't say. Do you mind if I take a look at it? :-()-: Not at all.lost horizon1937
17677No. That's not possible! If we had landed, we all would have been awakened. :-()-: Of course. We never left the air. I know - I didn't sleep the whole night long. :-()-: That fellow got on at Baskul. :-()-: What's he doing? Where's he taking us? He may be a maniac for all we know.lost horizon1937
17596Look, honey. We run the pipes through here, and we connect with the main water line here. :-()-: Pipes? Where are you going to get pipes? :-()-: Oh, that's a cinch. I'll show them how to cast pipes out of clay.lost horizon1937
17672Lucky thing for me you snapped out of it, too. You saved my life. I never could have made it alone. :-()-: What was that? :-()-: was saying� :-()-: Can't you shut up? Must you go on babbling like an idiot?lost horizon1937
17618What exactly do you mean by "almost any time now"? :-()-: Well, we've been expecting this particular shipment for the past two years.lost horizon1937
17667You can't take her away from here! :-()-: Why not? :-()-: Because you can't. Do you know what will happen to her if she leaves Shangri-La? She's a fragile thing that can only live where fragile things are loved. Take her out of this valley and she'll fade away like an echo. :-()-: What do you mean - "fade away like an echo"? :-()-: She came here in 1888!lost horizon1937
17678Good. :-()-: What if he refuses? :-()-: We'll smash his face in. That's what we'll do. :-()-: Brilliant! Can anyone here fly a plane?lost horizon1937
17624The High Lama is the only one from whom any information can come. :-()-: Don't believe him, Bob. He's just trying to get out.lost horizon1937
17608To put it simply, I should say that our general belief was in moderation. We preach the virtue of avoiding excesses of every kind, even including� �the excess of virtue itself. :-()-: That's intelligent. :-()-: We find, in the Valley, it makes for better happiness among the natives. We rule with moderate strictness and in return we are satisfied with moderate obedience. As a result, our people are moderately honest and moderately chaste and somewhat more than moderately happy. :-()-: How about law and order? You have no soldiers or police? :-()-: Oh, good heavens, no! :-()-: How do you deal with incorrigibles? Criminals? :-()-: Why, we have no crime here. What makes a criminal? Lack, usually. Avariciousness, envy, the desire to possess something owned by another. There can be no crime where there is a sufficiency of everything. :-()-: You have no disputes over women? :-()-: Only very rarely. You see, it would not be considered good manners to take a woman that another man wanted. :-(...lost horizon1937
17581There you are! :-()-: This doesn't obligate me in any way.lost horizon1937
17673Bob, can't you get them to wait for us? They're leaving us farther behind every day. :-()-: There's nothing that would suit them better than to lose us, but we must go on. Come on.lost horizon1937
17576That's too bad. I got a half million shares. My whole foundation! And now look at me! :-()-: colossal nerve you have sitting there and talking about it so calmly - you, the swindler of thousands of people� :-()-: You know, that's what makes the whole thing so funny. A guy like me starts out in life as a plumber - an ordinary, everyday, slew-footed plumber - and by the use of a little brains, mind you, he builds up a gigantic institution, employs thousands of people, becomes a great civic leader. And then the crash comes - and overnight he's the biggest crook the country ever had. :-()-: You are a thief, sir, and a swindler, and I, for one, will be only too glad to turn you over to the police when we get back.lost horizon1937
17641The next time you're in wild country like this, keep in touch with the British Consul. :-()-: Aha - very good, Freshie.[3] Very good. You'd better put his name on the list and make out a report later.lost horizon1937
17571Some layout they got here. Did you get a load of the rooms? You couldn't do better at the Ritz. :-()-: All the conveniences for the condemned, if you ask me.lost horizon1937
17636Bob! I think I hear motors! :-()-: Colonel, wait a minute, they may be here now! Say George, get down on that field and guide those planes in when they get here. :-()-: Yes.lost horizon1937
17701That's what I mean - mysterious. Mr. Conway, I don't like that man. He's too vague. :-()-: We didn't get much information out of him, did we Bob?lost horizon1937
17566Left us here to rot. That's what they've done. Heroes of the newspapers! :-()-: All right, all right. Keep quiet.lost horizon1937
17631I trust you have been comfortable at Shangri-La, since your arrival. :-()-: Personally, I've enjoyed your community very much. But my friends do not care for this mystery. They are determined to leave as soon as�lost horizon1937
17696Couldn't you arrange to make a little less noise? :-()-: I tell you, we're going west, and Shanghai is east of here! :-()-: Be quiet! Fenner's the best pilot in China. He knows what he's doing. :-()-: It's Fenner.lost horizon1937
17567Where are they? Do you see them? :-()-: Yes! :-()-: Do you think they're cannibals?lost horizon1937
17659What about the porters? :-()-: Porters? :-()-: Didn't you find out anything about the porters? :-()-: Why - I'm sorry - but I�lost horizon1937
17692Ouch! :-()-: You see, it's not a dream. :-()-: You know, sometimes I think that it's the other that's the dream. The outside world. Have you never wanted to go there? :-()-: Goodness, no. From what you tell me about it, it certainly doesn't sound very attractive. :-()-: It's not so bad, really. Some phases are a little sordid, of course. That's only to be expected. :-()-: Why? :-()-: Oh, the usual reasons. A world full of people struggling for existence. :-()-: Struggling, why? :-()-: Well, everybody naturally wants to make a place for himself, accumulate a nest egg, and so on. :-()-: Why? :-()-: You know, if you keep on asking that, we're not going to get anywhere. And don't ask me why. :-()-: I was just going to. :-()-: It's the most annoying word in the English language. Did you ever hear a child torture his parent with it? Mother's little darling musn't stick her fingers in the salad bowl. Why? Because it isn't lady- like to do that. Why? Because that's what forks are made for, da...lost horizon1937
17676Where were you hiding? :-()-: Hiding? Oh, no. Hunting - I was in the interior - hunting fossils. This morning I looked up suddenly� :-()-: I know - and a war broke out right over your head.lost horizon1937
17649Oh George, come on. :-()-: It's not knowing that's so awful, Bob. Not knowing where you're going, or why, or what's waiting when you get there.lost horizon1937
17552Where's the girl? Miss Stone. :-()-: She's remaining in her room. She isn't feeling very well. Now please go on without me. I eat very little.lost horizon1937
17617Of course, the porters will be very well paid - that is, within reason. :-()-: I'm afraid that wouldn't help. You see, we have no porters here. :-()-: No porters here!! :-()-: No.lost horizon1937
17682At the mercy of a mad pilot. :-()-: We'd be eternally grateful if you�lost horizon1937
17639All right, go ahead! We go on to the next plane. Bring out any people that are left. :-()-: Right, Bob.lost horizon1937

Feature extraction and transformation APIs

We will use Spark's convenient feature extraction and transformation APIs.

Step 3. Text Tokenization

We will use the RegexTokenizer to split each document into tokens. We can call setMinTokenLength() here to set a minimum token length and filter out all tokens shorter than it. See: