%md
# Topic Modeling of Movie Dialogs with Latent Dirichlet Allocation
### Let us cluster the conversations from different movies!
This notebook will provide a brief algorithm summary, links for further reading, and an example of how to use LDA for Topic Modeling.
**Note: not yet tested in Spark 2.2+ or 3.x (see notebook 034 for syntactic issues, if any).**
%md
## Algorithm Summary
- **Task**: Identify topics from a collection of text documents
- **Input**: Vectors of word counts
- **Optimizers**:
  - EMLDAOptimizer using [Expectation Maximization](https://en.wikipedia.org/wiki/Expectation%E2%80%93maximization_algorithm)
  - OnlineLDAOptimizer using Iterative Mini-Batch Sampling for [Online Variational Bayes](https://www.cs.princeton.edu/~blei/papers/HoffmanBleiBach2010b.pdf)
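
To make the **Input** item concrete, here is a minimal plain-Scala sketch of turning tokenized documents into vectors of word counts. It is only an illustration: the object name `WordCounts` and its helper functions are hypothetical and are not part of Spark's API or of this notebook's later steps.

```scala
object WordCounts {
  // Build a vocabulary: each distinct token gets a column index, in sorted order.
  def vocabulary(docs: Seq[Seq[String]]): Map[String, Int] =
    docs.flatten.distinct.sorted.zipWithIndex.toMap

  // Turn one tokenized document into a dense vector of word counts.
  // Tokens missing from the vocabulary are silently skipped.
  def countVector(doc: Seq[String], vocab: Map[String, Int]): Array[Int] = {
    val counts = Array.fill(vocab.size)(0)
    for (token <- doc; idx <- vocab.get(token)) counts(idx) += 1
    counts
  }
}
```

Each document becomes one row of counts over a shared vocabulary. Word order within a document is discarded, which is the bag-of-words assumption LDA relies on.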
%md
## Links
- Spark API docs
  - Scala: [LDA](https://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.mllib.clustering.LDA)
  - Python: [LDA](https://spark.apache.org/docs/latest/api/python/pyspark.mllib.html#pyspark.mllib.clustering.LDA)
- [MLlib Programming Guide](http://spark.apache.org/docs/latest/mllib-clustering.html#latent-dirichlet-allocation-lda)
- [ML Feature Extractors & Transformers](http://spark.apache.org/docs/latest/ml-features.html)
- [Wikipedia: Latent Dirichlet Allocation](https://en.wikipedia.org/wiki/Latent_Dirichlet_allocation)
%md
## Readings for LDA
* A high-level introduction to the topic from Communications of the ACM:
  * [http://www.cs.columbia.edu/~blei/papers/Blei2012.pdf](http://www.cs.columbia.edu/~blei/papers/Blei2012.pdf)
* A very good high-level humanities introduction to the topic (recommended by Chris Thomson in the English Department at UC, Ilam):
  * [http://journalofdigitalhumanities.org/2-1/topic-modeling-and-digital-humanities-by-david-m-blei/](http://journalofdigitalhumanities.org/2-1/topic-modeling-and-digital-humanities-by-david-m-blei/)
Also read the methodological and more formal papers cited in the above links if you want to know more.
%md
Let's get a bird's eye view of LDA from http://www.cs.columbia.edu/~blei/papers/Blei2012.pdf next.
* See pictures (hopefully you read the paper last night!)
* Algorithm of the generative model (this is unsupervised clustering)
* For a careful introduction to the topic see Sections 27.3 and 27.4 (pages 950-970) of Murphy's *Machine Learning: A Probabilistic Perspective, MIT Press, 2012*.
* We will be quite application-focused here!
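
The generative model mentioned above can be sketched in a few lines of plain Scala. This is a simplified illustration, not Spark's implementation. In full LDA the per-document topic proportions and the per-topic word distributions are themselves drawn from Dirichlet priors; here they are assumed to be given, and the names `LdaGenerativeSketch`, `sampleIndex` and `generateDocument` are hypothetical.

```scala
import scala.util.Random

object LdaGenerativeSketch {
  // Draw an index from a discrete distribution given as probabilities summing to 1.
  def sampleIndex(probs: Seq[Double], rng: Random): Int = {
    val u = rng.nextDouble()
    val cumulative = probs.scanLeft(0.0)(_ + _).tail
    val idx = cumulative.indexWhere(u < _)
    if (idx >= 0) idx else probs.length - 1 // guard against rounding error
  }

  // Generate one document: for every word position, first pick a topic from the
  // document's topic proportions, then pick a word from that topic's distribution.
  def generateDocument(
      topicProportions: Seq[Double], // assumed given; drawn from a Dirichlet in full LDA
      topics: Seq[Seq[Double]],      // one distribution over the vocabulary per topic
      vocab: Seq[String],
      length: Int,
      rng: Random): Seq[String] =
    Seq.fill(length) {
      val topic = sampleIndex(topicProportions, rng)
      vocab(sampleIndex(topics(topic), rng))
    }
}
```

Fitting LDA runs this story in reverse: given only the observed words, it infers topic proportions and topic-word distributions that plausibly generated them.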
//This allows easy embedding of publicly available information into any other notebook
//when viewing in git-book just ignore this block - you may have to manually chase the URL in frameIt("URL").
//Example usage:
// displayHTML(frameIt("https://en.wikipedia.org/wiki/Latent_Dirichlet_allocation#Topics_in_LDA",250))
def frameIt( u:String, h:Int ) : String = {
"""<iframe
src=""""+ u+""""
width="95%" height="""" + h + """"
sandbox>
<p>
<a href="http://spark.apache.org/docs/latest/index.html">
Fallback link for browsers that, unlikely, don't support frames
</a>
</p>
</iframe>"""
}
displayHTML(frameIt("http://journalofdigitalhumanities.org/2-1/topic-modeling-and-digital-humanities-by-david-m-blei/",900))
displayHTML(frameIt("https://en.wikipedia.org/wiki/Latent_Dirichlet_allocation#Topics_in_LDA",250))
displayHTML(frameIt("https://en.wikipedia.org/wiki/Latent_Dirichlet_allocation#Model",600))
displayHTML(frameIt("https://en.wikipedia.org/wiki/Latent_Dirichlet_allocation#Mathematical_definition",910))
%md
## Probabilistic Topic Modeling Example
This is an outline of our Topic Modeling workflow. Feel free to jump to any subtopic to find out more.
- Step 0. Dataset Review
- Step 1. Downloading and Loading Data into DBFS
- (Step 1. only needs to be done once per shard - see details at the end of the notebook for Step 1.)
- Step 2. Loading the Data and Data Cleaning
- Step 3. Text Tokenization
- Step 4. Remove Stopwords
- Step 5. Vector of Token Counts
- Step 6. Create LDA model with Online Variational Bayes
- Step 7. Review Topics
- Step 8. Model Tuning - Refilter Stopwords
- Step 9. Create LDA model with Expectation Maximization
- Step 10. Visualize Results
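
As a rough preview of what Steps 3 and 4 do to a single utterance, here is a plain-Scala sketch of tokenization and stopword removal. It is an assumption-laden illustration: the notebook's own steps may use different tooling and rules, and the object `TextPrep` and the tiny stopword set in the usage note are hypothetical.

```scala
object TextPrep {
  // Step 3 (sketch): lowercase the text and split on runs of non-letter characters.
  def tokenize(text: String): Seq[String] =
    text.toLowerCase.split("[^a-z]+").filter(_.nonEmpty).toSeq

  // Step 4 (sketch): drop every token that appears in the stopword set.
  def removeStopwords(tokens: Seq[String], stopwords: Set[String]): Seq[String] =
    tokens.filterNot(stopwords.contains)
}
```

For example, `TextPrep.tokenize("They do not! They do to!")` yields the six lowercase tokens of that sentence, and filtering them with the stopword set `Set("they", "do", "to")` leaves only `not`. Note that this simple splitter also breaks contractions such as "don't" into two tokens.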
%md
## Step 0. Dataset Review
In this example, we will use the [Cornell Movie Dialogs Corpus](https://people.mpi-sws.org/~cristian/Cornell_Movie-Dialogs_Corpus.html).
Here is the `README.txt`:
***
***
Cornell Movie-Dialogs Corpus
Distributed together with:
"Chameleons in imagined conversations: A new approach to understanding coordination of linguistic style in dialogs"
Cristian Danescu-Niculescu-Mizil and Lillian Lee
Proceedings of the Workshop on Cognitive Modeling and Computational Linguistics, ACL 2011.
(this paper is included in this zip file)
NOTE: If you have results to report on these corpora, please send email to cristian@cs.cornell.edu or llee@cs.cornell.edu so we can add you to our list of people using this data. Thanks!
Contents of this README:
A) Brief description
B) Files description
C) Details on the collection procedure
D) Contact
A) Brief description:
This corpus contains a metadata-rich collection of fictional conversations extracted from raw movie scripts:
- 220,579 conversational exchanges between 10,292 pairs of movie characters
- involves 9,035 characters from 617 movies
- in total 304,713 utterances
- movie metadata included:
- genres
- release year
- IMDB rating
- number of IMDB votes
- IMDB rating
- character metadata included:
- gender (for 3,774 characters)
- position on movie credits (3,321 characters)
B) Files description:
In all files the field separator is " +++$+++ "
- movie_titles_metadata.txt
- contains information about each movie title
- fields:
- movieID,
- movie title,
- movie year,
- IMDB rating,
- no. IMDB votes,
- genres in the format ['genre1','genre2',...,'genreN']
- movie_characters_metadata.txt
- contains information about each movie character
- fields:
- characterID
- character name
- movieID
- movie title
- gender ("?" for unlabeled cases)
- position in credits ("?" for unlabeled cases)
- movie_lines.txt
- contains the actual text of each utterance
- fields:
- lineID
- characterID (who uttered this phrase)
- movieID
- character name
- text of the utterance
- movie_conversations.txt
- the structure of the conversations
- fields
- characterID of the first character involved in the conversation
- characterID of the second character involved in the conversation
- movieID of the movie in which the conversation occurred
- list of the utterances that make the conversation, in chronological
order: ['lineID1','lineID2',...,'lineIDN']
has to be matched with movie_lines.txt to reconstruct the actual content
- raw_script_urls.txt
- the urls from which the raw sources were retrieved
C) Details on the collection procedure:
We started from raw publicly available movie scripts (sources acknowledged in
raw_script_urls.txt). In order to collect the metadata necessary for this study
and to distinguish between two script versions of the same movie, we automatically
matched each script with an entry in movie database provided by IMDB (The Internet
Movie Database; data interfaces available at http://www.imdb.com/interfaces). Some
amount of manual correction was also involved. When more than one movie with the same
title was found in IMBD, the match was made with the most popular title
(the one that received most IMDB votes)
After discarding all movies that could not be matched or that had less than 5 IMDB
votes, we were left with 617 unique titles with metadata including genre, release
year, IMDB rating and no. of IMDB votes and cast distribution. We then identified
the pairs of characters that interact and separated their conversations automatically
using simple data processing heuristics. After discarding all pairs that exchanged
less than 5 conversational exchanges there were 10,292 left, exchanging 220,579
conversational exchanges (304,713 utterances). After automatically matching the names
of the 9,035 involved characters to the list of cast distribution, we used the
gender of each interpreting actor to infer the fictional gender of a subset of
3,321 movie characters (we raised the number of gendered 3,774 characters through
manual annotation). Similarly, we collected the end credit position of a subset
of 3,321 characters as a proxy for their status.
D) Contact:
Please email any questions to: cristian@cs.cornell.edu (Cristian Danescu-Niculescu-Mizil)
***
***
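
The README's field separator `" +++$+++ "` contains regular-expression metacharacters, so it has to be quoted before being passed to `String.split`. The following plain-Scala sketch shows one way to split a line and to unpack the bracketed list of line IDs. The object `CorpusParsing` is hypothetical, and the sample line in the usage note is made up to follow the format described in the README.

```scala
import java.util.regex.Pattern

object CorpusParsing {
  // The separator contains '+' and '$', which are regex metacharacters, so quote it.
  private val FieldSeparator = Pattern.quote(" +++$+++ ")

  // Split one corpus line into its fields.
  def fields(line: String): Array[String] = line.split(FieldSeparator)

  // Parse a list field such as "['L1','L2']" into Seq("L1", "L2").
  def lineIds(listField: String): Seq[String] =
    listField.split(",").map(_.replaceAll("[\\[\\]' ]", "")).filter(_.nonEmpty).toSeq
}
```

Applied to a made-up `movie_conversations.txt` line such as `u0 +++$+++ u2 +++$+++ m0 +++$+++ ['L1','L2','L3']`, `fields` returns four strings, and `lineIds` on the last one returns the three line IDs that must then be looked up in `movie_lines.txt`.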
%md
## Step 2. Loading the Data and Data Cleaning
We have already downloaded the corpus with the `wget` command and placed it in our distributed file system (this process takes about 1 minute). To repeat these steps, or to download data from another source, follow **Step 1. Downloading and Loading Data into DBFS** at the bottom of this notebook.
Let's make sure these files are in DBFS now:
// this is where the data resides in dbfs (see below to download it first, if you go to a new shard!)
display(dbutils.fs.ls("dbfs:/datasets/sds/nlp/cornell_movie_dialogs_corpus/"))
sc.textFile("dbfs:/datasets/sds/nlp/cornell_movie_dialogs_corpus/movie_conversations.txt").top(5).foreach(println)
// Load the text file and pair each line with its index
val conversationsRaw = sc.textFile("dbfs:/datasets/sds/nlp/cornell_movie_dialogs_corpus/movie_conversations.txt").zipWithIndex()
conversationsRaw.top(5).foreach(println) // the five largest (line, index) pairs under the default ordering; use take(5) for the first five
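%md
A note on `top` versus `take`: on an RDD, `top(n)` returns the `n` largest elements in descending order under the implicit ordering, whereas `take(n)` returns the first `n` elements. The plain-Scala sketch below mimics that difference on an ordinary `Seq`, without Spark. The helper `TopVersusTake.top` is a hypothetical stand-in, not the RDD method itself.

```scala
object TopVersusTake {
  // Plain-Scala stand-in for RDD.top(n): the n largest elements, in descending order.
  def top[A](xs: Seq[A], n: Int)(implicit ord: Ordering[A]): Seq[A] =
    xs.sorted(ord.reverse).take(n)
}
```

With `val indexed = Seq("b line", "c line", "a line").zipWithIndex`, the call `indexed.take(2)` keeps the pairs for "b line" and "c line" in file order, while `TopVersusTake.top(indexed, 2)` returns the pair for "c line" first and then "b line", because tuples are compared by their string component first.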
ScaDaMaLe Course site and book