006_WordCount(Scala)


ScaDaMaLe Course site and book

Word Count on US State of the Union (SoU) Addresses

  • Word Count in big data is the equivalent of Hello World in programming
  • We count the number of occurrences of each word in the first and the last (2016) SoU addresses.

Prerequisite: see DO NOW below. You should have loaded the data as instructed in scalable-data-science/xtraResources/sdsDatasets.

DO NOW (if not done already)

In your databricks community edition:

  1. In your WorkSpace create a Folder named scalable-data-science
  2. Import the databricks archive file at the following URL:
  3. This should create a directory structure with the path: /Workspace/scalable-data-science/xtraResources/

An interesting analysis of the textual content of the State of the Union (SoU) addresses by all US presidents was done in:

Fig. 5. A river network captures the flow across history of US political discourse, as perceived by contemporaries. Time moves along the x axis. Clusters on semantic networks of 300 most frequent terms for each of 10 historical periods are displayed as vertical bars. Relations between clusters of adjacent periods are indexed by gray flows, whose density reflects their degree of connection. Streams that connect at any point in history may be considered to be part of the same system, indicated with a single color.

Let us investigate this dataset ourselves!

  1. We first get the source text data by scraping and parsing from http://stateoftheunion.onetwothree.net/texts/index.html as explained in scraping and parsing SoU addresses.
  2. This data is already made available in DBFS, our distributed file system.
  3. We only do the simplest word count with this data in this notebook and will do more sophisticated analyses in the sequel (including topic modeling, etc).

Key Data Management Concepts

The Structure Spectrum

(watch now 1:10):

Structure Spectrum by Anthony Joseph in BerkeleyX/CS100.1x

Here we will be working with unstructured or schema-never data (plain text files).


Files

(watch later 1:43):

Files by Anthony Joseph in BerkeleyX/CS100.1x

DBFS and dbutils - where is this dataset in our distributed file system?

  • Since we are on the databricks cloud, it has a file system called DBFS
  • DBFS is similar to HDFS, the Hadoop distributed file system
  • dbutils allows us to interact with DBFS.
  • The display command displays the list of files in a given directory in the file system.
display(dbutils.fs.ls("dbfs:/")) // Ctrl+Enter to display the files at the root of DBFS
display(dbutils.fs.ls("dbfs:/datasets/sou")) // Ctrl+Enter to display the files in dbfs:/datasets/sou

Let us display the head, i.e. the first few lines, of the file dbfs:/datasets/sou/17900108.txt to see what it contains, using the dbutils.fs.head method.
head(file: String, maxBytes: int = 65536): String -> Returns up to the first 'maxBytes' bytes of the given file as a String encoded in UTF-8 as follows:

dbutils.fs.head("dbfs:/datasets/sou/17900108.txt", 673) // Ctrl+Enter to get the first 673 bytes of the file (which corresponds to the first five lines)
You Try!

Uncomment and modify xxxx in the cell below to read the first 1000 bytes from the file.

//dbutils.fs.head("dbfs:/datasets/sou/17900108.txt", xxxx) // Ctrl+Enter to get the first 1000 bytes of the file

Read the file into Spark Context as an RDD of Strings

  • The textFile method on the available SparkContext sc can read the text file dbfs:/datasets/sou/17900108.txt into Spark and create an RDD of Strings
    • but the read is done lazily: nothing is actually computed until an action is performed on the RDD sou17900108!
val sou17900108 = sc.textFile("dbfs:/datasets/sou/17900108.txt") // Ctrl+Enter to read in the text file as RDD[String]
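This laziness can be illustrated without a Spark cluster: Scala's own Iterator is also lazy, so it makes a handy local analogue (this sketch is plain Scala, not Spark, and is not part of the original notebook).

```scala
// Local analogue of RDD laziness: Iterator.map only records a recipe.
val lines = Iterator("line one", "line two", "line three")
var evaluated = 0
val lengths = lines.map { s => evaluated += 1; s.length }

println(evaluated) // still 0: no "action" has been taken yet

val result = lengths.toList // forcing the iterator plays the role of an action

println(evaluated) // now 3: the computation actually ran
println(result)    // List(8, 8, 10)
```

Just as with an RDD, the map above costs nothing until something forces the results to materialize.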

Perform some actions on the RDD

  • Each String in the RDD sou17900108 represents one line of data from the file. We can perform the following actions on the RDD:
    • count the number of elements in the RDD sou17900108 (i.e., the number of lines in the text file dbfs:/datasets/sou/17900108.txt) using sou17900108.count()
    • display the contents of the RDD using take or collect.
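The word count itself can be sketched with ordinary Scala collections, mirroring the flatMap / map / reduceByKey pipeline one would run on the RDD (this local version, with made-up sample lines, is only an illustration of the shape of the computation, not the Spark code itself):

```scala
// Local sketch of the classic word-count pipeline (plain Scala, not Spark).
// On an RDD the same pipeline would look like:
//   rdd.flatMap(_.split("\\s+")).map((_, 1)).reduceByKey(_ + _)
val lines = Seq(
  "Fellow-Citizens of the Senate and House of Representatives",
  "the Senate and the House"
)

val counts: Map[String, Int] = lines
  .flatMap(_.toLowerCase.split("\\s+")) // split each line into words
  .filter(_.nonEmpty)                   // drop empty tokens
  .groupBy(identity)                    // local stand-in for reduceByKey
  .map { case (word, occs) => (word, occs.size) }

println(counts("the"))    // 3: "the" appears three times in these two lines
println(counts("senate")) // 2
```

The distributed version has the same structure; reduceByKey simply performs the grouping and summing across partitions instead of in local memory.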