006_WordCount(Scala)


ScaDaMaLe Course site and book

Word Count on US State of the Union (SoU) Addresses

  • Word Count in big data is the equivalent of Hello World in programming
  • We count the number of occurrences of each word in the first and the last (2016) SoU addresses.

Prerequisite: see DO NOW below. You should have loaded the data as instructed in scalable-data-science/xtraResources/sdsDatasets.

DO NOW (if not done already)

In your databricks community edition:

  1. In your WorkSpace create a Folder named scalable-data-science
  2. Import the databricks archive file at the following URL:
  3. This should create a directory structure with the path: /Workspace/scalable-data-science/xtraResources/

An interesting analysis of the textual content of the State of the Union (SoU) addresses by all US presidents was done in:

Fig. 5. A river network captures the flow across history of US political discourse, as perceived by contemporaries. Time moves along the x axis. Clusters on semantic networks of 300 most frequent terms for each of 10 historical periods are displayed as vertical bars. Relations between clusters of adjacent periods are indexed by gray flows, whose density reflects their degree of connection. Streams that connect at any point in history may be considered to be part of the same system, indicated with a single color.

Let us investigate this dataset ourselves!

  1. We first get the source text data by scraping and parsing from http://stateoftheunion.onetwothree.net/texts/index.html as explained in scraping and parsing SoU addresses.
  2. This data is already made available in DBFS, our distributed file system.
  3. We only do the simplest word count with this data in this notebook and will do more sophisticated analyses in the sequel (including topic modeling, etc).

Key Data Management Concepts

The Structure Spectrum

(watch now 1:10):

Structure Spectrum by Anthony Joseph in BerkeleyX/CS100.1x

Here we will be working with unstructured or schema-never data (plain text files).


Files

(watch later 1:43):

Files by Anthony Joseph in BerkeleyX/CS100.1x

DBFS and dbutils - where is this dataset in our distributed file system?

  • Since we are on the databricks cloud, it has a file system called DBFS
  • DBFS is similar to HDFS, the Hadoop distributed file system
  • dbutils allows us to interact with DBFS.
  • The display command displays the list of files in a given directory in the file system.
display(dbutils.fs.ls("dbfs:/")) // Ctrl+Enter to display the files at the root of DBFS
display(dbutils.fs.ls("dbfs:/datasets/sou")) // Ctrl+Enter to display the files in dbfs:/datasets/sou

Let us display the head, i.e. the first few lines, of the file dbfs:/datasets/sou/17900108.txt to see what it contains, using the dbutils.fs.head method.
head(file: String, maxBytes: int = 65536): String -> Returns up to the first 'maxBytes' bytes of the given file as a String encoded in UTF-8 as follows:

dbutils.fs.head("dbfs:/datasets/sou/17900108.txt", 673) // Ctrl+Enter to get the first 673 bytes of the file (which corresponds to the first five lines)
You Try!

Uncomment and modify xxxx in the cell below to read the first 1000 bytes from the file.

//dbutils.fs.head("dbfs:/datasets/sou/17900108.txt", xxxx) // Ctrl+Enter to get the first 1000 bytes of the file

Read the file into Spark Context as an RDD of Strings

  • The textFile method on the available SparkContext sc can read the text file dbfs:/datasets/sou/17900108.txt into Spark and create an RDD of Strings
    • but the read is done lazily: nothing is actually computed until an action is performed on the RDD sou17900108!
val sou17900108 = sc.textFile("dbfs:/datasets/sou/17900108.txt") // Ctrl+Enter to read in the text file as RDD[String]
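This laziness can be illustrated without a Spark cluster: Scala's own Iterator is also lazy, so it makes a handy local analogue (this sketch is plain Scala, not Spark, and is not part of the original notebook).

```scala
// Local analogue of RDD laziness: Iterator.map only records a recipe.
val lines = Iterator("line one", "line two", "line three")
var evaluated = 0
val lengths = lines.map { s => evaluated += 1; s.length }

println(evaluated) // still 0: no "action" has been taken yet

val result = lengths.toList // forcing the iterator plays the role of an action

println(evaluated) // now 3: the computation actually ran
println(result)    // List(8, 8, 10)
```

Just as with an RDD, the map above costs nothing until something forces the results to materialize.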

Perform some actions on the RDD

  • Each String in the RDD sou17900108 represents one line of data from the file. We can perform the following actions on the RDD:
    • count the number of elements in the RDD sou17900108 (i.e., the number of lines in the text file dbfs:/datasets/sou/17900108.txt) using sou17900108.count()
    • display the contents of the RDD using take or collect.
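The word count itself can be sketched with ordinary Scala collections, mirroring the flatMap / map / reduceByKey pipeline one would run on the RDD (this local version, with made-up sample lines, is only an illustration of the shape of the computation, not the Spark code itself):

```scala
// Local sketch of the classic word-count pipeline (plain Scala, not Spark).
// On an RDD the same pipeline would look like:
//   rdd.flatMap(_.split("\\s+")).map((_, 1)).reduceByKey(_ + _)
val lines = Seq(
  "Fellow-Citizens of the Senate and House of Representatives",
  "the Senate and the House"
)

val counts: Map[String, Int] = lines
  .flatMap(_.toLowerCase.split("\\s+")) // split each line into words
  .filter(_.nonEmpty)                   // drop empty tokens
  .groupBy(identity)                    // local stand-in for reduceByKey
  .map { case (word, occs) => (word, occs.size) }

println(counts("the"))    // 3: "the" appears three times in these two lines
println(counts("senate")) // 2
```

The distributed version has the same structure; reduceByKey simply performs the grouping and summing across partitions instead of in local memory.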