%md
# Word Count on US State of the Union (SoU) Addresses
* Word Count in big data is the equivalent of `Hello World` in programming
* We count the number of occurrences of each word in the first and last (2016) SoU addresses.

**Prerequisite:** see **DO NOW** below. You should have loaded the data as instructed in `scalable-data-science/xtraResources/sdsDatasets`.
#### DO NOW (if not done already)
In your Databricks Community Edition:
1. In your `WorkSpace` create a Folder named `scalable-data-science`
2. `Import` the databricks archive file at the following URL:
* [https://github.com/lamastex/scalable-data-science/raw/master/dbcArchives/2017/parts/xtraResources.dbc](https://github.com/lamastex/scalable-data-science/raw/master/dbcArchives/2017/parts/xtraResources.dbc)
3. This should open a structure of directories with the path `/Workspace/scalable-data-science/xtraResources/`
%md
An interesting analysis of the textual content of the *State of the Union (SoU)* addresses by all US presidents was done in:
* [Alix Rule, Jean-Philippe Cointet, and Peter S. Bearman, Lexical shifts, substantive changes, and continuity in State of the Union discourse, 1790–2014, PNAS 2015 112 (35) 10837-10844; doi:10.1073/pnas.1512221112](http://www.pnas.org/content/112/35/10837.full).

[Fig. 5](http://www.pnas.org/content/112/35/10837.full). A river network captures the flow across history of US political discourse, as perceived by contemporaries. Time moves along the x axis. Clusters on semantic networks of 300 most frequent terms for each of 10 historical periods are displayed as vertical bars. Relations between clusters of adjacent periods are indexed by gray flows, whose density reflects their degree of connection. Streams that connect at any point in history may be considered to be part of the same system, indicated with a single color.
## Let us investigate this dataset ourselves!
1. We first get the source text data by scraping and parsing from [http://stateoftheunion.onetwothree.net/texts/index.html](http://stateoftheunion.onetwothree.net/texts/index.html) as explained in
[scraping and parsing SoU addresses](/#workspace/scalable-data-science/xtraResources/sdsDatasets/scraperUSStateofUnionAddresses).
* This data is already made available in DBFS, our distributed file system.
* We only do the simplest word count with this data in this notebook and will do more sophisticated analyses in the sequel (including topic modeling, etc).
%md
## Key Data Management Concepts
### The Structure Spectrum
**(watch now 1:10)**:
[The Structure Spectrum (video)](https://www.youtube.com/watch?v=pMSGGZVSwqo?rel=0&autoplay=1&modestbranding=1&start=1&end=70)
Here we will be working with **unstructured** or **schema-never** data (plain text files).
***
### Files
**(watch later 1:43)**:
[Files (video)](https://www.youtube.com/watch?v=NJyBQ-cQ3Ac?rel=0&autoplay=1&modestbranding=1&start=1)
%md
### DBFS and dbutils - where is this dataset in our distributed file system?
* Since we are on the Databricks cloud, it has a file system called DBFS
* DBFS is similar to HDFS, the Hadoop Distributed File System
* `dbutils` allows us to interact with DBFS.
* The `display` command displays the list of files in a given directory in the file system, as shown in the sketch after this list.
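For example, a minimal sketch of listing the files for this notebook (assuming the SoU texts were loaded under `dbfs:/datasets/sou/`, consistent with the file path used below):

```scala
// list the files under the sou directory in DBFS and render them as a table
display(dbutils.fs.ls("dbfs:/datasets/sou"))
```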
%md
Let us display the *head* or the first few lines of the file `dbfs:/datasets/sou/17900108.txt` to see what it contains using `dbutils.fs.head` method.
`head(file: String, maxBytes: int = 65536): String` -> Returns up to the first 'maxBytes' bytes of the given file as a String encoded in UTF-8
as follows:
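For instance, a minimal sketch (reading 1000 bytes here is an arbitrary choice; omitting `maxBytes` reads up to the 65536-byte default):

```scala
// return up to the first 1000 bytes of the file as a UTF-8 String
dbutils.fs.head("dbfs:/datasets/sou/17900108.txt", 1000)
```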
%md
### Read the file into Spark Context as an RDD of Strings
* The `textFile` method on the available `SparkContext` `sc` can read the text file `dbfs:/datasets/sou/17900108.txt` into Spark and create an RDD of Strings
* but this is done lazily until an action is taken on the RDD `sou17900108`!
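A one-line sketch of this lazy read (path and variable name as above):

```scala
// lazily read the text file into an RDD[String], one element per line;
// no data is actually read until an action is performed on sou17900108
val sou17900108 = sc.textFile("dbfs:/datasets/sou/17900108.txt")
```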
%md
### Perform some actions on the RDD
* Each String in the RDD `sou17900108` represents one line of data from the file. We can perform the following actions on the RDD:
* count the number of elements in the RDD `sou17900108` (i.e., the number of lines in the text file `dbfs:/datasets/sou/17900108.txt`) using `sou17900108.count()`
* display the contents of the RDD using `take` or `collect`.
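A minimal sketch of these actions (each one forces evaluation of the lazy RDD):

```scala
sou17900108.count()    // number of lines in the text file
sou17900108.take(5)    // Array of the first 5 lines
sou17900108.collect()  // Array of all lines -- fine for one address, avoid on truly big data
```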
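Finally, the word count itself chains a few transformations before a terminal action. Here is a minimal sketch using the standard RDD API; the name `wordCounts` is ours, and the whitespace-only tokenization is deliberately naive (the more sophisticated analyses in the sequel refine this):

```scala
// classic word count: tokenize, pair each word with 1, sum counts per word
val wordCounts = sou17900108
  .flatMap(line => line.split("\\s+"))  // naive tokenization on whitespace
  .filter(word => word.nonEmpty)        // drop empty tokens from blank lines
  .map(word => (word, 1))               // key each word with a count of 1
  .reduceByKey(_ + _)                   // sum counts for each distinct word

// the 10 most frequent words
wordCounts.sortBy(_._2, ascending = false).take(10).foreach(println)
```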