006_WordCount(Scala)

Word Count on US State of the Union (SoU) Addresses

  • Word Count in big data is the equivalent of Hello World in programming
  • We count the number of occurrences of each word in the first and last (2016) SoU addresses.

Prerequisite: see DO NOW below. You should have loaded the data as instructed in scalable-data-science/xtraResources/sdsDatasets.

DO NOW (if not done already)

In your databricks community edition:

  1. In your WorkSpace create a Folder named scalable-data-science
  2. Import the databricks archive file at the following URL:
  3. This should open a structure of directories with path: /Workspace/scalable-data-science/xtraResources/

An interesting analysis of the textual content of the State of the Union (SoU) addresses by all US presidents was done in:

Fig. 5. A river network captures the flow across history of US political discourse, as perceived by contemporaries. Time moves along the x axis. Clusters on semantic networks of 300 most frequent terms for each of 10 historical periods are displayed as vertical bars. Relations between clusters of adjacent periods are indexed by gray flows, whose density reflects their degree of connection. Streams that connect at any point in history may be considered to be part of the same system, indicated with a single color.

Let us investigate this dataset ourselves!

  1. We first get the source text data by scraping and parsing from http://stateoftheunion.onetwothree.net/texts/index.html as explained in scraping and parsing SoU addresses.
  2. This data is already made available in DBFS, our distributed file system.
  3. We only do the simplest word count with this data in this notebook and will do more sophisticated analyses in the sequel (including topic modeling, etc.).

Project Suggestion

Streaming/NLP/Vertex-Programs, etc:

Key Data Management Concepts

The Structure Spectrum

(watch now 1:10):

Structure Spectrum by Anthony Joseph in BerkeleyX/CS100.1x

Here we will be working with unstructured or schema-never data (plain text files).


Files

(watch later 1:43):

Files by Anthony Joseph in BerkeleyX/CS100.1x

DBFS and dbutils - where is this dataset in our distributed file system?

  • Since we are on the Databricks cloud, we can use its file system, called DBFS
  • DBFS is similar to HDFS, the Hadoop distributed file system
  • dbutils allows us to interact with DBFS.
  • The 'display' command displays the list of files in a given directory in the file system.
display(dbutils.fs.ls("dbfs:/datasets/sou")) // Ctrl+Enter to display the files in dbfs:/datasets/sou
path  name  size
dbfs:/datasets/sou/17900108.txt  17900108.txt  6725
dbfs:/datasets/sou/17901208.txt  17901208.txt  8427
dbfs:/datasets/sou/17911025.txt  17911025.txt  14175
dbfs:/datasets/sou/17921106.txt  17921106.txt  12736
dbfs:/datasets/sou/17931203.txt  17931203.txt  11668
dbfs:/datasets/sou/17941119.txt  17941119.txt  17615
dbfs:/datasets/sou/17951208.txt  17951208.txt  12296
dbfs:/datasets/sou/17961207.txt  17961207.txt  17340
dbfs:/datasets/sou/17971122.txt  17971122.txt  12473
dbfs:/datasets/sou/17981208.txt  17981208.txt  13394
dbfs:/datasets/sou/17991203.txt  17991203.txt  9236
dbfs:/datasets/sou/18001111.txt  18001111.txt  8382
dbfs:/datasets/sou/18011208.txt  18011208.txt  19342
dbfs:/datasets/sou/18021215.txt  18021215.txt  13003
dbfs:/datasets/sou/18031017.txt  18031017.txt  14022
dbfs:/datasets/sou/18041108.txt  18041108.txt  12652
dbfs:/datasets/sou/18051203.txt  18051203.txt  17190
dbfs:/datasets/sou/18061202.txt  18061202.txt  17135
dbfs:/datasets/sou/18071027.txt  18071027.txt  14334
dbfs:/datasets/sou/18081108.txt  18081108.txt  16225
dbfs:/datasets/sou/18091129.txt  18091129.txt  11050
dbfs:/datasets/sou/18101205.txt  18101205.txt  15028
dbfs:/datasets/sou/18111105.txt  18111105.txt  13941
dbfs:/datasets/sou/18121104.txt  18121104.txt  19615
dbfs:/datasets/sou/18131207.txt  18131207.txt  19532
dbfs:/datasets/sou/18140920.txt  18140920.txt  12632
dbfs:/datasets/sou/18151205.txt  18151205.txt  19398
dbfs:/datasets/sou/18161203.txt  18161203.txt  20331
dbfs:/datasets/sou/18171212.txt  18171212.txt  26236
dbfs:/datasets/sou/18181116.txt  18181116.txt  26445
dbfs:/datasets/sou/18191207.txt  18191207.txt  27880
dbfs:/datasets/sou/18201114.txt  18201114.txt  20503
dbfs:/datasets/sou/18211203.txt  18211203.txt  34364
dbfs:/datasets/sou/18221203.txt  18221203.txt  28154
dbfs:/datasets/sou/18231202.txt  18231202.txt  38329
dbfs:/datasets/sou/18241207.txt  18241207.txt  49869
dbfs:/datasets/sou/18251206.txt  18251206.txt  53992
dbfs:/datasets/sou/18261205.txt  18261205.txt  46482
dbfs:/datasets/sou/18271204.txt  18271204.txt  42481
dbfs:/datasets/sou/18281202.txt  18281202.txt  44202
dbfs:/datasets/sou/18291208.txt  18291208.txt  62923
dbfs:/datasets/sou/18301206.txt  18301206.txt  90641
dbfs:/datasets/sou/18311206.txt  18311206.txt  42902
dbfs:/datasets/sou/18321204.txt  18321204.txt  46879
dbfs:/datasets/sou/18331203.txt  18331203.txt  46991
dbfs:/datasets/sou/18341201.txt  18341201.txt  80364
dbfs:/datasets/sou/18351207.txt  18351207.txt  64395
dbfs:/datasets/sou/18361205.txt  18361205.txt  73306
dbfs:/datasets/sou/18371205.txt  18371205.txt  68927
dbfs:/datasets/sou/18381203.txt  18381203.txt  69880
dbfs:/datasets/sou/18391202.txt  18391202.txt  80147
dbfs:/datasets/sou/18401205.txt  18401205.txt  55025
dbfs:/datasets/sou/18411207.txt  18411207.txt  48792
dbfs:/datasets/sou/18421206.txt  18421206.txt  49788
dbfs:/datasets/sou/18431206.txt  18431206.txt  47670
dbfs:/datasets/sou/18441203.txt  18441203.txt  55494
dbfs:/datasets/sou/18451202.txt  18451202.txt  95894
dbfs:/datasets/sou/18461208.txt  18461208.txt  107852
dbfs:/datasets/sou/18471207.txt  18471207.txt  96912
dbfs:/datasets/sou/18481205.txt  18481205.txt  127557
dbfs:/datasets/sou/18491204.txt  18491204.txt  46003
dbfs:/datasets/sou/18501202.txt  18501202.txt  49823
dbfs:/datasets/sou/18511202.txt  18511202.txt  79335
dbfs:/datasets/sou/18521206.txt  18521206.txt  59438
dbfs:/datasets/sou/18531205.txt  18531205.txt  58031
dbfs:/datasets/sou/18541204.txt  18541204.txt  61917
dbfs:/datasets/sou/18551231.txt  18551231.txt  70459
dbfs:/datasets/sou/18561202.txt  18561202.txt  63906
dbfs:/datasets/sou/18571208.txt  18571208.txt  82051
dbfs:/datasets/sou/18581206.txt  18581206.txt  98523
dbfs:/datasets/sou/18591219.txt  18591219.txt  74089
dbfs:/datasets/sou/18601203.txt  18601203.txt  84283
dbfs:/datasets/sou/18611203.txt  18611203.txt  41587
dbfs:/datasets/sou/18621201.txt  18621201.txt  50008
dbfs:/datasets/sou/18631208.txt  18631208.txt  37109
dbfs:/datasets/sou/18641206.txt  18641206.txt  36201
dbfs:/datasets/sou/18651204.txt  18651204.txt  54781
dbfs:/datasets/sou/18661203.txt  18661203.txt  44152
dbfs:/datasets/sou/18671203.txt  18671203.txt  71650
dbfs:/datasets/sou/18681209.txt  18681209.txt  60650
dbfs:/datasets/sou/18691206.txt  18691206.txt  46099
dbfs:/datasets/sou/18701205.txt  18701205.txt  52113
dbfs:/datasets/sou/18711204.txt  18711204.txt  38805
dbfs:/datasets/sou/18721202.txt  18721202.txt  23984
dbfs:/datasets/sou/18731201.txt  18731201.txt  60406
dbfs:/datasets/sou/18741207.txt  18741207.txt  55136
dbfs:/datasets/sou/18751207.txt  18751207.txt  73272
dbfs:/datasets/sou/18761205.txt  18761205.txt  40873
dbfs:/datasets/sou/18771203.txt  18771203.txt  48620
dbfs:/datasets/sou/18781202.txt  18781202.txt  48552
dbfs:/datasets/sou/18791201.txt  18791201.txt  71149
dbfs:/datasets/sou/18801206.txt  18801206.txt  41294
dbfs:/datasets/sou/18811206.txt  18811206.txt  24189
dbfs:/datasets/sou/18821204.txt  18821204.txt  19065
dbfs:/datasets/sou/18831204.txt  18831204.txt  23860
dbfs:/datasets/sou/18841201.txt  18841201.txt  55230
dbfs:/datasets/sou/18851208.txt  18851208.txt  121030
dbfs:/datasets/sou/18861206.txt  18861206.txt  92873
dbfs:/datasets/sou/18871206.txt  18871206.txt  31685
dbfs:/datasets/sou/18881203.txt  18881203.txt  55460

Let us display the head, i.e., the first few lines, of the file dbfs:/datasets/sou/17900108.txt to see what it contains, using the dbutils.fs.head method.
head(file: String, maxBytes: int = 65536): String -> Returns up to the first 'maxBytes' bytes of the given file as a String encoded in UTF-8. We use it as follows:

dbutils.fs.head("dbfs:/datasets/sou/17900108.txt",673) // Ctrl+Enter to get the first 673 bytes of the file (which corresponds to the first five lines)
[Truncated to first 673 bytes] res1: String = "George Washington January 8, 1790 Fellow-Citizens of the Senate and House of Representatives: I embrace with great satisfaction the opportunity which now presents itself of congratulating you on the present favorable prospects of our public affairs. The recent accession of the important state of North Carolina to the Constitution of the United States (of which official information has been received), the rising credit and respectability of our country, the general and increasing good will toward the government of the Union, and the concord, peace, and plenty with which we are blessed are circumstances auspicious in an eminent degree to our national prosperity. "
You Try!

Uncomment and modify xxxx in the cell below to read the first 1000 bytes from the file.

//dbutils.fs.head("dbfs:/datasets/sou/17900108.txt", xxxx) // Ctrl+Enter to get the first 1000 bytes of the file

Read the file into Spark Context as an RDD of Strings

  • The textFile method on the available SparkContext sc can read the text file dbfs:/datasets/sou/17900108.txt into Spark and create an RDD of Strings
    • but this is done lazily: nothing is actually read until an action is taken on the RDD sou17900108!
val sou17900108 = sc.textFile("dbfs:/datasets/sou/17900108.txt") // Ctrl+Enter to read in the text file as RDD[String]
sou17900108: org.apache.spark.rdd.RDD[String] = dbfs:/datasets/sou/17900108.txt MapPartitionsRDD[136] at textFile at command-112937334110511:1
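
The lazy behaviour can be mimicked without Spark. The sketch below is only an analogy on a local Scala Iterator, not Spark's own mechanism: the names linesRead, source and lazyLines are hypothetical, and the three sample strings merely stand in for the text file.

```scala
// Local analogy only: a Scala Iterator defers work in the same spirit as an RDD.
var linesRead = 0 // counts how many lines have actually been touched

// Hypothetical stand-in for the text file.
val source: Iterator[String] = Iterator("George Washington", "", "January 8, 1790")

// Like sc.textFile followed by a transformation: this only describes the work.
val lazyLines: Iterator[String] = source.map { line => linesRead += 1; line }
val readBeforeAction: Int = linesRead // 0: nothing has been read yet

// Like an action such as count(): consuming the iterator forces the work.
val numLines: Int = lazyLines.size
val readAfterAction: Int = linesRead // 3: every line has now been read once
```

The counter stays at zero until the iterator is consumed, in the same way that no line of the file is read before the first action on the RDD.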

Perform some actions on the RDD

  • Each String in the RDD sou17900108 represents one line of data from the file. We can perform the following actions on the RDD:
    • count the number of elements in the RDD sou17900108 (i.e., the number of lines in the text file dbfs:/datasets/sou/17900108.txt) using sou17900108.count()
    • display the contents of the RDD using take or collect.
sou17900108.count() // <Shift+Enter> to count the number of elements in the RDD
res2: Long = 23
sou17900108.take(5) // <Shift+Enter> to display the first 5 elements of RDD
res3: Array[String] = Array("George Washington ", "", "January 8, 1790 ", "Fellow-Citizens of the Senate and House of Representatives: ", "I embrace with great satisfaction the opportunity which now presents itself of congratulating you on the present favorable prospects of our public affairs. The recent accession of the important state of North Carolina to the Constitution of the United States (of which official information has been received), the rising credit and respectability of our country, the general and increasing good will toward the government of the Union, and the concord, peace, and plenty with which we are blessed are circumstances auspicious in an eminent degree to our national prosperity. ")
sou17900108.take(5).foreach(println) // <Shift+Enter> to display the first 5 elements of RDD line by line
George Washington 

January 8, 1790 
Fellow-Citizens of the Senate and House of Representatives: 
I embrace with great satisfaction the opportunity which now presents itself of congratulating you on the present favorable prospects of our public affairs. The recent accession of the important state of North Carolina to the Constitution of the United States (of which official information has been received), the rising credit and respectability of our country, the general and increasing good will toward the government of the Union, and the concord, peace, and plenty with which we are blessed are circumstances auspicious in an eminent degree to our national prosperity. 
sou17900108.collect // <Ctrl+Enter> to display all the elements of RDD
res5: Array[String] = Array("George Washington ", "", "January 8, 1790 ", "Fellow-Citizens of the Senate and House of Representatives: ", "I embrace with great satisfaction the opportunity which now presents itself of congratulating you on the present favorable prospects of our public affairs. The recent accession of the important state of North Carolina to the Constitution of the United States (of which official information has been received), the rising credit and respectability of our country, the general and increasing good will toward the government of the Union, and the concord, peace, and plenty with which we are blessed are circumstances auspicious in an eminent degree to our national prosperity. ", "In resuming your consultations for the general good you can not but derive encouragement from the reflection that the measures of the last session have been as satisfactory to your constituents as the novelty and difficulty of the work allowed you to hope. Still further to realize their expectations and to secure the blessings which a gracious Providence has placed within our reach will in the course of the present important session call for the cool and deliberate exertion of your patriotism, firmness, and wisdom. ", "Among the many interesting objects which will engage your attention that of providing for the common defense will merit particular regard. To be prepared for war is one of the most effectual means of preserving peace. ", "A free people ought not only to be armed, but disciplined; to which end a uniform and well-digested plan is requisite; and their safety and interest require that they should promote such manufactories as tend to render them independent of others for essential, particularly military, supplies. ", "The proper establishment of the troops which may be deemed indispensable will be entitled to mature consideration. 
In the arrangements which may be made respecting it it will be of importance to conciliate the comfortable support of the officers and soldiers with a due regard to economy. ", "There was reason to hope that the pacific measures adopted with regard to certain hostile tribes of Indians would have relieved the inhabitants of our southern and western frontiers from their depredations, but you will perceive from the information contained in the papers which I shall direct to be laid before you (comprehending a communication from the Commonwealth of Virginia) that we ought to be prepared to afford protection to those parts of the Union, and, if necessary, to punish aggressors. ", "The interests of the United States require that our intercourse with other nations should be facilitated by such provisions as will enable me to fulfill my duty in that respect in the manner which circumstances may render most conducive to the public good, and to this end that the compensation to be made to the persons who may be employed should, according to the nature of their appointments, be defined by law, and a competent fund designated for defraying the expenses incident to the conduct of foreign affairs. ", "Various considerations also render it expedient that the terms on which foreigners may be admitted to the rights of citizens should be speedily ascertained by a uniform rule of naturalization. ", "Uniformity in the currency, weights, and measures of the United States is an object of great importance, and will, I am persuaded, be duly attended to. 
", "The advancement of agriculture, commerce, and manufactures by all proper means will not, I trust, need recommendation; but I can not forbear intimating to you the expediency of giving effectual encouragement as well to the introduction of new and useful inventions from abroad as to the exertions of skill and genius in producing them at home, and of facilitating the intercourse between the distant parts of our country by a due attention to the post-office and post-roads. ", "Nor am I less persuaded that you will agree with me in opinion that there is nothing which can better deserve your patronage than the promotion of science and literature. Knowledge is in every country the surest basis of public happiness. In one in which the measures of government receive their impressions so immediately from the sense of the community as in ours it is proportionably essential. ", "To the security of a free constitution it contributes in various ways--by convincing those who are intrusted with the public administration that every valuable end of government is best answered by the enlightened confidence of the people, and by teaching the people themselves to know and to value their own rights; to discern and provide against invasions of them; to distinguish between oppression and the necessary exercise of lawful authority; between burthens proceeding from a disregard to their convenience and those resulting from the inevitable exigencies of society; to discriminate the spirit of liberty from that of licentiousness-- cherishing the first, avoiding the last--and uniting a speedy but temperate vigilance against encroachments, with an inviolable respect to the laws. ", "Whether this desirable object will be best promoted by affording aids to seminaries of learning already established, by the institution of a national university, or by any other expedients will be well worthy of a place in the deliberations of the legislature. 
", "Gentlemen of the House of Representatives: ", "I saw with peculiar pleasure at the close of the last session the resolution entered into by you expressive of your opinion that an adequate provision for the support of the public credit is a matter of high importance to the national honor and prosperity. In this sentiment I entirely concur; and to a perfect confidence in your best endeavors to devise such a provision as will be truly with the end I add an equal reliance on the cheerful cooperation of the other branch of the legislature. ", "It would be superfluous to specify inducements to a measure in which the character and interests of the United States are so obviously so deeply concerned, and which has received so explicit a sanction from your declaration. ", "Gentlemen of the Senate and House of Representatives: ", "I have directed the proper officers to lay before you, respectively, such papers and estimates as regard the affairs particularly recommended to your consideration, and necessary to convey to you that information of the state of the Union which it is my duty to afford. ", The welfare of our country is the great object to which our cares and efforts ought to be directed, and I shall derive great satisfaction from a cooperation with you in the pleasing though arduous task of insuring to our fellow citizens the blessings which they have a right to expect from a free, efficient, and equal government.)

Cache the RDD in (distributed) memory to avoid recreating it for each action

  • Above, every time we took an action on the same RDD, the RDD was reconstructed from the text file.
    • Spark's advantage compared to Hadoop MapReduce is the ability to cache or store the RDD in distributed memory across the nodes.
  • Let's use .cache() after creating an RDD so that it is in memory after the first action (and thus avoid reconstruction for subsequent actions).
// Shift+Enter to read in the textfile as RDD[String] and cache it in distributed memory
val sou17900108 = sc.textFile("dbfs:/datasets/sou/17900108.txt")
sou17900108.cache() // cache the RDD in memory
sou17900108: org.apache.spark.rdd.RDD[String] = dbfs:/datasets/sou/17900108.txt MapPartitionsRDD[138] at textFile at command-112937334110518:2
res6: sou17900108.type = dbfs:/datasets/sou/17900108.txt MapPartitionsRDD[138] at textFile at command-112937334110518:2
sou17900108.count() // Shift+Enter during this count action the RDD is constructed from the text file and cached
res7: Long = 23
sou17900108.count() // Shift+Enter during this count action the cached RDD is used (notice less time taken by the same command)
res8: Long = 23
sou17900108.take(5) // <Ctrl+Enter> to display the first 5 elements of the cached RDD
res9: Array[String] = Array("George Washington ", "", "January 8, 1790 ", "Fellow-Citizens of the Senate and House of Representatives: ", "I embrace with great satisfaction the opportunity which now presents itself of congratulating you on the present favorable prospects of our public affairs. The recent accession of the important state of North Carolina to the Constitution of the United States (of which official information has been received), the rising credit and respectability of our country, the general and increasing good will toward the government of the Union, and the concord, peace, and plenty with which we are blessed are circumstances auspicious in an eminent degree to our national prosperity. ")
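
The effect of caching can be illustrated with a plain Scala sketch. This is a local analogy under stated assumptions, not Spark code: loadLines is a hypothetical stand-in for rebuilding the RDD from the text file, and the counter loads records how often that happens.

```scala
var loads = 0 // how many times the "file" has been read

// Hypothetical stand-in for constructing the RDD from the text file.
def loadLines(): Seq[String] = {
  loads += 1
  Seq("George Washington", "", "January 8, 1790")
}

// Without caching: every action rebuilds the data from the source.
val firstCount: Int = loadLines().size
val secondCount: Int = loadLines().size
val loadsWithoutCache: Int = loads // 2

// With caching: materialise once, then reuse the stored result for each action.
val cachedLines: Seq[String] = loadLines()
val thirdCount: Int = cachedLines.size
val fourthCount: Int = cachedLines.size
val loadsWithCache: Int = loads - loadsWithoutCache // 1
```

One difference from this eager val: .cache() in Spark only marks the RDD for caching, and the data is stored when the first action runs, as the two count() calls above show.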

Lifecycle of a Spark Program

(watch now 0:23):

Spark Program Lifecycle by Anthony Joseph in BerkeleyX/CS100.1x

Summary
  • create RDDs from:
    • some external data source (such as a distributed file system)
    • parallelized collection in your driver program
  • lazily transform these RDDs into new RDDs
  • cache some of those RDDs for future reuse
  • perform actions to execute parallel computations and produce results

Transform lines to words

  • We need to loop through each line and split the line into words
  • For now, let us split using whitespace
  • More sophisticated regular expressions can be used to split the line (as we will see soon)
sou17900108
.flatMap(line => line.split(" "))
.take(100)
res10: Array[String] = Array(George, Washington, "", January, 8,, 1790, Fellow-Citizens, of, the, Senate, and, House, of, Representatives:, I, embrace, with, great, satisfaction, the, opportunity, which, now, presents, itself, of, congratulating, you, on, the, present, favorable, prospects, of, our, public, affairs., The, recent, accession, of, the, important, state, of, North, Carolina, to, the, Constitution, of, the, United, States, (of, which, official, information, has, been, received),, the, rising, credit, and, respectability, of, our, country,, the, general, and, increasing, good, will, toward, the, government, of, the, Union,, and, the, concord,, peace,, and, plenty, with, which, we, are, blessed, are, circumstances, auspicious, in, an, eminent, degree, to)
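
To see why flatMap rather than map is used here, the following sketch applies both to a small local Seq. It assumes only the first three lines of the file as sample data and runs on ordinary Scala collections, so the names lines, perLine and words are illustrative.

```scala
// First three lines of the address, as returned by take above.
val lines: Seq[String] = Seq("George Washington ", "", "January 8, 1790 ")

// map keeps one result per line: a sequence of word arrays.
val perLine: Seq[Array[String]] = lines.map(line => line.split(" "))

// flatMap concatenates those arrays into one flat sequence of words.
val words: Seq[String] = lines.flatMap(line => line.split(" "))
// words == Seq("George", "Washington", "", "January", "8,", "1790")
```

Note that the empty line contributes one empty string "", because splitting an empty String yields an array containing that empty String.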

Naive word count

At first glance, to do a word count of George Washington's SoU address, we are tempted to do the following:

  • just break each line by the whitespace character " " and find the words using a flatMap
  • then do the map with the closure word => (word, 1) to initialize each word with an integer count of 1
    • i.e., transform each word to a (key, value) pair or Tuple such as (word, 1)
  • then sum all values with the same key (word is the key in our case) by doing a
    • reduceByKey(_+_)
  • and finally collect() to display the results.
sou17900108
.flatMap( line => line.split(" ") )
.map( word => (word, 1) )
.reduceByKey(_+_)
.collect()
res11: Array[(String, Int)] = Array((country,3), (call,1), (promoted,1), (agree,1), (admitted,1), (House,3), (accession,1), (exertion,1), (have,4), (plenty,1), (consideration,,1), (incident,1), (session,3), (equal,2), (intimating,1), (we,2), (national,3), (been,2), (who,2), (eminent,1), (any,1), (immediately,1), (essential.,1), (western,1), (speedy,1), (institution,1), (respect,2), (discriminate,1), (me,2), (peace.,1), (free,2), (authority;,1), (affairs,1), (are,4), (administration,1), (8,,1), (parts,2), (presents,1), (frontiers,1), (blessings,2), (expressive,1), (introduction,1), (comfortable,1), (our,10), (as,9), (intrusted,1), (circumstances,2), (branch,1), (peace,,1), (contributes,1), (respectability,1), (better,1), (them,2), (independent,1), (proceeding,1), (duty,2), (law,,1), (foreigners,1), (satisfactory,1), (is,10), (convey,1), (appointments,,1), (favorable,1), (Senate,2), (am,2), (certain,1), (sense,1), (shall,2), (proper,3), (recommendation;,1), (States,4), (Virginia),1), (Commonwealth,1), (impressions,1), (they,2), (new,1), (my,2), (rising,1), (expedient,1), (uniting,1), (oppression,1), (free,,1), (hope.,1), (now,1), (due,2), (has,3), (university,,1), (deserve,1), (licentiousness--,1), (safety,1), (degree,1), (persons,1), (giving,1), (learning,1), (depredations,,1), (Washington,1), (conducive,1), (according,1), (need,1), (manufactures,1), (basis,1), (invasions,1), (honor,1), (exigencies,1), (fulfill,1), (directed,1), (render,3), (Still,1), (conduct,1), (southern,1), (well-digested,1), (objects,1), (Indians,1), (truly,1), (cares,1), (welfare,1), (foreign,1), (consultations,1), (resolution,1), (cherishing,1), (means,2), (this,3), (convincing,1), (deemed,1), (right,1), (There,1), (themselves,1), (general,2), (entirely,1), (explicit,1), (defense,1), (only,1), (importance,,1), (opinion,2), (security,1), (exercise,1), (Knowledge,1), (already,1), (established,,1), (particularly,2), (satisfaction,2), (realize,1), (afford.,1), (cheerful,1), (rule,1), (nations,1), 
(measure,1), (congratulating,1), (hope,1), (can,3), (resuming,1), (relieved,1), (country,,1), (communication,1), (will,,1), (aggressors.,1), (into,1), (there,1), (science,1), (hostile,1), (rights;,1), (trust,,1), (discern,1), (lay,1), (own,1), (reason,1), (directed,,1), (Among,1), (declaration.,1), (essential,,1), (patriotism,,1), (high,1), (mature,1), (laid,1), (surest,1), (compensation,1), (advancement,1), (respecting,1), (seminaries,1), (one,2), (with,11), (obviously,1), (first,,1), (best,3), (January,1), (importance,2), (interesting,1), (proportionably,1), (post-roads.,1), (duly,1), (consideration.,1), (attention,2), (promote,1), (economy.,1), (close,1), (Representatives:,3), (from,12), (other,3), (well,2), (affairs.,2), (interest,1), (afford,1), (further,1), (received),,1), (facilitated,1), (requisite;,1), (affording,1), (allowed,1), (adequate,1), (their,7), (concord,,1), (last,2), (expediency,1), (between,3), (will,13), (useful,1), (valuable,1), (information,3), ("",1), (confidence,2), (war,1), (provisions,1), (designated,1), (providing,1), (encroachments,,1), (important,2), (uniform,2), (vigilance,1), (so,4), (devise,1), (blessed,1), (Uniformity,1), (reliance,1), (it,6), (The,5), (than,1), (others,1), (attended,1), (deeply,1), (troops,1), (fund,1), (embrace,1), (protection,1), (secure,1), (desirable,1), (engage,1), (received,1), (such,4), (literature.,1), (add,1), (recommended,1), (papers,2), (burthens,1), (common,1), (end,4), (preserving,1), (Whether,1), (to.,1), (deliberations,1), (resulting,1), (place,1), (ways--by,1), (supplies.,1), (derive,2), (To,2), (laws.,1), (great,4), ((of,1), (establishment,1), (commerce,,1), (task,1), (armed,,1), (reflection,1), (less,1), (currency,,1), (lawful,1), (last--and,1), (inevitable,1), (expedients,1), (speedily,1), (the,92), (sentiment,1), (not,3), (nothing,1), (enable,1), (manufactories,1), (most,2), (if,1), (considerations,1), (be,20), (punish,1), (all,1), (contained,1), (though,1), (legislature.,2), (toward,1), 
(credit,2), (superfluous,1), (disregard,1), (rights,1), (regard,3), (but,5), (official,1), (deliberate,1), (skill,1), (persuaded,1), (itself,1), (increasing,1), (distinguish,1), (necessary,2), (Nor,1), (George,1), (on,3), (distant,1), (against,2), (would,2), (perfect,1), (before,2), (at,2), (object,3), (estimates,1), (them;,1), (should,,1), (interests,2), (Union,,2), (may,5), (government,3), (ascertained,1), (good,,1), (gracious,1), (or,1), (insuring,1), (I,11), (aids,1), (intercourse,2), (Union,1), (of,68), (respectively,,1), (fellow,1), (reach,1), (Various,1), (1790,1), (saw,1), (answered,1), (producing,1), (encouragement,2), (Carolina,1), (particular,1), (Fellow-Citizens,1), (inducements,1), (auspicious,1), (arrangements,1), (difficulty,1), (pacific,1), (opportunity,1), (prosperity.,2), (patronage,1), (A,1), (plan,1), (which,18), (cooperation,2), (you,,1), (also,1), (inhabitants,1), (competent,1), (require,2), (should,3), (tend,1), (genius,1), (naturalization.,1), (character,1), (promotion,1), (for,7), (Gentlemen,2), (teaching,1), (worthy,1), (placed,1), (effectual,2), (present,2), (entitled,1), (your,9), (inventions,1), (terms,1), (North,1), (cool,1), (happiness.,1), (officers,2), (Providence,1), (people,2), (abroad,1), (pleasure,1), (expect,1), (facilitating,1), (was,1), (merit,1), (community,1), (endeavors,1), (arduous,1), (exertions,1), (peculiar,1), (society;,1), (firmness,,1), (pleasing,1), (by,11), (expectations,1), (tribes,1), (efforts,1), (defined,1), (inviolable,1), (It,1), (value,1), (an,5), (soldiers,1), (temperate,1), (sanction,1), (disciplined;,1), (recent,1), (provision,2), (conciliate,1), (made,2), (constitution,1), (agriculture,,1), (concerned,,1), (enlightened,1), (novelty,1), (people,,1), (adopted,1), (efficient,,1), (defraying,1), (wisdom.,1), (employed,1), (convenience,1), (ought,3), (in,16), (provide,1), (weights,,1), (In,4), (good,2), (those,3), (necessary,,1), (support,2), (manner,1), (public,5), (course,1), (and,,1), (entered,1), 
(within,1), (ours,1), (receive,1), (prospects,1), (liberty,1), (every,2), (matter,1), (nature,1), (you,10), ((comprehending,1), (prepared,2), (various,1), (avoiding,1), (that,15), (a,20), (many,1), (spirit,1), (expenses,1), (not,,1), (work,1), (state,2), (government.,1), (concur;,1), (to,53), (know,1), (military,,1), (persuaded,,1), (post-office,1), (perceive,1), (Constitution,1), (specify,1), (regard.,1), (and,39), (indispensable,1), (constituents,1), (home,,1), (forbear,1), (United,4), (direct,1), (citizens,2), (measures,4))

Unfortunately, as you can see from the collect above:

  • some words carry trailing punctuation, so the same word is counted as different words, e.g., (importance,2) and (importance,,1)
  • empty words are being counted, e.g., ("",1)
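
Both problems can be reproduced on a tiny local collection. The sketch below is an analogue on plain Scala collections, with made-up sample lines; groupBy followed by a sum stands in for reduceByKey(_+_), which exists only on RDDs of pairs.

```scala
// Hypothetical sample: two lines of text and one empty line.
val sampleLines: Seq[String] = Seq(
  "an object of great importance, and will",
  "",
  "a matter of high importance"
)

// Same three steps as the naive word count, on a local Seq.
val naiveCounts: Map[String, Int] =
  sampleLines
    .flatMap(line => line.split(" "))
    .map(word => (word, 1))
    .groupBy { case (word, _) => word }
    .map { case (word, pairs) => (word, pairs.map(_._2).sum) }

val plainCount: Int = naiveCounts("importance")  // 1
val commaCount: Int = naiveCounts("importance,") // 1: counted under a separate key
val emptyCount: Int = naiveCounts("")            // 1: produced by the empty line
```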

So we need a bit of regex'ing or regular-expression matching (all readily available from Scala via Java String types).

We will cover the three things we want to do with a simple example from Middle Earth!

  • replace each run of whitespace characters with a single space character " "
  • replace all punctuation characters we specify within [ and ] such as [,?.!:;] by the empty string "" (i.e., remove these punctuation characters)
  • convert everything to lower-case.
val example = "Master, Master!   It's me, Sméagol... mhrhm*%* But they took away our precious, they wronged us. Gollum will protect us..., Master, it's me Sméagol."
example: String = Master, Master! It's me, Sméagol... mhrhm*%* But they took away our precious, they wronged us. Gollum will protect us..., Master, it's me Sméagol.
example
  .replaceAll("\\s+", " ") //replace multiple whitespace characters (including space, tab, new line, etc.) with one whitespace " "
  .replaceAll("""([,?.!:;])""", "") // remove the punctuation characters , ? . ! : ; by replacing each with the empty string ""
  .toLowerCase() // converting to lower-case
res12: String = master master it's me sméagol mhrhm*%* but they took away our precious they wronged us gollum will protect us master it's me sméagol
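
The three steps can be wrapped in a small helper that also splits the cleaned line into words. This is a sketch rather than the notebook's own code: the name tokenize is hypothetical, and the final filter that drops empty tokens is an added assumption, useful because an empty line or a removed punctuation mark between two spaces would otherwise yield "" as a word.

```scala
// Hypothetical helper: clean one line and split it into lower-case words.
def tokenize(line: String): Array[String] =
  line
    .replaceAll("\\s+", " ")          // collapse each run of whitespace into one space
    .replaceAll("[,?.!:;]", "")       // remove the listed punctuation characters
    .toLowerCase                      // convert to lower-case
    .split(" ")                       // split on single spaces
    .filter(token => token.nonEmpty)  // added step: drop empty tokens

val tokens: Array[String] = tokenize("Master, Master!   It's me, Sméagol...")
// Array("master", "master", "it's", "me", "sméagol")

val noTokens: Array[String] = tokenize("") // empty array: no "" token survives
```

Characters outside the bracketed set, such as parentheses or "--", are left untouched by this pattern and would need to be added to the character class if they should be removed too.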

More sophisticated word count

We are now ready to do a word count of George Washington's SoU on January 8th 1790 as follows:

val wordCount_sou17900108 = 
 sou17900108
    .flatMap(line => 
         line.replaceAll("\\s+", " ") //replace multiple whitespace characters (including space, tab, new line, etc.) with one whitespace " "
             .replaceAll("""([,?.!:;])""", "") // remove the punctuation characters , ? . ! : ; by replacing each with the empty string ""
             .toLowerCase() // converting to lower-case
             .split(" "))
    .map(x => (x, 1))
    .reduceByKey(_+_)
    
wordCount_sou17900108.collect()
wordCount_sou17900108: org.apache.spark.rdd.RDD[(String, Int)] = ShuffledRDD[145] at reduceByKey at command-112937334110531:9 res13: Array[(String, Int)] = Array((university,1), (call,1), (country,4), (promoted,1), (agree,1), (admitted,1), (accession,1), (exertion,1), (plenty,1), (have,4), (incident,1), (session,3), (national,3), (equal,2), (we,2), (intimating,1), (been,2), (who,2), (eminent,1), (any,1), (consideration,2), (immediately,1), (western,1), (speedy,1), (institution,1), (respect,2), (me,2), (discriminate,1), (frontiers,1), (free,3), (affairs,3), (are,4), (economy,1), (administration,1), (parts,2), (presents,1), (blessings,2), (expressive,1), (introduction,1), (comfortable,1), (our,10), (as,9), (intrusted,1), (circumstances,2), (respectability,1), (branch,1), (contributes,1), (better,1), (them,3), (independent,1), (proceeding,1), (received),1), (duty,2), (foreigners,1), (satisfactory,1), (is,10), (convey,1), (commonwealth,1), (favorable,1), (am,2), (certain,1), (january,1), (shall,2), (sense,1), (proper,3), (impressions,1), (disciplined,1), (they,2), (new,1), (my,2), (rising,1), (expedient,1), (uniting,1), (oppression,1), (now,1), (due,2), (has,3), (deserve,1), (licentiousness--,1), (safety,1), (degree,1), (persons,1), (8,1), (giving,1), (concur,1), (learning,1), (conducive,1), (according,1), (need,1), (manufactures,1), (render,3), (invasions,1), (honor,1), (fulfill,1), (directed,2), (basis,1), (southern,1), (conduct,1), (law,1), (well-digested,1), (exigencies,1), (senate,2), (objects,1), (truly,1), (cares,1), (knowledge,1), (concord,1), (foreign,1), (welfare,1), (consultations,1), (resolution,1), (means,2), (cherishing,1), (this,3), (convincing,1), (deemed,1), (right,1), (post-roads,1), (themselves,1), (general,2), (entirely,1), (explicit,1), (defense,1), (only,1), (washington,1), (established,1), (house,3), (first,1), (still,1), (opinion,2), (security,1), (exercise,1), (already,1), (particularly,2), (satisfaction,2), (essential,2), (patriotism,1), 
(realize,1), (rule,1), (cheerful,1), (nations,1), (measure,1), (congratulating,1), (hope,2), (can,3), (resuming,1), (peace,2), (encroachments,1), (relieved,1), (communication,1), (society,1), (into,1), (there,2), (science,1), (hostile,1), (representatives,3), (discern,1), (lay,1), (own,1), (reason,1), (high,1), (mature,1), (laid,1), (states,4), (compensation,1), (surest,1), (advancement,1), (trust,1), (respecting,1), (declaration,1), (one,2), (with,11), (uniformity,1), (obviously,1), (best,3), (importance,3), (interesting,1), (proportionably,1), (duly,1), (seminaries,1), (aggressors,1), (attention,2), (promote,1), (afford,2), (close,1), (depredations,1), (from,12), (other,3), (well,2), (interest,1), (further,1), (among,1), (virginia),1), (facilitated,1), (appointments,1), (affording,1), (allowed,1), (their,7), (adequate,1), (last,2), (expediency,1), (between,3), (indians,1), (will,14), (information,3), (useful,1), (valuable,1), ("",1), (confidence,2), (war,1), (provisions,1), (designated,1), (providing,1), (important,2), (uniform,2), (so,4), (vigilance,1), (devise,1), (blessed,1), (reliance,1), (agriculture,1), (it,7), (than,1), (others,1), (attended,1), (deeply,1), (troops,1), (fund,1), (desirable,1), (deliberations,1), (nothing,1), (embrace,1), (protection,1), (received,1), (such,4), (engage,1), (secure,1), (add,1), (recommended,1), (papers,2), (burthens,1), (naturalization,1), (end,4), (common,1), (preserving,1), (ways--by,1), (weights,1), (resulting,1), (place,1), (derive,2), (great,4), ((of,1), (expedients,1), (establishment,1), (task,1), (reflection,1), (less,1), (last--and,1), (inevitable,1), (lawful,1), (the,97), (sentiment,1), (speedily,1), (not,4), (enable,1), (manufactories,1), (most,2), (if,1), (considerations,1), (providence,1), (be,20), (home,1), (all,1), (punish,1), (though,1), (persuaded,2), (laws,1), (toward,1), (credit,2), (gentlemen,2), (efficient,1), (superfluous,1), (disregard,1), (rights,2), (regard,4), (but,5), (official,1), (deliberate,1), 
(skill,1), (contained,1), (itself,1), (increasing,1), (nor,1), (distinguish,1), (necessary,3), (on,3), (distant,1), (against,2), (perfect,1), (would,2), (before,2), (at,2), (object,3), (estimates,1), (united,4), (union,3), (fellow-citizens,1), (interests,2), (may,5), (government,4), (ascertained,1), (armed,1), (gracious,1), (requisite,1), (or,1), (insuring,1), (aids,1), (intercourse,2), (of,68), (fellow,1), (reach,1), (saw,1), (1790,1), (answered,1), (encouragement,2), (producing,1), (prosperity,2), (particular,1), (currency,1), (inducements,1), (auspicious,1), (arrangements,1), (difficulty,1), (pacific,1), (i,11), (opportunity,1), (patronage,1), (plan,1), (military,1), (north,1), (which,18), (cooperation,2), (also,1), (pleasing,1), (sanction,1), (genius,1), (for,7), (should,4), (tend,1), (whether,1), (character,1), (promotion,1), (inhabitants,1), (competent,1), (teaching,1), (worthy,1), (placed,1), (effectual,2), (present,2), (entitled,1), (your,9), (inventions,1), (terms,1), (require,2), (cool,1), (authority,1), (literature,1), (officers,2), (abroad,1), (people,3), (pleasure,1), (expect,1), (facilitating,1), (was,1), (community,1), (merit,1), (endeavors,1), (arduous,1), (exertions,1), (peculiar,1), (concerned,1), (by,11), (temperate,1), (efforts,1), (tribes,1), (inviolable,1), (value,1), (an,5), (soldiers,1), (supplies,1), (expectations,1), (defined,1), (recent,1), (provision,2), (conciliate,1), (happiness,1), (made,2), (constitution,2), (enlightened,1), (novelty,1), (adopted,1), (defraying,1), (carolina,1), (commerce,1), (george,1), (convenience,1), (ought,3), (in,20), (provide,1), (employed,1), (good,3), (those,3), (recommendation,1), (support,2), (manner,1), (public,5), (course,1), (receive,1), (entered,1), (ours,1), (within,1), (liberty,1), (prospects,1), (every,2), (matter,1), (you,11), (nature,1), (avoiding,1), (prepared,2), (legislature,2), (various,2), (that,15), ((comprehending,1), (a,21), (spirit,1), (many,1), (expenses,1), (state,2), (work,1), (to,56), 
(know,1), (wisdom,1), (post-office,1), (respectively,1), (perceive,1), (specify,1), (firmness,1), (and,40), (forbear,1), (constituents,1), (indispensable,1), (direct,1), (measures,4), (citizens,2))
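// Sort by count, largest first. Despite the name, collect() returns every (word, count) pair, not just the top 10.
val top10 = wordCount_sou17900108.sortBy(_._2, false).collect()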
top10: Array[(String, Int)] = Array((the,97), (of,68), (to,56), (and,40), (a,21), (be,20), (in,20), (which,18), (that,15), (will,14), (from,12), (with,11), (i,11), (by,11), (you,11), (our,10), (is,10), (as,9), (your,9), (their,7), (it,7), (for,7), (but,5), (may,5), (an,5), (public,5), (country,4), (have,4), (are,4), (states,4), (so,4), (such,4), (end,4), (great,4), (not,4), (regard,4), (united,4), (government,4), (should,4), (measures,4), (session,3), (national,3), (free,3), (affairs,3), (them,3), (proper,3), (has,3), (render,3), (this,3), (house,3), (can,3), (representatives,3), (best,3), (importance,3), (other,3), (between,3), (information,3), (necessary,3), (on,3), (object,3), (union,3), (people,3), (ought,3), (good,3), (those,3), (equal,2), (we,2), (been,2), (who,2), (consideration,2), (respect,2), (me,2), (parts,2), (blessings,2), (circumstances,2), (duty,2), (am,2), (shall,2), (they,2), (my,2), (due,2), (directed,2), (senate,2), (means,2), (general,2), (opinion,2), (particularly,2), (satisfaction,2), (essential,2), (hope,2), (peace,2), (there,2), (one,2), (attention,2), (afford,2), (well,2), (last,2), (confidence,2), (important,2), (uniform,2), (papers,2), (derive,2), (most,2), (persuaded,2), (credit,2), (gentlemen,2), (rights,2), (against,2), (would,2), (before,2), (at,2), (interests,2), (intercourse,2), (encouragement,2), (prosperity,2), (cooperation,2), (effectual,2), (present,2), (require,2), (officers,2), (provision,2), (made,2), (constitution,2), (support,2), (every,2), (prepared,2), (legislature,2), (various,2), (state,2), (citizens,2), (reliance,1), (agriculture,1), (than,1), (others,1), (attended,1), (deeply,1), (troops,1), (fund,1), (desirable,1), (deliberations,1), (nothing,1), (embrace,1), (protection,1), (received,1), (engage,1), (secure,1), (add,1), (recommended,1), (burthens,1), (naturalization,1), (common,1), (preserving,1), (ways--by,1), (weights,1), (resulting,1), (place,1), ((of,1), (expedients,1), (establishment,1), (task,1), 
(reflection,1), (less,1), (last--and,1), (inevitable,1), (lawful,1), (sentiment,1), (speedily,1), (enable,1), (manufactories,1), (if,1), (considerations,1), (providence,1), (home,1), (all,1), (punish,1), (though,1), (laws,1), (toward,1), (efficient,1), (superfluous,1), (disregard,1), (official,1), (deliberate,1), (skill,1), (contained,1), (itself,1), (increasing,1), (nor,1), (distinguish,1), (distant,1), (perfect,1), (estimates,1), (fellow-citizens,1), (ascertained,1), (armed,1), (gracious,1), (requisite,1), (or,1), (insuring,1), (aids,1), (fellow,1), (reach,1), (saw,1), (1790,1), (answered,1), (producing,1), (particular,1), (currency,1), (inducements,1), (auspicious,1), (arrangements,1), (difficulty,1), (pacific,1), (opportunity,1), (patronage,1), (plan,1), (military,1), (north,1), (also,1), (pleasing,1), (sanction,1), (genius,1), (tend,1), (whether,1), (character,1), (promotion,1), (inhabitants,1), (competent,1), (teaching,1), (worthy,1), (placed,1), (entitled,1), (inventions,1), (terms,1), (cool,1), (authority,1), (literature,1), (abroad,1), (pleasure,1), (expect,1), (facilitating,1), (was,1), (community,1), (merit,1), (endeavors,1), (arduous,1), (exertions,1), (peculiar,1), (concerned,1), (temperate,1), (efforts,1), (tribes,1), (inviolable,1), (value,1), (soldiers,1), (supplies,1), (expectations,1), (defined,1), (recent,1), (conciliate,1), (happiness,1), (enlightened,1), (novelty,1), (adopted,1), (defraying,1), (carolina,1), (commerce,1), (george,1), (convenience,1), (provide,1), (employed,1), (recommendation,1), (manner,1), (course,1), (receive,1), (entered,1), (ours,1), (within,1), (liberty,1), (prospects,1), (matter,1), (nature,1), (avoiding,1), ((comprehending,1), (spirit,1), (many,1), (expenses,1), (work,1), (know,1), (wisdom,1), (post-office,1), (respectively,1), (perceive,1), (specify,1), (firmness,1), (forbear,1), (constituents,1), (indispensable,1), (direct,1), (university,1), (call,1), (promoted,1), (agree,1), (admitted,1), (accession,1), 
(exertion,1), (plenty,1), (incident,1), (intimating,1), (eminent,1), (any,1), (immediately,1), (western,1), (speedy,1), (institution,1), (discriminate,1), (frontiers,1), (economy,1), (administration,1), (presents,1), (expressive,1), (introduction,1), (comfortable,1), (intrusted,1), (respectability,1), (branch,1), (contributes,1), (better,1), (independent,1), (proceeding,1), (received),1), (foreigners,1), (satisfactory,1), (convey,1), (commonwealth,1), (favorable,1), (certain,1), (january,1), (sense,1), (impressions,1), (disciplined,1), (new,1), (rising,1), (expedient,1), (uniting,1), (oppression,1), (now,1), (deserve,1), (licentiousness--,1), (safety,1), (degree,1), (persons,1), (8,1), (giving,1), (concur,1), (learning,1), (conducive,1), (according,1), (need,1), (manufactures,1), (invasions,1), (honor,1), (fulfill,1), (basis,1), (southern,1), (conduct,1), (law,1), (well-digested,1), (exigencies,1), (objects,1), (truly,1), (cares,1), (knowledge,1), (concord,1), (foreign,1), (welfare,1), (consultations,1), (resolution,1), (cherishing,1), (convincing,1), (deemed,1), (right,1), (post-roads,1), (themselves,1), (entirely,1), (explicit,1), (defense,1), (only,1), (washington,1), (established,1), (first,1), (still,1), (security,1), (exercise,1), (already,1), (patriotism,1), (realize,1), (rule,1), (cheerful,1), (nations,1), (measure,1), (congratulating,1), (resuming,1), (encroachments,1), (relieved,1), (communication,1), (society,1), (into,1), (science,1), (hostile,1), (discern,1), (lay,1), (own,1), (reason,1), (high,1), (mature,1), (laid,1), (compensation,1), (surest,1), (advancement,1), (trust,1), (respecting,1), (declaration,1), (uniformity,1), (obviously,1), (interesting,1), (proportionably,1), (duly,1), (seminaries,1), (aggressors,1), (promote,1), (close,1), (depredations,1), (interest,1), (further,1), (among,1), (virginia),1), (facilitated,1), (appointments,1), (affording,1), (allowed,1), (adequate,1), (expediency,1), (indians,1), (useful,1), (valuable,1), ("",1), 
(war,1), (provisions,1), (designated,1), (providing,1), (vigilance,1), (devise,1), (blessed,1))
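
The cell above sorts the pairs by count but then collects all of them, so `top10` holds the whole vocabulary. A minimal sketch of keeping only the first few pairs is given below. It uses plain Scala collections so that it runs without a cluster; `sampleCounts` and `topN` are hypothetical names introduced here for illustration, and the sample pairs are copied from the output above. On the RDD itself, the analogous step would presumably be to end the chain with `take(10)` instead of `collect()`.

```scala
// Hypothetical sample: a handful of (word, count) pairs copied from the output above.
val sampleCounts: Seq[(String, Int)] = Seq(
  ("country", 4), ("the", 97), ("of", 68), ("university", 1),
  ("to", 56), ("and", 40), ("a", 21), ("be", 20), ("which", 18),
  ("that", 15), ("will", 14), ("from", 12), ("call", 1)
)

// Hypothetical helper: sort by count, largest first, then keep only the first n pairs.
def topN(counts: Seq[(String, Int)], n: Int): Seq[(String, Int)] =
  counts.sortBy { case (_, count) => -count }.take(n)

val top3 = topN(sampleCounts, 3)
println(top3.mkString(", ")) // prints (the,97), (of,68), (to,56)
```

Taking a fixed number of pairs after sorting keeps the result small enough to read at a glance, which matters once the same count is run on a much longer address.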

Doing it all together for George Washington and Barack Obama