// Databricks notebook source exported at Sat, 18 Jun 2016 05:08:17 UTC
Scalable Data Science
prepared by Raazesh Sainudiin and Sivanand Sivaram
The html source URL of this databricks notebook and its recorded Uji in context:
Extract, Transform and Load (ETL) of the SoU Addresses
A bit of bash and lynx to achieve the scraping of the state of the union addresses of the US Presidents
by Paul Brouwers
And some Shell-level parsed-data exploration, injection into the distributed file system and testing
by Raazesh Sainudiin
This SoU dataset is used in the following notebooks:
The code below is mainly there to show how the text content of each state of the union address was scraped from http://stateoftheunion.onetwothree.net/texts/index.html.
Such a data acquisition or ETL task is usually the first and crucial step in a data scientist's workflow.
A data scientist generally does the scraping and parsing of the data herself or himself.
Data ingestion not only allows the scientist to start the analysis but also determines the quality of the analysis by the limits it imposes on the accessible feature space.
We have already done this and put the data in the distributed file system for easy loading into our notebooks for further analysis (a minimal loading sketch follows). This also keeps us from having to install unix programs like lynx, sed, etc. that are needed in the shell script further below.
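For instance, once the addresses are sitting in DBFS under dbfs:/datasets/sou (the copy into DBFS is done further down in this notebook), any of them can be read straight into Spark. A minimal loading sketch, assuming that copy has already been made:
// read one address from DBFS as an RDD of lines
// (the path assumes the copy into dbfs:/datasets/sou done later in this notebook)
val oneAddress = sc.textFile("dbfs:/datasets/sou/17900108.txt")
oneAddress.take(5) // peek at the first few lines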
for i in $(lynx --dump http://stateoftheunion.onetwothree.net/texts/index.html | grep texts | grep -v index | sed 's/.*http/http/') ; do lynx --dump $i | tail -n+13 | head -n-14 | sed 's/^\s\+//' | sed -e ':a;N;$!ba;s/\(.\)\n/\1 /g' -e 's/\n/\n\n/' > $(echo $i | sed 's/.*\([0-9]\{8\}\).*/\1/').txt ; done
Or in a more atomic form:
# loop over the url of every address linked from the index page
for i in $(lynx --dump http://stateoftheunion.onetwothree.net/texts/index.html \
  | grep texts \
  | grep -v index \
  | sed 's/.*http/http/')
do
  # dump each page as plain text, strip the site's header and footer lines,
  # remove leading whitespace, re-flow the lines into paragraphs,
  # and save the result as YYYYMMDD.txt using the 8-digit date in the url
  lynx --dump $i \
    | tail -n+13 \
    | head -n-14 \
    | sed 's/^\s\+//' \
    | sed -e ':a;N;$!ba;s/\(.\)\n/\1 /g' -e 's/\n/\n\n/' \
    > $(echo $i | sed 's/.*\([0-9]\{8\}\).*/\1/').txt
done
Don’t re-evaluate!
The following BASH (shell) script can be made to work on the databricks cloud directly by installing the dependencies, such as lynx (a sketch of installing it follows the cell below). Since we have already scraped the addresses and put the data in our distributed file system, let's not evaluate (<Ctrl+Enter>) the cell below. The cell is mainly there to show how it can be done (you may want to modify it to scrape other sites for other text data).
%sh
#remove the hash character from the line below to evaluate when needed
#for i in $(lynx --dump http://stateoftheunion.onetwothree.net/texts/index.html | grep texts | grep -v index | sed 's/.*http/http/') ; do lynx --dump $i | tail -n+13 | head -n-14 | sed 's/^\s\+//' | sed -e ':a;N;$!ba;s/\(.\)\n/\1 /g' -e 's/\n/\n\n/' > $(echo $i | sed 's/.*\([0-9]\{8\}\).*/\1/').txt ; done
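If you did want to evaluate the scraping cell above, lynx would first have to be installed on the driver. A minimal sketch, assuming an Ubuntu-based driver where apt-get is available (package names and required permissions may differ on your cluster):
%sh
# install the text-mode browser used by the scraper (assumes apt-get and root/sudo rights)
apt-get update && apt-get install -y lynx
lynx --version   # sanity check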
%sh
pwd && ls && du -sh .
%sh ls /home/ubuntu && du -sh /home/ubuntu
We can just grab the data as a tarball (gzipped tar archive) file sou.tar.gz using wget as follows:
%sh wget http://www.math.canterbury.ac.nz/~r.sainudiin/datasets/public/SOU/sou.tar.gz
%sh
df -h
pwd
%sh
ls
%sh
env
%sh
tar zxvf sou.tar.gz
%sh cd sou && ls
%sh head sou/17900108.txt
%sh tail sou/17900108.txt
%sh head sou/20150120.txt
%sh tail sou/20150120.txt
display(dbutils.fs.ls("dbfs:/"))
display(dbutils.fs.ls("dbfs:/datasets"))
dbutils.fs.mkdirs("dbfs:/datasets/sou") //need not be done again!
display(dbutils.fs.ls("dbfs:/datasets"))
%sh pwd && ls
dbutils.fs.help
dbutils.fs.cp("file:/databricks/driver/sou", "dbfs:/datasets/sou/",recurse=true)
display(dbutils.fs.ls("dbfs:/datasets/sou"))
display(dbutils.fs.ls("dbfs:/datasets/"))
val sou17900108 = sc.textFile("dbfs:/datasets/sou/17900108.txt") // RDD of the lines of the 1790-01-08 address
sou17900108.take(5)
sou17900108.collect
sou17900108.takeOrdered(5)
val souAll = sc.wholeTextFiles("dbfs:/datasets/sou/*.txt") // RDD of (file path, full text) pairs, one per address
souAll.count
souAll.take(2)