%md
This is an elaboration of the [Apache Spark SQL programming guide](http://spark.apache.org/docs/latest/sql-programming-guide.html) by Ivan Sadikov and Raazesh Sainudiin.
# Getting Started - Exercise
After having gone through the simple example dataset in the programming guide, let's try a slightly larger dataset next.
%md
Let us first create a table of social media usage from NYC
> See the **Load Data** section to create this `social_media_usage` table from raw data.
First let's make sure this table is available to us. If you don't see `social_media_usage` as a `name`d table in the output of the next cell, then we first need to ingest this dataset. Let's do that using the Databricks GUI for adding `Data`, as described next.
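A quick way to check whether the table already exists is to list the tables in the catalog from a code cell. A minimal sketch (run against the notebook's active `spark` session):

```scala
// List tables visible in the current database; we expect to see
// social_media_usage among the `name` column values once it is ingested.
spark.catalog.listTables.show(truncate = false)
```

Equivalently, `SHOW TABLES` in a `%sql` cell produces the same listing.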
%md
## NYC Social Media Usage Data
This dataset is from [https://datahub.io/JohnSnowLabs/nyc-social-media-usage#readme](https://datahub.io/JohnSnowLabs/nyc-social-media-usage#readme)
The Demographic Reports are produced by the Economic, Demographic and Statistical Research unit within the Countywide Service Integration and Planning Management (CSIPM) Division of the Fairfax County Department of Neighborhood and Community Services. Information produced by the Economic, Demographic and Statistical Research unit is used by every county department, board, authority and the Fairfax County Public Schools. In addition to the small area estimates and forecasts, state and federal data on Fairfax County are collected and summarized, and special studies and Quantitative research are conducted by the unit.
We are going to fetch this data, with slightly simplified column names, from the following URL:
- http://lamastex.org/datasets/public/NYCUSA/social-media-usage.csv
To turn the dataset into a registered table we will load it using the GUI as follows:
- Download it to your local machine / laptop and then use the 'Data' button on the left to upload it (we will try this method now).
- This will put your data in the `Filestore` in databricks' distributed file system.
### Overview
Below we will show you how to create and query a table or DataFrame that you uploaded to DBFS. [DBFS](https://docs.databricks.com/user-guide/dbfs-databricks-file-system.html) is a Databricks File System (their distributed file system) that allows you to store data for querying inside of Databricks. This notebook assumes that you have a file already inside of DBFS that you would like to read from.
In other setups, you can have the data in s3 (say in AWS) or in hdfs in your hadoop cluster, etc.
Alternatively, you can use `curl` or `wget` to download it to the local file system in `/databricks/driver` and then load it into `dbfs`; after this you can read it via the `spark` session into a DataFrame and register it as a Hive table.
You can also get the data directly from here (but in this case you need to change the column names in the databricks Data upload GUI or programmatically to follow this notebook):
- http://datahub.io/JohnSnowLabs/nyc-social-media-usage
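The `curl`/`wget` alternative mentioned above can be sketched as follows (the exact DBFS destination path is our choice and may differ in your workspace):

```scala
// First, in a %sh cell, download the file to the driver's local disk:
//   wget http://lamastex.org/datasets/public/NYCUSA/social-media-usage.csv -P /databricks/driver/

// Then copy it from the driver's local file system into DBFS:
dbutils.fs.cp(
  "file:/databricks/driver/social-media-usage.csv",
  "dbfs:/FileStore/tables/social-media-usage.csv"
)
```

Once the file is in DBFS it can be read with `spark.read`, exactly as in the GUI-upload route below.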
%md
## Load Data
### How to upload a CSV file and make a table in Databricks
Okay, so how did we actually make the table `social_media_usage`? Databricks allows us to upload/link external data and make it available as a registered SQL table. It involves several steps:
1. Download the `social-media-usage.csv` file from the following URL to your laptop:
   - http://lamastex.org/datasets/public/NYCUSA/social-media-usage.csv
2. Go to Databricks cloud (where you log in to use Databricks notebooks) and open the **Data** tab on the left panel
3. At the very top of the left sub-menu you will see the button **+Add Data**; click on it
4. Choose **Upload File** as the Data Source, via **Browse** or **Drag and Drop**, where **File** means any file (Parquet, Avro, CSV), but it works best with the CSV format
5. Upload the `social-media-usage.csv` file you just downloaded to Databricks
6. Note the path to the uploaded file, for example in my case:
> File uploaded to `/FileStore/tables/social_media_usage.csv`
// File location and type
// You may need to change file_location (e.g. to "/FileStore/tables/social_media_usage-5dbee.csv")
// depending on the path reported when your file was uploaded, e.g.:
// File uploaded to /FileStore/tables/social_media_usage.csv
val file_location = "/FileStore/tables/social_media_usage.csv"
val file_type = "csv"
// CSV options
val infer_schema = "true"
val first_row_is_header = "true"
val delimiter = ","
// The applied options are for CSV files. For other file types, these will be ignored.
val socialMediaDF = spark.read.format(file_type)
.option("inferSchema", infer_schema)
.option("header", first_row_is_header)
.option("sep", delimiter)
.load(file_location)
socialMediaDF.show(10)
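The discussion below assumes the DataFrame has been registered as a temporary view named `social_media_usage`. A minimal sketch of that registration:

```scala
// Register the DataFrame as a temporary view so it can be queried with SQL
// from this notebook (e.g. SELECT * FROM social_media_usage LIMIT 10).
socialMediaDF.createOrReplaceTempView("social_media_usage")
```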
%md
With this registered as a temporary view, `social_media_usage` will only be available to this particular notebook.
If you'd like other users to be able to query this table (in the Databricks professional shard, not the free Community Edition, or in a managed on-premise cluster), you can also create a table from the DataFrame.
Once saved, this table will persist across cluster restarts as well as allow various users across different notebooks to query this data.
To do so, choose your table name and use `saveAsTable` as done in the next cell.
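That cell is not reproduced in this extract; a plausible sketch, with the permanent write left commented out as the text below suggests, is:

```scala
// The temporary view is notebook-local; to share the data across notebooks
// and cluster restarts, uncomment the write below and choose a table name
// (here we reuse social_media_usage, an assumption on our part):
// socialMediaDF.write.mode("overwrite").saveAsTable("social_media_usage")
```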
%md
It looks like the table `social_media_usage` is available as a permanent table (`isTemporary` set as `false`), if you have not uncommented the last line in the previous cell (otherwise it will be available from a parquet file as a permanent table - we will see more about parquet in the sequel).
Next let us do the following:
* load this table as a DataFrame (yes, the dataframe already exists as `socialMediaDF`, but we want to make a new DataFrame directly from the table)
* print its schema and
* show the first 20 rows.
ScaDaMaLe Course site and book