007c_SparkSQLProgGuide_HW(Scala)


ScaDaMaLe Course site and book

This is an elaboration of the Apache Spark SQL programming guide (http://spark.apache.org/docs/latest/sql-programming-guide.html) by Ivan Sadikov and Raazesh Sainudiin.

Getting Started - Exercise

After having gone through the simple example dataset in the programming guide, let's try a slightly larger dataset next.

Let us first create a table of social media usage from NYC

See the Load Data section to create this social_media_usage table from raw data.

First, let's make sure this table is available to us. If you don't see social_media_usage as a named table in the output of the next cell, then we first need to ingest this dataset. Let's do that using Databricks' GUI for adding data, as done next.

// Let's find out what tables are already available for loading
spark.catalog.listTables.show(50)

NYC Social Media Usage Data

This dataset is from https://datahub.io/JohnSnowLabs/nyc-social-media-usage#readme

The Demographic Reports are produced by the Economic, Demographic and Statistical Research unit within the Countywide Service Integration and Planning Management (CSIPM) Division of the Fairfax County Department of Neighborhood and Community Services. Information produced by the Economic, Demographic and Statistical Research unit is used by every county department, board, authority and the Fairfax County Public Schools. In addition to the small area estimates and forecasts, state and federal data on Fairfax County are collected and summarized, and special studies and quantitative research are conducted by the unit.

We are going to fetch this data, with slightly simplified column names, from the following URL:

To turn the dataset into a registered table we will load it using the GUI as follows:

  • Download it to your local machine / laptop and then use the 'Data' button on the left to upload it (we will try this method now).
    • This will put your data in the FileStore in Databricks' distributed file system.

Overview

Below we will show you how to create and query a table or DataFrame that you uploaded to DBFS. DBFS is the Databricks File System (their distributed file system), which allows you to store data for querying inside Databricks. This notebook assumes that you have a file already inside of DBFS that you would like to read from.
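For instance, here is a minimal sketch (using the Databricks-only dbutils and display helpers) to list what is already in DBFS; the /FileStore/tables/ path is an assumption based on the default upload location:

// List files in DBFS's FileStore (the default upload location - an assumption)
display(dbutils.fs.ls("dbfs:/FileStore/tables/"))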

In other setups, you can have the data in S3 (say, in AWS) or in HDFS on your Hadoop cluster, etc.
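The same DataFrame reader works against such storage backends. A hedged sketch, with a hypothetical S3 bucket and HDFS namenode:

// Hypothetical paths - substitute your own bucket / namenode
val fromS3 = spark.read.option("header", "true").csv("s3a://my-bucket/social-media-usage.csv")
val fromHdfs = spark.read.option("header", "true").csv("hdfs://namenode:8020/data/social-media-usage.csv")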

Alternatively, you can use curl or wget to download it to the local file system in /databricks/driver and then load it into DBFS; after this, you can read it via the SparkSession into a DataFrame and register it as a Hive table, as sketched below.
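Here is a minimal sketch of that route, assuming the default Databricks driver layout; <CSV_URL> is a placeholder for the actual download URL:

// Sketch only: <CSV_URL> is a placeholder, not a real URL
import sys.process._
Seq("wget", "-P", "/databricks/driver/", "<CSV_URL>").!!

// copy from the driver's local file system into DBFS
dbutils.fs.cp("file:/databricks/driver/social-media-usage.csv",
              "dbfs:/FileStore/tables/social_media_usage.csv")

// read it via the SparkSession into a DataFrame and register it as a table
val smDF = spark.read.option("header", "true").option("inferSchema", "true")
  .csv("/FileStore/tables/social_media_usage.csv")
smDF.write.format("parquet").saveAsTable("social_media_usage")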

You can also get the data directly from here (but in this case you need to change the column names in the Databricks data upload GUI or programmatically, as sketched below, to follow this notebook):
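As a hedged sketch of the programmatic route: rename columns on the DataFrame after loading it. Here rawDF stands for the raw CSV loaded as a DataFrame, and the column names are illustrative, not the dataset's actual headers:

// Illustrative renames only - substitute the real column names
val renamedDF = rawDF
  .withColumnRenamed("Agency Name", "agency")
  .withColumnRenamed("Date Sampled", "date")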

Load Data

How to upload a CSV file and make a table in Databricks

Okay, so how did we actually make the table social_media_usage? Databricks allows us to upload/link external data and make it available as a registered SQL table. It involves several steps:

  1. Download the social-media-usage.csv file from the following URL to your laptop:
  2. Go to the Databricks cloud (where you log in to use Databricks notebooks) and open the Data tab on the left panel
  3. At the very top of the left sub-menu you will see the +Add Data button; click on it
  4. Choose Upload File as the data source, either by browsing or by drag-and-drop; File means any file (Parquet, Avro, CSV), but it works best with the CSV format
  5. Upload the social-media-usage.csv file you just downloaded to Databricks
  6. Note the path to the uploaded file; for example, in my case:

    File uploaded to /FileStore/tables/social_media_usage.csv

 
// File location and type
// You may need to change file_location (e.g. to "social_media_usage-5dbee.csv") to match the path reported on upload, e.g.:
// File uploaded to /FileStore/tables/social_media_usage.csv
val file_location = "/FileStore/tables/social_media_usage.csv"
val file_type = "csv"
 
// CSV options
val infer_schema = "true"
val first_row_is_header = "true"
val delimiter = ","
 
// The applied options are for CSV files. For other file types, these will be ignored.
val socialMediaDF = spark.read.format(file_type) 
  .option("inferSchema", infer_schema) 
  .option("header", first_row_is_header) 
  .option("sep", delimiter) 
  .load(file_location)
 
socialMediaDF.show(10)
// Let's create a view or table
 
val temp_table_name = "social_media_usage"
 
socialMediaDF.createOrReplaceTempView(temp_table_name)
// Let's find out what tables are already available for loading
spark.catalog.listTables.show(100)

With this registered as a temporary view, social_media_usage will only be available to this particular notebook.
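For example, you can query the temporary view with SQL from this notebook:

spark.sql("SELECT * FROM social_media_usage LIMIT 5").show()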

If you'd like other users to be able to query this table (in the Databricks professional shard - not the free Community Edition - or in a managed on-premise cluster), you can also create a permanent table from the DataFrame.

Once saved, this table will persist across cluster restarts as well as allow various users across different notebooks to query this data. To do so, choose your table name and use saveAsTable as done in the next cell.

val permanent_table_name = "social_media_usage"
socialMediaDF.write.format("parquet").saveAsTable(permanent_table_name)
// Let's find out what tables are already available for loading
// spark.catalog.listTables.show(100)

If you uncomment and run the last line in the previous cell, you should see that social_media_usage is now available as a permanent table (isTemporary set to false), backed by Parquet files (we will see more about Parquet in the sequel).

Next let us do the following:

  • load this table as a DataFrame (yes, the DataFrame already exists as socialMediaDF, but we want to make a new DataFrame directly from the table),
  • print its schema, and
  • show the first 20 rows.
spark.catalog.listTables.show(100)
val df = spark.table("social_media_usage") // Ctrl+Enter
df.printSchema() // print the schema of the table
df.show(20)      // show the first 20 rows