// 2. Range of numbers, note that Spark automatically names the column "id"
val range = spark.range(0, 10)

// In order to get a preview of the data in a DataFrame use "show()"
range.show(3)
+---+
| id|
+---+
| 0|
| 1|
| 2|
+---+
only showing top 3 rows
range: org.apache.spark.sql.Dataset[Long] = [id: bigint]
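Since range is an ordinary Dataset/DataFrame, the usual column operations apply straight away. For example (a minimal sketch, not part of the original run):

// sketch: derive a new column from "id" and preview it
range.withColumn("id_squared", $"id" * $"id").show(3)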
// Let's find out what tables are already available for loading
spark.catalog.listTables.show()
+--------------------+--------+-----------+---------+-----------+
| name|database|description|tableType|isTemporary|
+--------------------+--------+-----------+---------+-----------+
| adult| default| null| EXTERNAL| false|
| business_csv_csv| default| null| EXTERNAL| false|
| checkin_table| default| null| MANAGED| false|
| diamonds| default| null| EXTERNAL| false|
| inventory| default| null| MANAGED| false|
|item_merchant_cat...| default| null| MANAGED| false|
| items_left_csv| default| null| EXTERNAL| false|
| logistic_detail| default| null| MANAGED| false|
| merchant_ratings| default| null| MANAGED| false|
| order_data| default| null| MANAGED| false|
| order_ids_left_csv| default| null| EXTERNAL| false|
| repeat_csv| default| null| MANAGED| false|
| review_2019_csv| default| null| EXTERNAL| false|
|sample_logistic_t...| default| null| EXTERNAL| false|
| sentimentlex_csv| default| null| EXTERNAL| false|
| simple_range| default| null| MANAGED| false|
| social_media_usage| default| null| MANAGED| false|
| tip_json| default| null| EXTERNAL| false|
| tips_csv_csv| default| null| EXTERNAL| false|
| users_csv| default| null| EXTERNAL| false|
+--------------------+--------+-----------+---------+-----------+
// File location and type
// You may need to change the file name "social_media_usage-5dbee.csv" depending on your upload's name
//val file_location = "/FileStore/tables/social_media_usage-5dbee.csv"
val file_location = "/FileStore/tables/social_media_usage-5dbee.csv"
val file_type = "csv"

// CSV options
val infer_schema = "true"
val first_row_is_header = "true"
val delimiter = ","

// The applied options are for CSV files. For other file types, these will be ignored.
val socialMediaDF = spark.read.format(file_type)
  .option("inferSchema", infer_schema)
  .option("header", first_row_is_header)
  .option("sep", delimiter)
  .load(file_location)

socialMediaDF.show(10)
+------+----------+--------------------+-------------------+------+
|agency| platform| url| date|visits|
+------+----------+--------------------+-------------------+------+
| OEM| SMS| null|2012-02-17 00:00:00| 61652|
| OEM| SMS| null|2012-11-09 00:00:00| 44547|
| EDC| Flickr|http://www.flickr...|2012-05-09 00:00:00| null|
| NYCHA|Newsletter| null|2012-05-09 00:00:00| null|
| DHS| Twitter|www.twitter.com/n...|2012-06-13 00:00:00| 389|
| DHS| Twitter|www.twitter.com/n...|2012-08-02 00:00:00| 431|
| DOH| Android| Condom Finder|2011-08-08 00:00:00| 5026|
| DOT| Android| You The Man|2011-08-08 00:00:00| null|
| MOME| Android| MiNY Venor app|2011-08-08 00:00:00| 313|
| DOT|Broadcastr| null|2011-08-08 00:00:00| null|
+------+----------+--------------------+-------------------+------+
only showing top 10 rows
file_location: String = /FileStore/tables/social_media_usage-5dbee.csv
file_type: String = csv
infer_schema: String = true
first_row_is_header: String = true
delimiter: String = ,
socialMediaDF: org.apache.spark.sql.DataFrame = [agency: string, platform: string ... 3 more fields]
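In the catalog listing below, social_media_usage appears as a TEMPORARY table, so the DataFrame was presumably registered as a temp view in a cell not shown in this excerpt, roughly:

// assumed step: register the DataFrame so it can be queried by name
socialMediaDF.createOrReplaceTempView("social_media_usage")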
// Let's find out what tables are already available for loading
spark.catalog.listTables.show(100)
+--------------------+--------+-----------+---------+-----------+
| name|database|description|tableType|isTemporary|
+--------------------+--------+-----------+---------+-----------+
| adult| default| null| EXTERNAL| false|
| business_csv_csv| default| null| EXTERNAL| false|
| checkin_table| default| null| MANAGED| false|
| diamonds| default| null| EXTERNAL| false|
| inventory| default| null| MANAGED| false|
|item_merchant_cat...| default| null| MANAGED| false|
| items_left_csv| default| null| EXTERNAL| false|
| logistic_detail| default| null| MANAGED| false|
| merchant_ratings| default| null| MANAGED| false|
| order_data| default| null| MANAGED| false|
| order_ids_left_csv| default| null| EXTERNAL| false|
| repeat_csv| default| null| MANAGED| false|
| review_2019_csv| default| null| EXTERNAL| false|
|sample_logistic_t...| default| null| EXTERNAL| false|
| sentimentlex_csv| default| null| EXTERNAL| false|
| tip_json| default| null| EXTERNAL| false|
| tips_csv_csv| default| null| EXTERNAL| false|
| users_csv| default| null| EXTERNAL| false|
| anonym| null| null|TEMPORARY| true|
| hubots| null| null|TEMPORARY| true|
| percent| null| null|TEMPORARY| true|
| social_media_usage| null| null|TEMPORARY| true|
+--------------------+--------+-----------+---------+-----------+
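In the second listing below, social_media_usage also shows up as a MANAGED table in the default database, which suggests the DataFrame was additionally saved as a table between the two cells, along the lines of:

// assumed step: persist the DataFrame as a managed table
socialMediaDF.write.mode("overwrite").saveAsTable("social_media_usage")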
spark.catalog.listTables.show(100)
+--------------------+--------+-----------+---------+-----------+
| name|database|description|tableType|isTemporary|
+--------------------+--------+-----------+---------+-----------+
| adult| default| null| EXTERNAL| false|
| business_csv_csv| default| null| EXTERNAL| false|
| checkin_table| default| null| MANAGED| false|
| diamonds| default| null| EXTERNAL| false|
| inventory| default| null| MANAGED| false|
|item_merchant_cat...| default| null| MANAGED| false|
| items_left_csv| default| null| EXTERNAL| false|
| logistic_detail| default| null| MANAGED| false|
| merchant_ratings| default| null| MANAGED| false|
| order_data| default| null| MANAGED| false|
| order_ids_left_csv| default| null| EXTERNAL| false|
| repeat_csv| default| null| MANAGED| false|
| review_2019_csv| default| null| EXTERNAL| false|
|sample_logistic_t...| default| null| EXTERNAL| false|
| sentimentlex_csv| default| null| EXTERNAL| false|
| social_media_usage| default| null| MANAGED| false|
| tip_json| default| null| EXTERNAL| false|
| tips_csv_csv| default| null| EXTERNAL| false|
| users_csv| default| null| EXTERNAL| false|
| anonym| null| null|TEMPORARY| true|
| hubots| null| null|TEMPORARY| true|
| percent| null| null|TEMPORARY| true|
| social_media_usage| null| null|TEMPORARY| true|
+--------------------+--------+-----------+---------+-----------+
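The next cell uses a DataFrame named df that is not defined anywhere in this excerpt; presumably it points at the table above, for example:

// assumed definition of df used in the cells that follow
val df = spark.table("social_media_usage")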
// Ctrl+Enter
df.printSchema() // prints the schema of the DataFrame
df.show()        // shows the first n rows (default is 20)
root
|-- agency: string (nullable = true)
|-- platform: string (nullable = true)
|-- url: string (nullable = true)
|-- date: timestamp (nullable = true)
|-- visits: integer (nullable = true)
+----------+----------+--------------------+-------------------+------+
| agency| platform| url| date|visits|
+----------+----------+--------------------+-------------------+------+
| OEM| SMS| null|2012-02-17 00:00:00| 61652|
| OEM| SMS| null|2012-11-09 00:00:00| 44547|
| EDC| Flickr|http://www.flickr...|2012-05-09 00:00:00| null|
| NYCHA|Newsletter| null|2012-05-09 00:00:00| null|
| DHS| Twitter|www.twitter.com/n...|2012-06-13 00:00:00| 389|
| DHS| Twitter|www.twitter.com/n...|2012-08-02 00:00:00| 431|
| DOH| Android| Condom Finder|2011-08-08 00:00:00| 5026|
| DOT| Android| You The Man|2011-08-08 00:00:00| null|
| MOME| Android| MiNY Venor app|2011-08-08 00:00:00| 313|
| DOT|Broadcastr| null|2011-08-08 00:00:00| null|
| DPR|Broadcastr|http://beta.broad...|2011-08-08 00:00:00| null|
| ENDHT| Facebook|http://www.facebo...|2011-08-08 00:00:00| 3|
| VAC| Facebook|https://www.faceb...|2011-08-08 00:00:00| 36|
| PlaNYC| Facebook|http://www.facebo...|2011-08-08 00:00:00| 47|
| DFTA| Facebook|http://www.facebo...|2011-08-08 00:00:00| 90|
| energyNYC| Facebook|http://www.facebo...|2011-08-08 00:00:00| 105|
| MOIA| Facebook|http://www.facebo...|2011-08-08 00:00:00| 123|
|City Store| Facebook|http://www.facebo...|2011-08-08 00:00:00| 119|
| OCDV| Facebook|http://www.facebo...|2011-08-08 00:00:00| 148|
| HIA| Facebook|http://www.facebo...|2011-08-08 00:00:00| 197|
+----------+----------+--------------------+-------------------+------+
only showing top 20 rows
%py
# Ctrl+Enter to evaluate this Python cell; recall that '#' is the comment character in Python
# Using Python to query our "social_media_usage" table
pythonDF = spark.table("social_media_usage").select("platform").distinct()
pythonDF.show(3)
+--------+
|platform|
+--------+
| nyc.gov|
| Flickr|
| Vimeo|
+--------+
only showing top 3 rows
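For comparison, the same query can be written in Scala against the same registered table (a sketch, not part of the transcript):

val platformsDF = spark.table("social_media_usage").select("platform").distinct()
platformsDF.show(3)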
val sms = df.select($"agency", $"platform", $"visits").filter($"platform" === "SMS")
sms.show() // Ctrl+Enter
+------+--------+------+
|agency|platform|visits|
+------+--------+------+
| OEM| SMS| 61652|
| OEM| SMS| 44547|
| DOE| SMS| 382|
| NYCHA| SMS| null|
| OEM| SMS| 61652|
| DOE| SMS| 382|
| NYCHA| SMS| null|
| OEM| SMS| 61652|
| OEM| SMS| null|
| DOE| SMS| null|
| NYCHA| SMS| null|
| OEM| SMS| null|
| DOE| SMS| null|
| NYCHA| SMS| null|
| DOE| SMS| 382|
| NYCHA| SMS| null|
| OEM| SMS| 61652|
| DOE| SMS| 382|
| NYCHA| SMS| null|
| OEM| SMS| 61652|
+------+--------+------+
only showing top 20 rows
sms: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [agency: string, platform: string ... 1 more field]
// Ctrl+Enter
// Note that we are using "platform = 'SMS'" since it will be evaluated as actual SQL
val sms = df.select(df("agency"), df("platform"), df("visits")).filter("platform = 'SMS'")
sms.show(5)
+------+--------+------+
|agency|platform|visits|
+------+--------+------+
| OEM| SMS| 61652|
| OEM| SMS| 44547|
| DOE| SMS| 382|
| NYCHA| SMS| null|
| OEM| SMS| 61652|
+------+--------+------+
only showing top 5 rows
sms: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [agency: string, platform: string ... 1 more field]
// Ctrl+Enter to make fixedDF

// import the needed sql functions
import org.apache.spark.sql.functions.{coalesce, lit}

// make the fixedDF DataFrame
val fixedDF = df
  .select(
    $"agency",
    $"platform",
    $"url",
    $"date",
    coalesce($"visits", lit(0)).as("visits"))
  .filter($"platform" =!= "TOTAL")

fixedDF.printSchema() // print its schema

// and show the top 20 records fully
fixedDF.show(false) // the "false" argument does not truncate the rows, so you will not see something like "anot..."
root
|-- agency: string (nullable = true)
|-- platform: string (nullable = true)
|-- url: string (nullable = true)
|-- date: timestamp (nullable = true)
|-- visits: integer (nullable = false)
+----------+----------+---------------------------------------------------------------------------------------+-------------------+------+
|agency |platform |url |date |visits|
+----------+----------+---------------------------------------------------------------------------------------+-------------------+------+
|OEM |SMS |null |2012-02-17 00:00:00|61652 |
|OEM |SMS |null |2012-11-09 00:00:00|44547 |
|EDC |Flickr |http://www.flickr.com/nycedc |2012-05-09 00:00:00|0 |
|NYCHA |Newsletter|null |2012-05-09 00:00:00|0 |
|DHS |Twitter |www.twitter.com/nycdhs |2012-06-13 00:00:00|389 |
|DHS |Twitter |www.twitter.com/nycdhs |2012-08-02 00:00:00|431 |
|DOH |Android |Condom Finder |2011-08-08 00:00:00|5026 |
|DOT |Android |You The Man |2011-08-08 00:00:00|0 |
|MOME |Android |MiNY Venor app |2011-08-08 00:00:00|313 |
|DOT |Broadcastr|null |2011-08-08 00:00:00|0 |
|DPR |Broadcastr|http://beta.broadcastr.com/Echo.html?audioId=670026-4001 |2011-08-08 00:00:00|0 |
|ENDHT |Facebook |http://www.facebook.com/pages/NYC-Lets-End-Human-Trafficking/125730490795659?sk=wall |2011-08-08 00:00:00|3 |
|VAC |Facebook |https://www.facebook.com/pages/NYC-Voter-Assistance-Commission/110226709012110 |2011-08-08 00:00:00|36 |
|PlaNYC |Facebook |http://www.facebook.com/pages/New-York-NY/PlaNYC/160454173971169?ref=ts |2011-08-08 00:00:00|47 |
|DFTA |Facebook |http://www.facebook.com/pages/NYC-Department-for-the-Aging/109028655823590 |2011-08-08 00:00:00|90 |
|energyNYC |Facebook |http://www.facebook.com/EnergyNYC?sk=wall |2011-08-08 00:00:00|105 |
|MOIA |Facebook |http://www.facebook.com/ihwnyc |2011-08-08 00:00:00|123 |
|City Store|Facebook |http://www.facebook.com/citystorenyc |2011-08-08 00:00:00|119 |
|OCDV |Facebook |http://www.facebook.com/pages/NYC-Healthy-Relationship-Training-Academy/134637829901065|2011-08-08 00:00:00|148 |
|HIA |Facebook |http://www.facebook.com/pages/New-York-City-Health-Insurance-Link/145920551598 |2011-08-08 00:00:00|197 |
+----------+----------+---------------------------------------------------------------------------------------+-------------------+------+
only showing top 20 rows
import org.apache.spark.sql.functions.{coalesce, lit}
fixedDF: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [agency: string, platform: string ... 3 more fields]
// Ctrl+Enter to evaluate this UDF, which takes an input String called "value"
// and converts it to lower case if it begins with "http", returning null otherwise,
// so that invalid web URLs are effectively removed
val cleanUrl = udf((value: String) => if (value != null && value.startsWith("http")) value.toLowerCase() else null)
cleanUrl: org.apache.spark.sql.expressions.UserDefinedFunction = UserDefinedFunction(<function1>,StringType,Some(List(StringType)))
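The cell that builds cleanedDF is missing from this excerpt; it was presumably produced by applying cleanUrl to fixedDF, roughly like this (an assumption, not the original code):

// assumed definition: apply the cleanUrl UDF to the url column of fixedDF
val cleanedDF = fixedDF.withColumn("url", cleanUrl($"url"))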
// Shift+Enter
cleanedDF.filter($"url".isNull).show(5)
+------+----------+----+-------------------+------+
|agency| platform| url| date|visits|
+------+----------+----+-------------------+------+
| OEM| SMS|null|2012-02-17 00:00:00| 61652|
| OEM| SMS|null|2012-11-09 00:00:00| 44547|
| NYCHA|Newsletter|null|2012-05-09 00:00:00| 0|
| DHS| Twitter|null|2012-06-13 00:00:00| 389|
| DHS| Twitter|null|2012-08-02 00:00:00| 431|
+------+----------+----+-------------------+------+
only showing top 5 rows
// Ctrl+Enter
cleanedDF.filter($"url".isNotNull).show(5, false) // false in .show(5, false) shows the rows untruncated
+------+----------+------------------------------------------------------------------------------------+-------------------+------+
|agency|platform |url |date |visits|
+------+----------+------------------------------------------------------------------------------------+-------------------+------+
|EDC |Flickr |http://www.flickr.com/nycedc |2012-05-09 00:00:00|0 |
|DPR |Broadcastr|http://beta.broadcastr.com/echo.html?audioid=670026-4001 |2011-08-08 00:00:00|0 |
|ENDHT |Facebook |http://www.facebook.com/pages/nyc-lets-end-human-trafficking/125730490795659?sk=wall|2011-08-08 00:00:00|3 |
|VAC |Facebook |https://www.facebook.com/pages/nyc-voter-assistance-commission/110226709012110 |2011-08-08 00:00:00|36 |
|PlaNYC|Facebook |http://www.facebook.com/pages/new-york-ny/planyc/160454173971169?ref=ts |2011-08-08 00:00:00|47 |
+------+----------+------------------------------------------------------------------------------------+-------------------+------+
only showing top 5 rows
// Ctrl+Enter
// Import the Spark SQL function that gives us a unique id across all the records in this DataFrame
import org.apache.spark.sql.functions.monotonically_increasing_id

// Append an "id" column using the SQL function that creates unique ids across all records in the DataFrame
val agencies = cleanedDF.select(cleanedDF("agency"))
  .distinct()
  .withColumn("id", monotonically_increasing_id())
agencies.show(5)
+--------------------+-----------+
| agency| id|
+--------------------+-----------+
| PlaNYC|34359738368|
| HIA|34359738369|
|NYC Digital: exte...|34359738370|
| NYCGLOBAL|42949672960|
| nycgov|68719476736|
+--------------------+-----------+
only showing top 5 rows
import org.apache.spark.sql.functions.monotonically_increasing_id
agencies: org.apache.spark.sql.DataFrame = [agency: string, id: bigint]
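Note that monotonically_increasing_id guarantees unique, increasing ids but not consecutive ones, which is why the values above jump by billions. If dense ids were ever needed, one option (a sketch, not from the notebook, and it funnels the distinct agencies through a single partition) is a window-based row_number:

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.row_number

// sketch: consecutive ids via row_number over a global ordering
val denseAgencies = cleanedDF.select("agency").distinct()
  .withColumn("id", row_number().over(Window.orderBy("agency")))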
// Ctrl+Enter
// And join with the rest of the data; note how the join condition is specified
val anonym = cleanedDF.join(agencies, cleanedDF("agency") === agencies("agency"), "inner")
  .select("id", "platform", "url", "date", "visits")

// We also cache the DataFrame, since recomputing the join can be quite expensive
anonym.cache()

// Display the result
anonym.show(5)
+-------------+----------+--------------------+-------------------+------+
| id| platform| url| date|visits|
+-------------+----------+--------------------+-------------------+------+
|1580547964928| SMS| null|2012-02-17 00:00:00| 61652|
|1580547964928| SMS| null|2012-11-09 00:00:00| 44547|
| 412316860416| Flickr|http://www.flickr...|2012-05-09 00:00:00| 0|
|1649267441664|Newsletter| null|2012-05-09 00:00:00| 0|
|1529008357376| Twitter| null|2012-06-13 00:00:00| 389|
+-------------+----------+--------------------+-------------------+------+
only showing top 5 rows
anonym: org.apache.spark.sql.DataFrame = [id: bigint, platform: string ... 3 more fields]
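The anonym entry that shows up as a TEMPORARY table in the listing below (and that the %sql cell further down queries by name) was presumably registered in a step not shown here:

// assumed step: make anonym queryable by name
anonym.createOrReplaceTempView("anonym")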
spark.catalog.listTables().show() // look at the available tables
+--------------------+--------+-----------+---------+-----------+
| name|database|description|tableType|isTemporary|
+--------------------+--------+-----------+---------+-----------+
| adult| default| null| EXTERNAL| false|
| business_csv_csv| default| null| EXTERNAL| false|
| checkin_table| default| null| MANAGED| false|
| diamonds| default| null| EXTERNAL| false|
| inventory| default| null| MANAGED| false|
|item_merchant_cat...| default| null| MANAGED| false|
| items_left_csv| default| null| EXTERNAL| false|
| logistic_detail| default| null| MANAGED| false|
| merchant_ratings| default| null| MANAGED| false|
| order_data| default| null| MANAGED| false|
| order_ids_left_csv| default| null| EXTERNAL| false|
| repeat_csv| default| null| MANAGED| false|
| review_2019_csv| default| null| EXTERNAL| false|
|sample_logistic_t...| default| null| EXTERNAL| false|
| sentimentlex_csv| default| null| EXTERNAL| false|
| social_media_usage| default| null| MANAGED| false|
| tip_json| default| null| EXTERNAL| false|
| tips_csv_csv| default| null| EXTERNAL| false|
| users_csv| default| null| EXTERNAL| false|
| anonym| null| null|TEMPORARY| true|
+--------------------+--------+-----------+---------+-----------+
only showing top 20 rows
import org.apache.spark.sql.functions.{dayofmonth, month, row_number, sum}
import org.apache.spark.sql.expressions.Window

val coolDF = anonym.select($"id", $"platform", dayofmonth($"date").as("day"), month($"date").as("month"), $"visits")
  .groupBy($"id", $"platform", $"day", $"month").agg(sum("visits").as("visits"))

// Run window aggregation on visits per month and platform
val window = coolDF.select($"id", $"day", $"visits",
  sum($"visits").over(Window.partitionBy("platform", "month")).as("monthly_visits"))

// Create and register the percent table
val percent = window.select($"id", $"day", ($"visits" / $"monthly_visits").as("percent"))
percent.createOrReplaceTempView("percent")
import org.apache.spark.sql.functions.{dayofmonth, month, row_number, sum}
import org.apache.spark.sql.expressions.Window
coolDF: org.apache.spark.sql.DataFrame = [id: bigint, platform: string ... 3 more fields]
window: org.apache.spark.sql.DataFrame = [id: bigint, day: int ... 2 more fields]
percent: org.apache.spark.sql.DataFrame = [id: bigint, day: int ... 1 more field]
%sql
-- You could also just use plain SQL to write the query above; note that you might need to group by id and day as well.
with aggr as (
  select id,
         dayofmonth(date) as day,
         visits / sum(visits) over (partition by (platform, month(date))) as percent
  from anonym
)
select * from aggr where day = 2 and percent > 0.3
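The same query can also be issued from Scala via spark.sql against the registered temp views, for instance (a sketch, not part of the transcript):

spark.sql("select * from percent where day = 2 and percent > 0.3").show(5)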
// Define a case class that will be our schema for the DataFrame
case class Hubot(name: String, year: Int, manufacturer: String, version: Array[Int], details: Map[String, String])

// You could process a text file, for example, converting every row to a Hubot, but here we create the RDD manually
val rdd = sc.parallelize(
  Array(
    Hubot("Jerry", 2015, "LCorp", Array(1, 2, 3), Map("eat" -> "yes", "sleep" -> "yes", "drink" -> "yes")),
    Hubot("Mozart", 2010, "LCorp", Array(1, 2), Map("eat" -> "no", "sleep" -> "no", "drink" -> "no")),
    Hubot("Einstein", 2012, "LCorp", Array(1, 2, 3), Map("eat" -> "yes", "sleep" -> "yes", "drink" -> "no"))
  )
)
defined class Hubot
rdd: org.apache.spark.rdd.RDD[Hubot] = ParallelCollectionRDD[27514] at parallelize at command-112937334110413:5
// In order to convert an RDD into a DataFrame, call toDF()
// (outside a Databricks notebook you may first need "import spark.implicits._")
val hubots = rdd.toDF()

// Display the DataFrame; note how the array and map fields are displayed
hubots.printSchema()
hubots.show()
root
|-- name: string (nullable = true)
|-- year: integer (nullable = false)
|-- manufacturer: string (nullable = true)
|-- version: array (nullable = true)
| |-- element: integer (containsNull = false)
|-- details: map (nullable = true)
| |-- key: string
| |-- value: string (valueContainsNull = true)
+--------+----+------------+---------+--------------------+
| name|year|manufacturer| version| details|
+--------+----+------------+---------+--------------------+
| Jerry|2015| LCorp|[1, 2, 3]|[eat -> yes, slee...|
| Mozart|2010| LCorp| [1, 2]|[eat -> no, sleep...|
|Einstein|2012| LCorp|[1, 2, 3]|[eat -> yes, slee...|
+--------+----+------------+---------+--------------------+
hubots: org.apache.spark.sql.DataFrame = [name: string, year: int ... 3 more fields]
// You can query a complex type the same way you query any other column
// By the way, you can use the `sql` function to invoke Spark SQL and create a DataFrame
hubots.createOrReplaceTempView("hubots")
val onesThatEat = sqlContext.sql("select name, details.eat from hubots where details.eat = 'yes'")
onesThatEat.show()
+--------+---+
| name|eat|
+--------+---+
| Jerry|yes|
|Einstein|yes|
+--------+---+
onesThatEat: org.apache.spark.sql.DataFrame = [name: string, eat: string]
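The same lookup can be expressed with the DataFrame API by indexing into the map column with getItem (a sketch, not part of the original notebook):

hubots.select($"name", $"details".getItem("eat").as("eat"))
  .filter($"details".getItem("eat") === "yes")
  .show()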
import org.apache.spark.sql.types._
import org.apache.spark.sql.Row

// Let's say we have an RDD of String that we need to convert into a DataFrame with the schema "name", "year", and "manufacturer".
// As you can see, every record is space-separated.
val rdd = sc.parallelize(
  Array(
    "Jerry 2015 LCorp",
    "Mozart 2010 LCorp",
    "Einstein 2012 LCorp"
  )
)

// Create the schema as a StructType
val schema = StructType(
  StructField("name", StringType, false) ::
  StructField("year", IntegerType, false) ::
  StructField("manufacturer", StringType, false) :: Nil
)

// Prepare RDD[Row]
val rows = rdd.map { entry =>
  val arr = entry.split("\\s+")
  val name = arr(0)
  val year = arr(1).toInt
  val manufacturer = arr(2)
  Row(name, year, manufacturer)
}

// Create the DataFrame
val bots = sqlContext.createDataFrame(rows, schema)
bots.printSchema()
bots.show()
root
|-- name: string (nullable = false)
|-- year: integer (nullable = false)
|-- manufacturer: string (nullable = false)
+--------+----+------------+
| name|year|manufacturer|
+--------+----+------------+
| Jerry|2015| LCorp|
| Mozart|2010| LCorp|
|Einstein|2012| LCorp|
+--------+----+------------+
import org.apache.spark.sql.types._
rdd: org.apache.spark.rdd.RDD[String] = ParallelCollectionRDD[27519] at parallelize at command-112937334110417:5
schema: org.apache.spark.sql.types.StructType = StructType(StructField(name,StringType,false), StructField(year,IntegerType,false), StructField(manufacturer,StringType,false))
rows: org.apache.spark.rdd.RDD[org.apache.spark.sql.Row] = MapPartitionsRDD[27520] at map at command-112937334110417:22
bots: org.apache.spark.sql.DataFrame = [name: string, year: int ... 1 more field]
// We can start working with Datasets by using our "hubots" table
// To create a Dataset from a DataFrame, do this (assuming that the case class Hubot exists):
val ds = hubots.as[Hubot]
ds.show()
+--------+----+------------+---------+--------------------+
| name|year|manufacturer| version| details|
+--------+----+------------+---------+--------------------+
| Jerry|2015| LCorp|[1, 2, 3]|[eat -> yes, slee...|
| Mozart|2010| LCorp| [1, 2]|[eat -> no, sleep...|
|Einstein|2012| LCorp|[1, 2, 3]|[eat -> yes, slee...|
+--------+----+------------+---------+--------------------+
ds: org.apache.spark.sql.Dataset[Hubot] = [name: string, year: int ... 3 more fields]
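Unlike a DataFrame, a Dataset also gives you the typed API, so ordinary Scala lambdas work against the Hubot case class, for example (a sketch, not part of the original run):

// sketch: typed filter and map over Dataset[Hubot]
ds.filter(h => h.year >= 2012).map(h => h.name).show()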