007d_SparkSQLProgGuide_HW(Scala)


ScaDaMaLe Course site and book

This is an elaboration of the Apache Spark SQL programming guide at http://spark.apache.org/docs/latest/sql-programming-guide.html by Ivan Sadikov and Raazesh Sainudiin.

Data Sources

Spark SQL Programming Guide

  • Data Sources
    • Generic Load/Save Functions
      • Manually Specifying Options
      • Run SQL on files directly
      • Save Modes
      • Saving to Persistent Tables
    • Parquet Files
      • Loading Data Programmatically
      • Partition Discovery
      • Schema Merging
      • Hive metastore Parquet table conversion
        • Hive/Parquet Schema Reconciliation
        • Metadata Refreshing
      • Configuration
    • JSON Datasets
    • Hive Tables
      • Interacting with Different Versions of Hive Metastore
    • JDBC To Other Databases
    • Troubleshooting

Data Sources

Spark SQL supports operating on a variety of data sources through the DataFrame and Dataset interfaces. A Dataset can be operated on much like a normal RDD and can also be registered as a temporary table. Registering a Dataset as a table allows you to run SQL queries over its data. From time to time you will also need to load or save a Dataset. Spark SQL provides built-in data sources as well as a Data Source API that lets you define your own data source and use it to read and write data in Spark.
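
For example, registering a Dataset as a temporary view and then querying it with SQL looks like the following minimal sketch (the view name platforms_tmp is an illustrative placeholder):

// Register a Dataset as a temporary view and query it with SQL
val ds = spark.range(0, 10)
ds.createOrReplaceTempView("platforms_tmp") // hypothetical view name
spark.sql("SELECT id FROM platforms_tmp WHERE id > 5").show()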

Overview

Spark provides several built-in data sources that you can use straight out of the box: Parquet, JSON, JDBC, ORC (currently available with Hive support enabled; this is changing, and ORC will work with the default Spark session without Hive support starting from the next release), Text (since Spark 1.6), and CSV (since Spark 2.0; before that it was available as a separate package).
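
As a quick sketch (the paths below are placeholders and assume matching files exist), the Text and CSV sources are reached through the same spark.read entry point used for Parquet and JSON later in this notebook:

// Hypothetical paths -- point these at files that exist in your workspace
val csvDF  = spark.read.option("header", "true").csv("/tmp/example.csv") // CSV, built-in since Spark 2.0
val textDF = spark.read.text("/tmp/example.txt")                         // Text, since Spark 1.6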

Third-party datasource packages

The community has also built quite a few data source packages that provide easy access to data in other formats. You can find a list of these packages on http://spark-packages.org/, e.g. Avro, CSV, Amazon Redshift (for Spark < 2.0), XML, NetFlow, and many others. A sketch of using such a package follows below.
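
For example, if the spark-avro package is attached to the cluster, it can be used through the generic format API. This is only a sketch; the path is a hypothetical placeholder:

// Assumes the third-party spark-avro package is attached to the cluster
val avroDF = spark.read
  .format("com.databricks.spark.avro") // fully qualified data source name provided by the package
  .load("/tmp/example.avro")           // hypothetical path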

Generic Load/Save Functions

In order to load or save a DataFrame you call either read or write. This returns a DataFrameReader or a DataFrameWriter, depending on what you are trying to achieve. Essentially these classes are the entry points for reading and writing: they allow you to specify the write mode or provide additional options to the data source.

// This will return DataFrameReader to read data source
println(spark.read)
 
val df = spark.range(0, 10)
 
// This will return DataFrameWriter to save DataFrame
println(df.write)
// Saving Parquet table in Scala
// DataFrames and tables can be saved as Parquet files, maintaining the schema information
val df_save = spark.table("social_media_usage").select("platform", "visits") // assuming you made the social_media_usage table permanent in previous notebook
df_save.write.mode("overwrite").parquet("/tmp/platforms.parquet")
 
// Read in the parquet file created above
// Parquet files are self-describing so the schema is preserved
// The result of loading a Parquet file is also a DataFrame
val df = spark.read.parquet("/tmp/platforms.parquet")
df.show(5)
// in databricks '/tmp/...' is the same as 'dbfs:///tmp/...'
display(dbutils.fs.ls("/tmp/"))
display(dbutils.fs.ls("/tmp/platforms.parquet/")) // note this is a directory with many files in it... files beginning with part have content in possibly many partitions
%py
# Loading Parquet table in Python
dfPy = spark.read.parquet("/tmp/platforms.parquet")
dfPy.show(5)
// Saving JSON dataset in Scala
val df_save = spark.table("social_media_usage").select("platform", "visits")
df_save.write.mode("overwrite").json("/tmp/platforms.json")
 
// Loading JSON dataset in Scala
val df = spark.read.json("/tmp/platforms.json")
df.show(5)
%py
# Loading JSON dataset in Python
dfPy = spark.read.json("/tmp/platforms.json")
dfPy.show(5)

Manually Specifying Options

You can also manually specify the data source that will be used, along with any extra options that you would like to pass to it. Data sources are specified by their fully qualified name (i.e., org.apache.spark.sql.parquet), but for built-in sources you can also use their short names (json, parquet, jdbc). DataFrames loaded from any data source type can be converted into other types using this syntax.

val json = spark.read.format("json").load("/tmp/platforms.json")
json.select("platform").show(10)
 
val parquet = spark.read.format("parquet").load("/tmp/platforms.parquet")
parquet.select("platform").show(10)
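
The same generic API also accepts extra options. For example, reading a CSV file with a header row and schema inference looks like the following sketch (the path is a hypothetical placeholder):

// Hypothetical CSV path -- any CSV file with a header row will do
val csv = spark.read
  .format("csv")
  .option("header", "true")      // first line holds the column names
  .option("inferSchema", "true") // infer column types by scanning the data
  .load("/tmp/example.csv")
csv.printSchema()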

Run SQL on files directly

Instead of using the read API to load a file into a DataFrame and then querying it, you can also query the file directly with SQL.
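
For example, using the Parquet file written earlier in this notebook (note the format name followed by the path in backticks):

// Query a file directly with SQL, without loading it into a DataFrame first
val sqlDF = spark.sql("SELECT * FROM parquet.`/tmp/platforms.parquet`")
sqlDF.show(5)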