ScaDaMaLe Course site and book

This and the next sequence of notebooks are an elaboration of the Apache Spark SQL Programming Guide (http://spark.apache.org/docs/latest/sql-programming-guide.html) by Ivan Sadikov and Raazesh Sainudiin.

Spark SQL Programming Guide

  • Overview
    • SQL
    • DataFrames
    • Datasets
  • Getting Started
    • Starting Point: SQLContext
    • Creating DataFrames
    • DataFrame Operations
    • Running SQL Queries Programmatically
    • Creating Datasets
    • Interoperating with RDDs
      • Inferring the Schema Using Reflection
      • Programmatically Specifying the Schema
  • Data Sources
    • Generic Load/Save Functions
      • Manually Specifying Options
      • Run SQL on files directly
      • Save Modes
      • Saving to Persistent Tables
    • Parquet Files
      • Loading Data Programmatically
      • Partition Discovery
      • Schema Merging
      • Hive metastore Parquet table conversion
        • Hive/Parquet Schema Reconciliation
        • Metadata Refreshing
      • Configuration
    • JSON Datasets
    • Hive Tables
      • Interacting with Different Versions of Hive Metastore
    • JDBC To Other Databases
    • Troubleshooting
  • Performance Tuning
    • Caching Data In Memory
    • Other Configuration Options
  • Distributed SQL Engine
    • Running the Thrift JDBC/ODBC server
    • Running the Spark SQL CLI
  • SQL Reference

What could one do with these notebooks?

One could read the Spark SQL Programming Guide, embedded below and also linked above, while working through the cells and doing the YouTrys in the following notebooks.

Why might one do it?

This homework/self-study will help you solve the assigned lab and theory exercises in the sequel much faster, by introducing you to some of the basic knowledge you need about Spark SQL.

NOTE on intra-iframe html navigation within a notebook:

  • When navigating the html page embedded as an iframe, as in the cell below, you can:
    • click on a link in the displayed html page to see the content of the clicked link, and
    • then right-click on the page and use the back (<-) and forward (->) arrows to return to the previous page or move forward again.
// This allows easy embedding of publicly available information into any other notebook.
// Example usage:
// displayHTML(frameIt("https://en.wikipedia.org/wiki/Latent_Dirichlet_allocation#Topics_in_LDA", 250))

// Wrap a URL in a sandboxed iframe of the given height so that displayHTML can render it.
def frameIt(u: String, h: Int): String = {
  s"""<iframe
 src="$u"
 width="95%" height="$h"
 sandbox>
  <p>
    <a href="http://spark.apache.org/docs/latest/index.html">
      Fallback link for browsers that don't support frames
    </a>
  </p>
</iframe>"""
}
displayHTML(frameIt("https://spark.apache.org/docs/latest/sql-programming-guide.html",750))

Let's go through the programming guide in Databricks now.

This is an elaboration of the Apache Spark SQL Programming Guide (http://spark.apache.org/docs/latest/sql-programming-guide.html) by Ivan Sadikov and Raazesh Sainudiin.

Spark SQL, DataFrames and Datasets Guide

Spark SQL is a Spark module for structured data processing. Unlike the basic Spark RDD API, the interfaces provided by Spark SQL provide Spark with more information about the structure of both the data and the computation being performed. Internally, Spark SQL uses this extra information to perform extra optimizations. There are several ways to interact with Spark SQL including SQL and the Dataset API. When computing a result, the same execution engine is used, independent of which API/language you are using to express the computation. This unification means that developers can easily switch back and forth between different APIs based on which provides the most natural way to express a given transformation.
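To make the unification concrete, here is a minimal sketch (assuming a SparkSession bound to the name spark, as in Databricks or spark-shell) of the same computation expressed once in SQL and once with the DataFrame API; both compile to the same logical plan and run on the same engine.

// A sketch, assuming a SparkSession named spark (as in Databricks or spark-shell):
// the same computation, written once in SQL and once with the DataFrame API,
// is optimized and executed by the same engine.
import org.apache.spark.sql.functions.col

val df = spark.range(1, 6).toDF("n")   // tiny DataFrame with column n = 1..5
df.createOrReplaceTempView("numbers")  // make it visible to SQL

// via SQL
val viaSql = spark.sql("SELECT n, n * n AS square FROM numbers WHERE n % 2 = 1")

// via the DataFrame API
val viaApi = df.filter(col("n") % 2 === 1)
               .select(col("n"), (col("n") * col("n")).as("square"))

viaSql.show()
viaApi.show()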

All of the examples on this page use sample data included in the Spark distribution and can be run in the spark-shell, pyspark shell, or sparkR shell.

SQL

One use of Spark SQL is to execute SQL queries. Spark SQL can also be used to read data from an existing Hive installation. For more on how to configure this feature, please refer to the Hive Tables section. When running SQL from within another programming language the results will be returned as a Dataset/DataFrame. You can also interact with the SQL interface using the command-line or over JDBC/ODBC.
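As a small illustration (reusing the numbers view from the sketch above), the result of spark.sql(...) is an ordinary DataFrame, so SQL results compose directly with further DataFrame operations:

// The result of spark.sql is a DataFrame, so it composes with DataFrame operations.
val odds = spark.sql("SELECT n FROM numbers WHERE n % 2 = 1")
odds.filter("n > 1").show()   // refine the SQL result with the DataFrame API
println(odds.count())         // and run ordinary DataFrame actions on it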

Datasets and DataFrames

A Dataset is a distributed collection of data. Dataset is a new interface added in Spark 1.6 that provides the benefits of RDDs (strong typing, ability to use powerful lambda functions) with the benefits of Spark SQL’s optimized execution engine. A Dataset can be constructed from JVM objects and then manipulated using functional transformations (map, flatMap, filter, etc.). The Dataset API is available in Scala and Java. Python does not have support for the Dataset API, but due to Python’s dynamic nature, many of the benefits of the Dataset API are already available (i.e., you can access the field of a row by name naturally, as row.columnName). The case for R is similar.
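A minimal Scala sketch of the typed Dataset API (assuming a SparkSession named spark; the Person case class is just an illustration):

// Datasets are built from JVM objects and manipulated with typed transformations.
import spark.implicits._   // encoders that make .toDS() available

case class Person(name: String, age: Int)

val people = Seq(Person("Ann", 31), Person("Bo", 17), Person("Cy", 45)).toDS()

// map and filter receive typed Person objects, not untyped Rows
val adultNames = people.filter(_.age >= 18).map(_.name.toUpperCase)
adultNames.show()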

A DataFrame is a Dataset organized into named columns. It is conceptually equivalent to a table in a relational database or a data frame in R/Python, but with richer optimizations under the hood. DataFrames can be constructed from a wide array of sources, such as structured data files, tables in Hive, external databases, or existing RDDs. The DataFrame API is available in Scala, Java, Python, and R. In Scala and Java, a DataFrame is represented by a Dataset of Rows. In the Scala API, DataFrame is simply a type alias of Dataset[Row], while in the Java API users need to use Dataset<Row> to represent a DataFrame.
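In Scala the alias is directly visible in the types; a short sketch, continuing with the people Dataset from the previous cell:

// In Scala, DataFrame is a type alias of Dataset[Row], so the two interchange freely.
import org.apache.spark.sql.{DataFrame, Dataset, Row}

val peopleDf: DataFrame = people.toDF()  // Person objects become untyped Rows
val sameThing: Dataset[Row] = peopleDf   // compiles: DataFrame = Dataset[Row]
peopleDf.printSchema()                   // column names inferred from the case class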

Throughout this document, we will often refer to Scala/Java Datasets of Rows as DataFrames.

Background and Preparation

  • If you are unfamiliar with SQL, please brush up using the basic links below.
  • SQL allows one to systematically explore any structured data (i.e., tables) using queries. This is a necessary part of the data science process (a small sketch follows below).

One can use the SQL Reference at https://spark.apache.org/docs/latest/sql-ref.html to learn SQL quickly.
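As a tiny taste of such systematic exploration (reusing the people Dataset from the sketches above), one can register the data as a view and aggregate over it in SQL:

// Register the data as a view, then explore it with an aggregation query.
people.createOrReplaceTempView("people")

spark.sql("""
  SELECT (age >= 18) AS is_adult, COUNT(*) AS n, AVG(age) AS mean_age
  FROM people
  GROUP BY (age >= 18)
""").show()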

displayHTML(frameIt("https://en.wikipedia.org/wiki/SQL",500))
displayHTML(frameIt("https://en.wikipedia.org/wiki/Apache_Hive#HiveQL",175))
displayHTML(frameIt("https://spark.apache.org/docs/latest/sql-ref.html",700))