This is an elaboration of the Apache Spark SQL Programming Guide (http://spark.apache.org/docs/latest/sql-programming-guide.html) by Ivan Sadikov and Raazesh Sainudiin.
Any contributions to this 'databricksification' of the programming guide are most welcome. Please feel free to send pull requests, or just fork and push, at/from https://github.com/lamastex/scalable-data-science.
NOTE: Links that are not standard hyper-text transfer protocol URLs, qualified here by (http) or (https), are in general internal links. They will work if you follow the instructions in the lectures (from the YouTube playlist, watched sequentially in chronological order, linked from https://lamastex.github.io/scalable-data-science/sds/2/2/) on how to download the .dbc archive for the course and upload it into your community edition with the correctly named, expected directory structures.
Spark SQL Programming Guide
- Overview
  - SQL
  - DataFrames
  - Datasets
- Getting Started
  - Starting Point: SQLContext
  - Creating DataFrames
  - DataFrame Operations
  - Running SQL Queries Programmatically
  - Creating Datasets
  - Interoperating with RDDs
    - Inferring the Schema Using Reflection
    - Programmatically Specifying the Schema
- Data Sources
  - Generic Load/Save Functions
    - Manually Specifying Options
    - Run SQL on files directly
    - Save Modes
    - Saving to Persistent Tables
  - Parquet Files
    - Loading Data Programmatically
    - Partition Discovery
    - Schema Merging
    - Hive metastore Parquet table conversion
      - Hive/Parquet Schema Reconciliation
      - Metadata Refreshing
    - Configuration
  - JSON Datasets
  - Hive Tables
    - Interacting with Different Versions of Hive Metastore
  - JDBC To Other Databases
  - Troubleshooting
- Performance Tuning
  - Caching Data In Memory
  - Other Configuration Options
- Distributed SQL Engine
  - Running the Thrift JDBC/ODBC server
  - Running the Spark SQL CLI