This is an elaboration of the Apache Spark SQL Programming Guide (http://spark.apache.org/docs/latest/sql-programming-guide.html) by Ivan Sadikov and Raazesh Sainudiin.
Any contributions to this 'databricksification' of the programming guide are most welcome. Please feel free to send pull requests, or just fork and push, at/from https://github.com/lamastex/scalable-data-science.
NOTE: Links that are not standard hyper-text transfer protocol URLs, qualified here by (http) or (https), are in general internal links. They will work if you follow the instructions in the lectures (from the YouTube playlist, watched sequentially in chronological order, linked from https://lamastex.github.io/scalable-data-science/sds/2/2/) on how to download the .dbc archive for the course and upload it into your community edition with the correctly named, expected directory structures.
Spark SQL Programming Guide
- Overview
  - SQL
  - DataFrames
  - Datasets
- Getting Started
  - Starting Point: SQLContext
  - Creating DataFrames
  - DataFrame Operations
  - Running SQL Queries Programmatically
  - Creating Datasets
  - Interoperating with RDDs
    - Inferring the Schema Using Reflection
    - Programmatically Specifying the Schema
- Data Sources
  - Generic Load/Save Functions
    - Manually Specifying Options
    - Run SQL on files directly
    - Save Modes
    - Saving to Persistent Tables
  - Parquet Files
    - Loading Data Programmatically
    - Partition Discovery
    - Schema Merging
    - Hive metastore Parquet table conversion
      - Hive/Parquet Schema Reconciliation
      - Metadata Refreshing
    - Configuration
  - JSON Datasets
  - Hive Tables
    - Interacting with Different Versions of Hive Metastore
  - JDBC To Other Databases
  - Troubleshooting
- Performance Tuning
  - Caching Data In Memory
  - Other Configuration Options
- Distributed SQL Engine
  - Running the Thrift JDBC/ODBC server
  - Running the Spark SQL CLI