Brief Overview of a 360-in-525 Minutes Course Set
For more details, see Overview of a 360-in-525 Minutes Course Set in Data Sciences, Spring 2018.
360-in-525-1: Introduction to Apache Spark for Data Scientists
This is a full-day workshop (1 hp) on April 20, 2018 on Apache Spark, one of the most widely used open-source and commercially friendly software frameworks for analysing big data in industry and academia. A crash course in Scala, the language of Apache Spark, will be followed by an introduction to resilient distributed datasets (RDDs), their transformations and actions, Spark Datasets and DataFrames, and Spark SQL. We will have brief teasers on ML Pipelines, Streaming and GraphX, as they will be covered in depth in the subsequent modules. Concepts will be reinforced by homework assignments that you are expected to do!
Course Content
YouTube Archive of lab-lectures:
- https://youtu.be/HCDzUQZHmaU
- https://youtu.be/Jk4puUbLb_E
- https://youtu.be/j6j4w6BkzZA
- https://youtu.be/dS2V3OanQt4
Individual Databricks notebooks:
- 001. Why Spark?
- 002. Login to Databricks
- 003. Scala Crash Course
- 004. RDD, Transformations and Actions
- 005. RDDs Homework
- 006. Word Count
- 007. SparkSQL Intro
- 007a SparkSQL PG Homework
- 007b SparkSQL PG Homework
- 007c SparkSQL PG Homework
- 007d SparkSQL PG Homework
- 007e SparkSQL PG Homework
- 007f SparkSQL PG Homework
- 010. Wiki Click Streams
All Databricks notebooks
Import all Databricks notebooks for this module as a .dbc file from: