SDS-1.6 on Databricks
Scalable Data Science from Middle Earth: A Big Data Course in Apache Spark 1.6 over Databricks
How to self-learn this content?
The 2016 instance of this scalable-data-science course finished on June 30, 2016.
To learn Apache Spark for free, try Databricks Community Edition, starting from https://databricks.com/try-databricks.
All course content can be loaded for self-paced learning by copying the URL of the 2016/Spark1_6_to_1_3/scalable-data-science.dbc archive and importing it from that URL into your free Databricks Community Edition.
The Gitbook version of this content is https://www.gitbook.com/book/raazesh-sainudiin/scalable-data-science/details.
The browsable GitHub Pages version of the content is http://raazesh-sainudiin.github.io/scalable-data-science/.
How to cite this work?
Scalable Data Science, Raazesh Sainudiin and Sivanand Sivaram, Published by GitBook https://www.gitbook.com/book/raazesh-sainudiin/scalable-data-science/details, 787 pages, 30th June 2016.
Supported By
Databricks Academic Partners Program and Amazon Web Services Educate.
Summary of Contents
- Week 1: Introduction to Scalable Data Science
- Week 2: Introduction to Spark RDDs, Transformations and Actions and Word Count of the US State of the Union Addresses
- Week 3: Introduction to Spark SQL, ETL and EDA of Diamonds, Power Plant and Wiki Click Streams Data
- Week 4: Introduction to Machine Learning - Unsupervised Clustering and Supervised Classification
- Week 5: Introduction to Non-distributed and Distributed Linear Algebra and Applied Linear Regression
- Week 6: Introduction to Spark Streaming, Twitter Collector, Top Hashtag Counter and Streaming Model-Prediction Server
- Week 7: Probabilistic Topic Modelling via Latent Dirichlet Allocation and Intro to XML-parsing of Old Bailey Online
- Week 8: Graph Querying in GraphFrames and Distributed Vertex Programming in GraphX
- Week 9: Deep Learning, Convolutional Neural Nets, Sparkling Water and TensorFlow
- Week 10: Scalable Geospatial Analytics with Magellan
- Week 11 and 12: Student Projects
- Student Projects
  - Dillon George, Scalable Geospatial Algorithms
  - Akinwande Atanda, Twitter Analytics
  - Yinnon Dolev, Deciphering Spider Vision
  - Xin Zhao, Higher Order Spectral Clustering
  - Shanshan Zhou, Exploring EEG
  - Shakira Suwan, Change Detection in Random Graph Series
  - Matthew Hendtlass, The ATP graph
  - Andrey Konstantinov, Keystroke Biometric
  - Dominic Lee, Random Matrices
  - Harry Wallace, Movie Recommender
  - Ivan Sadikov, Reading NetFlow Logs
- Extra Resources
  - AWS Educate
  - Databricksified Spark SQL Programming Guide 1.6
  - Linear Algebra Cheat Sheet
  - Databricksified Data Types in MLLib Programming Guide 1.6
  - Introduction to XML-parsing of Old Bailey Online
Contribute
All course content is currently being pushed by Raazesh Sainudiin after it has been tested in the Databricks cloud (mostly under Spark 1.6, with some content involving Magellan under Spark 1.5.1).
The markdown version for gitbook is generated from the Databricks .scala, .py and other source code.
The gitbook is not a substitute for the Databricks notebooks available in the Databricks cloud. The following issues need to be resolved:
- need to find a stable solution for the output of various Databricks cells to be shown in gitbook, including those from display_HTML and frameIt with their in-place embeds of web content.
Please feel free to fork the GitHub repository.
Furthermore, in anticipation of Spark 2.0, this mostly Spark 1.6 version could be enhanced with a 2.0-specific upgrade.
Please report any typos or send suggestions to raazesh.sainudiin@gmail.com
Please read a note on babel to understand how the gitbook is generated from the .scala source of the Databricks notebook.
Raazesh Sainudiin, Laboratory for Mathematical Statistical Experiments, Christchurch Centre and School of Mathematics and Statistics, University of Canterbury, Private Bag 4800, Christchurch 8041, Aotearoa New Zealand
Sun Jun 19 21:59:19 NZST 2016