SDS-3.x: Scalable Data Science and Distributed Machine Learning

The course is the fifth and final mandatory course in the AI-Track of the WASP Graduate School. It is given in three modules. In addition to academic lectures, there are invited guest speakers from industry.

This site provides course contents for modules 1 and 3 with some background materials for module 2. This content is referred to as sds-3.x here.

Module 1 – Introduction to Data Science: Introduction to fault-tolerant distributed file systems and computing.

The whole data science process illustrated with industrial case-studies. Practical introduction to scalable data processing to ingest, extract, load, transform, and explore (un)structured datasets. Scalable machine learning pipelines to model, train/fit, validate, select, tune, test and predict or estimate in an unsupervised and a supervised setting using nonparametric and partitioning methods such as random forests. Introduction to distributed vertex-programming.

Module 2 – Distributed Deep Learning: Introduction to the theory and implementation of distributed deep learning.

Classification and regression using generalised linear models, including different learning, regularization, and hyperparameter tuning techniques. The feedforward deep network as a fundamental network, and the advanced techniques to overcome its main challenges, such as overfitting, vanishing/exploding gradients, and training speed. Different deep neural network architectures for different kinds of data: for example, CNNs for scaling up neural networks to process large images, RNNs for scaling up deep neural models to long temporal sequences, as well as autoencoders and GANs.

Module 3 – Decision-making with Scalable Algorithms

Theoretical foundations of distributed systems and analysis of their scalable algorithms for sorting, joining, streaming, sketching, optimising and computing in numerical linear algebra with applications in scalable machine learning pipelines for typical decision problems (e.g. prediction, A/B testing, anomaly detection) with various types of data (e.g. time-indexed, space-time-indexed and network-indexed). Privacy-aware decisions with sanitized (cleaned, imputed, anonymised) datasets and datastreams. Practical applications of these algorithms on real-world examples (e.g. mobility, social media, machine sensors and logs). Illustration via industrial use-cases.

Course Content

Upload the course content as a .dbc file into Databricks Community Edition.

Reading Materials Provided

Expected Reference Readings (you need to be logged into your library with access to these publishers):

The Databricks notebooks have been made available as the following course modules:


Introduction to Scalable Data Science and Distributed Machine Learning.

Topics: Apache Spark, Scala, RDD, map-reduce, Ingest, Extract, Load, Transform and Explore with noSQL in SparkSQL.

  1. Introduction: What is Data Science, Data Engineering and the Data Engineering Science Process?
  2. Apache Spark and Big Data
  3. Map-Reduce, Transformations and Actions with Resilient Distributed Datasets
  4. Ingest, Extract, Transform, Load and Explore with noSQL
  5. Ethics, Explainability and Fairness - An Operational View
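
As a rough illustration of the map-reduce pattern listed under topic 3 above, here is a minimal single-machine sketch in plain Python. It is not the course's notebook code and does not use Apache Spark; the function names (map_phase, shuffle_phase, reduce_phase, word_count) are hypothetical and chosen only for this example. It shows the three conceptual steps of a word count: emit key-value pairs, group them by key, and combine each group with an associative operation.

```python
from collections import defaultdict
from functools import reduce


def map_phase(lines):
    """Map: emit a (word, 1) pair for every word in every line."""
    return [(word.lower(), 1) for line in lines for word in line.split()]


def shuffle_phase(pairs):
    """Shuffle: group the emitted values by their key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups):
    """Reduce: combine the values of each key with an associative operation."""
    return {key: reduce(lambda a, b: a + b, values)
            for key, values in groups.items()}


def word_count(lines):
    """Chain the three phases; hypothetical helper for this sketch only."""
    return reduce_phase(shuffle_phase(map_phase(lines)))


# Toy input made up for this example.
counts = word_count(["big data big ideas", "data science at scale"])
print(counts)
```

In a distributed setting the map and reduce steps would run in parallel on partitions of the data, and the shuffle would move values with the same key to the same worker; the sketch keeps everything in one process so that only the logic is visible.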


Deeper Dive into Distributed Machine Learning

Topics: Distributed Simulation; various un/supervised ML Algorithms; Linear Algebra; Vertex Programming using SparkML, GraphX and piped-RDDs.

  1. Creating packages within notebooks
  2. Introduction to Distributed Simulation and Machine Learning
  3. Unsupervised Learning - Clustering, K-Means of 1 Million Songs
  4. Supervised Learning - Clustering, Decision Trees and Hand-written Digit Recognition
  5. Linear Algebra for Distributed Machine Learning
  6. Supervised Learning - Regression and Random Forests
  7. Old Bailey Online - ETL of XML
  8. Piped RDDs - Rigorous Bayesian AB Testing on Old Bailey Online Data
  9. Latent Dirichlet Allocation of NewsGroups and Cornell Movie Dialogs
  10. Collaborative Filtering for Recommendation Systems
  11. Extending built-in functions in GraphX
  12. Fraud Detection with Decision Trees


Several freely available MOOCs, hyperlinks and reference books are used to bolster the learning experience. Please see references to such additional supplementary resources in the above content.

We will be supplementing the lecture notes with reading assignments from original sources.

Here are some resources that may be of further help.

  1. Complete the sbt tutorial
  2. Complete the exercises on using sbt and spark-submit to submit packaged jars to a YARN-managed, HDFS-based Spark cluster.

Please be self-directed and try out more complex exercises on your own, either in the Databricks Community Edition or with sbt in the local Hadoop service, or both.

Mathematical Statistical Foundations

  • Avrim Blum, John Hopcroft and Ravindran Kannan. Foundations of Data Science. Freely available from: It is intended as a modern theoretical course in computer science and statistical learning.
  • Kevin P. Murphy. Machine Learning: A Probabilistic Perspective. ISBN 0262018020. 2013.
  • Trevor Hastie, Robert Tibshirani and Jerome Friedman. Elements of Statistical Learning, Second Edition. ISBN 0387952845. 2009. Freely available from:

Data Science / Data Mining at Scale

  • Jure Leskovec, Anand Rajaraman and Jeffrey Ullman. Mining of Massive Datasets. v2.1, Cambridge University Press. 2014. Freely available from:
  • Foster Provost and Tom Fawcett. Data Science for Business: What You Need to Know about Data Mining and Data-analytic Thinking. ISBN 1449361323. 2013.
  • Mohammed J. Zaki and Wagner Meira Jr. Data Mining and Analysis: Fundamental Concepts and Algorithms. Cambridge University Press. 2014.
  • Cathy O’Neil and Rachel Schutt. Doing Data Science: Straight Talk from the Frontline. O’Reilly. 2014.

Here are some free online courses if you need quick refreshers or want to go in depth into specific subjects.

Apache Spark / shell / GitHub / Scala / Python / TensorFlow / R

Computer Science Refreshers