SDS-2.x: Data Engineering and Data Science with Apache Spark


  • key concepts in distributed fault-tolerant storage and computing, and working knowledge of a data engineering scientist’s toolkit: Shell, Scala, SQL, etc.
  • practical understanding of the data science process:
    • Data Engineering with Apache Spark: ingest, extract, load, transform and explore (IELTE) structured and unstructured datasets
    • Data Science with Apache Spark: model, train/fit, validate/select, tune, test and predict (through an estimator) with a practical understanding of the underlying mathematics, numerics and statistics
    • communicate and serve the model’s predictions to the clients
  • practical applications of IELTE and various scalable predictive ML/AI models, using case studies of real datasets
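The model/train/predict workflow described above can be sketched in plain Scala. This is an illustrative toy (closed-form least squares on one feature), not Spark ML's `Estimator`/`Transformer` API; all names here are hypothetical, chosen only to show the estimator pattern: fit learns parameters from data, and the resulting model maps new inputs to predictions.

```scala
// Hypothetical, minimal "estimator" sketch: fit learns parameters,
// the fitted model predicts. Not Spark ML's API.
case class LinearModel(slope: Double, intercept: Double) {
  def predict(x: Double): Double = slope * x + intercept
}

object LeastSquares {
  // Ordinary least squares for a single feature, closed form.
  def fit(data: Seq[(Double, Double)]): LinearModel = {
    val n = data.size.toDouble
    val meanX = data.map(_._1).sum / n
    val meanY = data.map(_._2).sum / n
    val slope =
      data.map { case (x, y) => (x - meanX) * (y - meanY) }.sum /
      data.map { case (x, _) => (x - meanX) * (x - meanX) }.sum
    LinearModel(slope, meanY - slope * meanX)
  }
}

val model = LeastSquares.fit(Seq((1.0, 2.0), (2.0, 4.0), (3.0, 6.0)))
println(model.predict(4.0)) // y = 2x fits this data exactly, so prints 8.0
```

In Spark ML the same pattern is distributed: an estimator's `fit` runs over a DataFrame and returns a model that transforms new DataFrames.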

Study Format and Assessment

  • There will be suggested assignments or mini-projects, often open-ended, that you can choose from. These will be posed during the lab-lectures and will generally involve undertaking a tutorial, auditing parts of other online courses, and, above all, programming and data analysis in Apache Spark.
    • The assessment format is open, as it allows for self-directed learning to some extent: different individuals may be at different stages of knowledge of Apache Spark and the Hadoop ecosystem in general. There are, of course, minimal expectations that everyone needs to satisfy.
  • You will be expected to present your chosen assignments/mini-projects to your peers the following week. This will create an environment of learning from one another, especially when different assignment options are chosen by individuals or by teams of 2-4 individuals.
  • A certificate of successful completion, which you can add to your LinkedIn profile (if you want), is based on attendance, course participation, successful completion of the suggested programming assignments and, most importantly, a final peer-reviewed project.
    • Part of your suggested assignments/mini-projects (those using publicly available datasets and code) must be published in a public repository as part of a portfolio that provides evidence of your abilities (upon completing the course).
    • However, the larger project, involving teams of 2 to 4 individuals, need not be made publicly available, and you are expected to continue working on it after the course ends.

Instructions to Prepare for SDS-2.x

Follow these instructions to prepare for the course.

Outline of Topics

Uploading Course Content into Databricks Community Edition

Course 1: Data Engineering with Apache Spark

Course 1 involved 32 hours of face-to-face training and 32 hours of homework.

  1. Introduction: What is Data Science, Data Engineering and the Data Engineering Science Process?
  2. Apache Spark and Big Data
  3. Map-Reduce, Transformations and Actions with Resilient Distributed Datasets
  4. Ingest, Extract, Transform, Load and Explore with NoSQL
  5. Distributed Vertex Programming, ETL and Graph Querying with GraphX and GraphFrames
  6. Spark Streaming with Discrete Resilient Distributed Datasets
  7. ETL of GDELT Dataset and XML-structured Dataset
  8. ETL, Exploration and Export of Structured, Semi-Structured and Unstructured Data and Models
  9. Spark Structured Streaming
  10. Sketching for Anomaly Detection in Streams
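As a taste of topic 3, Spark's RDD transformations and actions have the same shape as operations on ordinary Scala collections. Below is a minimal word-count sketch using local collections in place of an RDD, so it only illustrates the programming model, not distributed execution; the comments note the RDD operations each step mirrors.

```scala
// Word count over local Scala collections, mirroring the RDD API.
// In Spark, flatMap/map are lazy transformations on an RDD and
// results are materialized by an action such as collect().
val lines = Seq("to be or not to be", "to see or not to see")

val counts = lines
  .flatMap(_.split("\\s+"))   // like rdd.flatMap: lines -> words
  .map(word => (word, 1))     // like rdd.map: word -> (word, 1)
  .groupBy(_._1)              // stands in for rdd.reduceByKey(_ + _)
  .map { case (w, pairs) => (w, pairs.map(_._2).sum) }

println(counts("to")) // "to" appears four times across the two lines: prints 4
```

The key difference at scale is that `reduceByKey` combines counts per partition before shuffling, whereas `groupBy` here simply collects all pairs in local memory.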

Course 2: Data Science with Apache Spark

Course 2 involved 80 hours of face-to-face training and 16 hours of course project.

  1. Introduction to Data Science: A Computational, Mathematical and Statistical Approach
  2. Introduction to Simulation and Machine Learning
  3. Unsupervised Learning - Clustering
  4. Supervised Learning - Decision Trees
  5. Linear Algebra for Distributed Machine Learning
  6. Supervised Learning - Regression and Random Forests
  7. Unsupervised Learning - Latent Dirichlet Allocation
  8. Collaborative Filtering for Recommendation Systems
  9. Scalable Geospatial Analytics
  10. Natural Language Processing
  11. Neural Networks and Deep Learning
  12. Privacy and GDPR-compliant Machine Learning

Assigned Minimal Exercises

These are complements to the YouTrys in the notebooks (Databricks Community Edition) on your local system (laptop/VMs). Note that these are just the minimal exercises for successful completion of the course. You are expected to be a self-directed learner and to try out more complex exercises on your own by building from the minimal ones.

Assigned Minimal Exercises for Days 1 and 2


  1. Complete the sbt tutorial
  2. Complete the exercises on using sbt to package jars and spark-submit to run them on a YARN-managed, HDFS-based Spark cluster.
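For exercise 2, a minimal `build.sbt` sketch for packaging a Spark application is given below. The name and version numbers are assumptions for illustration; match the Scala and Spark versions to your cluster.

```scala
// Hypothetical minimal build.sbt for an sbt-packaged Spark app.
// Versions are illustrative assumptions -- match your cluster.
name := "sds-exercise"
version := "0.1.0"
scalaVersion := "2.12.18"

// "provided" keeps Spark out of the packaged jar, since the
// cluster supplies Spark at runtime via spark-submit.
libraryDependencies += "org.apache.spark" %% "spark-sql" % "3.5.1" % "provided"
```

After `sbt package`, the jar under `target/scala-2.12/` can be run on the cluster with `spark-submit --master yarn --class <YourMainClass> <your-jar>`.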

More minimal exercises will be assigned once the more generously provisioned learning environment is ready. Please be self-directed and try out more complex exercises on your own, either in the Databricks Community Edition and/or with sbt against the local Hadoop service.


We will be supplementing the lecture notes with reading assignments from original sources.

Here are some resources that may be of further help.

Mathematical Statistical Foundations

  • Avrim Blum, John Hopcroft and Ravindran Kannan. Foundations of Data Science. Intended as a modern theoretical course in computer science and statistical learning. Freely available from:
  • Kevin P. Murphy. Machine Learning: A Probabilistic Perspective. ISBN 0262018020. 2013.
  • Trevor Hastie, Robert Tibshirani and Jerome Friedman. Elements of Statistical Learning, Second Edition. ISBN 0387952845. 2009. Freely available from:

Data Science / Data Mining at Scale

  • Jure Leskovec, Anand Rajaraman and Jeffrey Ullman. Mining of Massive Datasets. v2.1, Cambridge University Press. 2014. Freely available from:
  • Foster Provost and Tom Fawcett. Data Science for Business: What You Need to Know about Data Mining and Data-analytic Thinking. ISBN 1449361323. 2013.
  • Mohammed J. Zaki and Wagner Meira Jr. Data Mining and Analysis: Fundamental Concepts and Algorithms. Cambridge University Press. 2014.
  • Cathy O’Neil and Rachel Schutt. Doing Data Science, Straight Talk From The Frontline. O’Reilly. 2014.

Here are some free online courses if you need quick refreshers or want to go in depth into specific subjects.

Maths/Stats Refreshers

Apache Spark / shell / github / Scala / Python / Tensorflow / R

Computer Science Refreshers