Introduction to Data Engineering Science


  • key concepts in distributed fault-tolerant storage and computing, and working knowledge of a data engineering scientist’s toolkit: Shell, Scala, SQL, etc.
  • practical understanding of the data science process:
    • engineering focus: ingest, extract, load, transform and explore (IELTE) structured and unstructured datasets
    • science focus: model, train/fit, validate/select, tune, test and predict (through an estimator) with a practical understanding of the underlying mathematics, numerics and statistics
    • communicate and serve the model’s predictions to the clients
  • practical applications of IELTE and various scalable predictive ML/AI models, using case-studies of real datasets

Study format, Assessment.

  • There will be suggested assignments or mini-projects, often open-ended, that you can choose from. These will be posed during the lab-lectures and will generally involve undertaking a tutorial, auditing parts of other online courses and, primarily, programming and data analysis in Apache Spark.
    • The assessment format is open to allow for some self-directed learning, since different individuals may be at different stages of knowledge of Apache Spark and the Hadoop ecosystem in general. There will, of course, be minimal expectations that everyone needs to satisfy.
  • You will be expected to present your chosen assignments/mini-projects to your peers the following week. This will create an environment of learning from one another, especially when different assignment options are chosen by individuals or by teams of 2-4 individuals.
  • A certificate of successful completion, which you can add to your LinkedIn profile if you want, is based on attendance, course participation, successful completion of suggested programming assignments and, most importantly, a final peer-reviewed project.
    • Part of your suggested assignments/mini-projects (those that use publicly available datasets and code) needs to be published in a public repository as part of your portfolio, providing evidence of your abilities upon completing the course.
    • However, the larger project involving teams of 2 to 4 individuals need not be made publicly available and you are anticipated to continue working on this after the completion of the course.

Instructions to Prepare for sds-2.x

Follow these instructions to prepare for the course.

Tentative Outline of Topics

  1. Uploading Course Content into Databricks

The steps on how to upload will be explained face-to-face.

  1. Introduction: What is Data Science, Data Engineering and the Data Engineering Science Process?
  2. Apache Spark and Big Data
  3. Map-Reduce, Transformations and Actions with Resilient Distributed Datasets
  4. Ingest, Extract, Transform, Load and Explore with NoSQL
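
As a rough illustration of the Map-Reduce topic in the outline above, the pattern and the distinction between lazy transformations and eager actions can be sketched in plain Python, without Spark. This is only an analogy under stated assumptions: the sample lines and the helper name `word_counts` are hypothetical, generators stand in for lazy transformations, and the final reduction plays the role of an action. Spark's actual RDD API is not used here.

```python
from collections import Counter
from functools import reduce

# Hypothetical sample data standing in for a distributed dataset.
lines = ["spark makes big data simple", "big data needs big clusters"]


def word_counts(lines):
    # "Transformations": generators are lazy, so nothing is computed yet.
    words = (word for line in lines for word in line.split())  # flatMap-like
    pairs = ((word, 1) for word in words)                      # map-like

    # "Action": the reduction forces evaluation and returns a concrete result.
    def merge(acc, pair):
        word, n = pair
        acc[word] += n
        return acc

    return reduce(merge, pairs, Counter())


counts = word_counts(lines)
print(counts["big"], counts["data"], counts["spark"])  # prints: 3 2 1
```

In Spark the same shape is expressed on a distributed dataset, where the transformations only describe the computation and an action triggers it across the cluster.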

Assigned Minimal Exercises

These exercises, done on your local system (laptop/VMs), complement the YouTrys in the notebooks (Databricks Community Edition). Note that these are just the minimal exercises for successful completion of the course. You are expected to be a self-directed learner and try out more complex exercises on your own by building from the minimal ones.

Assigned Minimal Exercises for Days 1 and 2


  1. Complete the sbt tutorial
  2. Complete the exercises on using sbt to package jars and spark-submit to submit them to a YARN-managed, HDFS-based Spark cluster.

More minimal exercises will be assigned once the more generously provisioned learning environment is ready. Please be self-directed and try out more complex exercises on your own in the Databricks Community Edition and/or with sbt in the local Hadoop service.


We will be supplementing the lecture notes with reading assignments from original sources.

Here are some resources that may be of further help.

Mathematical Statistical Foundations

  • Avrim Blum, John Hopcroft and Ravindran Kannan. Foundations of Data Science. Freely available from: It is intended as a modern theoretical course in computer science and statistical learning.
  • Kevin P. Murphy. Machine Learning: A Probabilistic Perspective. ISBN 0262018020. 2013.
  • Trevor Hastie, Robert Tibshirani and Jerome Friedman. Elements of Statistical Learning, Second Edition. ISBN 0387952845. 2009. Freely available from:

Data Science / Data Mining at Scale

  • Jure Leskovec, Anand Rajaraman and Jeffrey Ullman. Mining of Massive Datasets. v2.1, Cambridge University Press. 2014. Freely available from:
  • Foster Provost and Tom Fawcett. Data Science for Business: What You Need to Know about Data Mining and Data-analytic Thinking. ISBN 1449361323. 2013.
  • Mohammed J. Zaki and Wagner Meira Jr. Data Mining and Analysis: Fundamental Concepts and Algorithms. Cambridge University Press. 2014.
  • Cathy O’Neil and Rachel Schutt. Doing Data Science, Straight Talk From The Frontline. O’Reilly. 2014.

Here are some free online courses if you need quick refreshers or want to go in depth into specific subjects.

Maths/Stats Refreshers

Apache Spark / shell / github / Scala / Python / Tensorflow / R

Computer Science Refreshers