SDS2.x: Data Engineering and Data Science with Apache Spark
Contents:
 key concepts in distributed fault-tolerant storage and computing, and working knowledge of a data engineering scientist’s toolkit: Shell/Scala/SQL, etc.
 practical understanding of the data science process:
 Data Engineering with Apache Spark: ingest, extract, load, transform and explore (IELTE) structured and unstructured datasets
 Data Science with Apache Spark: model, train/fit, validate/select, tune, test and predict (through an estimator) with a practical understanding of the underlying mathematics, numerics and statistics
 communicate and serve the model’s predictions to the clients
 practical applications of IELTE and various scalable predictive ML/AI models, using case studies of real datasets
Study format, Assessment.
 There will be suggested assignments or mini-projects, often open-ended, that you can choose from. These will be posed during the lab-lectures and will generally involve undertaking a tutorial or auditing parts of other online courses, but will primarily involve programming and data analysis in Apache Spark.
 The assessment format is open, to allow for some self-directed learning, since different individuals may be at different stages of knowledge of Apache Spark and the Hadoop ecosystem in general. There will, of course, be minimal expectations that everyone needs to satisfy.
 You will be expected to present your chosen assignments/mini-projects to your peers the following week. This will create an environment of learning from one another, especially when different assignment options are chosen by individuals or by teams of 2-4 individuals.
 A certificate of successful completion from lamastex.org, which you can add to your LinkedIn profile if you want, is based on attendance, course participation, successful completion of suggested programming assignments and, most importantly, a final peer-reviewed project.
 Part of your suggested assignments/mini-projects (those that use publicly available datasets and code) needs to be published in a public repository as part of your portfolio, providing evidence of your abilities upon completing the course.
 However, the larger project involving teams of 2 to 4 individuals need not be made publicly available, and you are expected to continue working on it after the completion of the course.
Instructions to Prepare for sds2.x
Follow these instructions to prepare for the course.
Outline of Topics
Uploading Course Content into Databricks Community Edition
Course 1: Data Engineering with Apache Spark
Course 1 involved 32 hours of face-to-face training and 32 hours of homework.
 Introduction: What is Data Science, Data Engineering and the Data Engineering Science Process?
 Apache Spark and Big Data
 MapReduce, Transformations and Actions with Resilient Distributed Datasets
 Ingest, Extract, Transform, Load and Explore with noSQL
 Distributed Vertex Programming, ETL and Graph Querying with GraphX and GraphFrames
 Spark Streaming with Discrete Resilient Distributed Datasets
 ETL of GDELT Dataset and XML-structured Dataset
 ETL, Exploration and Export of Structured, Semi-Structured and Unstructured Data and Models
 Spark Structured Streaming
 Sketching for Anomaly Detection in Streams
 Spark Performance Tuning
Course 2: Data Science with Apache Spark
Course 2 involved 80 hours of face-to-face training and 16 hours of course project.
 Introduction to Data Science: A Computational, Mathematical and Statistical Approach
 Introduction to Simulation and Machine Learning
 Unsupervised Learning - Clustering
 Supervised Learning - Decision Trees
 Linear Algebra for Distributed Machine Learning
 Supervised Learning - Regression and Random Forests
 Unsupervised Learning - Latent Dirichlet Allocation
 Collaborative Filtering for Recommendation Systems
 Scalable Geospatial Analytics
 Natural Language Processing
 Neural Networks and Deep Learning
 Intro to Deep Learning
 Outline for DL
 Neural Networks
 Deep Feed-Forward NNs with Keras
 Hello TensorFlow
 Batch TensorFlow with Matrices
 Convolutional Neural Nets
 MNIST: Multi-Layer Perceptron
 MNIST: Convolutional Neural Net
 CIFAR-10: CNNs
 Recurrent Neural Nets and LSTMs
 LSTM solution
 LSTM spoke Zarathustra
 Generative Networks
 Reinforcement Learning
 DL Operations
 Privacy and GDPR-compliant Machine Learning
Assigned Minimal Exercises
These exercises, to be done on your local system (laptop/VMs), complement the YouTrys in the notebooks (Databricks Community Edition). Note that these are just the minimal exercises for successful completion of the course. You are expected to be a self-directed learner and to try out more complex exercises on your own by building on the minimal ones.
Assigned Minimal Exercises for Days 1 and 2
PREREQUISITES:
 You should have already installed Docker and gone through the setup and preparation instructions for TASK 2.
 Successfully complete at least the SKINNY docker-compose steps 1-5 in Quick Start
 Complete the sbt tutorial
 Complete the exercises that use sbt to package jars and spark-submit to submit them to a YARN-managed, HDFS-based Spark cluster.
More minimal exercises will be assigned once the more generously provisioned learning environment is ready. Please be self-directed and try out more complex exercises on your own, in the Databricks Community Edition and/or with sbt in the local Hadoop service.
Supplements
We will be supplementing the lecture notes with reading assignments from original sources.
Here are some resources that may be of further help.
Mathematical Statistical Foundations
 Avrim Blum, John Hopcroft and Ravindran Kannan. Foundations of Data Science. Freely available from: https://www.cs.cornell.edu/jeh/book2016June9.pdf. It is intended as a modern theoretical course in computer science and statistical learning.
 Kevin P. Murphy. Machine Learning: A Probabilistic Perspective. ISBN 0262018020. 2013.
 Trevor Hastie, Robert Tibshirani and Jerome Friedman. Elements of Statistical Learning, Second Edition. ISBN 0387952845. 2009. Freely available from: https://statweb.stanford.edu/~tibs/ElemStatLearn/.
Data Science / Data Mining at Scale
 Jure Leskovec, Anand Rajaraman and Jeffrey Ullman. Mining of Massive Datasets. v2.1, Cambridge University Press. 2014. Freely available from: http://www.mmds.org/#ver21.
 Foster Provost and Tom Fawcett. Data Science for Business: What You Need to Know about Data Mining and Data-Analytic Thinking. ISBN 1449361323. 2013.
 Mohammed J. Zaki and Wagner Meira Jr. Data Mining and Analysis: Fundamental Concepts and Algorithms. Cambridge University Press. 2014.
 Cathy O’Neil and Rachel Schutt. Doing Data Science: Straight Talk from the Frontline. O’Reilly. 2014.
Here are some free online courses if you need quick refreshers or want to go in-depth into specific subjects.
Maths/Stats Refreshers
 Linear Algebra Refresher Course (with Python)
 Intro to Descriptive Statistics
 Intro to Inferential Statistics
Apache Spark / shell / GitHub / Scala / Python / TensorFlow / R
 Learning Spark: lightning-fast data analytics by Holden Karau, Andy Konwinski, Patrick Wendell, and Matei Zaharia, O’Reilly, 2015.
 Advanced Analytics with Spark: patterns for learning from data at scale, O’Reilly, 2015.
 Command-line Basics
 How to use Git and GitHub: Version control for code
 Intro to Data Analysis: Using NumPy and Pandas
 Data Analysis with R by Facebook
 Machine Learning Crash Course with TensorFlow APIs by Google Developers
 Data Visualization and D3.js
 Scala Programming
 Scala for Data Science, Pascal Bugnion, Packt Publishing, 416 pages, 2016.