SDS-2.x: Data Engineering and Data Science with Apache Spark


  • key concepts in distributed fault-tolerant storage and computing, and working knowledge of a data engineering scientist’s toolkit: Shell, Scala, SQL, etc.
  • practical understanding of the data science process:
    • Data Engineering with Apache Spark: ingest, extract, load, transform and explore (IELTE) structured and unstructured datasets
    • Data Science with Apache Spark: model, train/fit, validate/select, tune, test and predict (through an estimator) with a practical understanding of the underlying mathematics, numerics and statistics
    • communicate and serve the model’s predictions to the clients
  • practical applications of IELTE and various scalable predictive ML/AI models, using case studies of real datasets
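The model/train/predict workflow described above can be sketched in plain Scala. This is an illustrative toy (closed-form least squares on one feature), not Spark ML's `Estimator`/`Transformer` API; all names here are hypothetical, chosen only to show the estimator pattern: fit learns parameters from data, and the resulting model maps new inputs to predictions.

```scala
// Hypothetical, minimal "estimator" sketch: fit learns parameters,
// the fitted model predicts. Not Spark ML's API.
case class LinearModel(slope: Double, intercept: Double) {
  def predict(x: Double): Double = slope * x + intercept
}

object LeastSquares {
  // Ordinary least squares for a single feature, closed form.
  def fit(data: Seq[(Double, Double)]): LinearModel = {
    val n = data.size.toDouble
    val meanX = data.map(_._1).sum / n
    val meanY = data.map(_._2).sum / n
    val slope =
      data.map { case (x, y) => (x - meanX) * (y - meanY) }.sum /
      data.map { case (x, _) => (x - meanX) * (x - meanX) }.sum
    LinearModel(slope, meanY - slope * meanX)
  }
}

val model = LeastSquares.fit(Seq((1.0, 2.0), (2.0, 4.0), (3.0, 6.0)))
println(model.predict(4.0)) // y = 2x fits this data exactly, so prints 8.0
```

In Spark ML the same pattern is distributed: an estimator's `fit` runs over a DataFrame and returns a model that transforms new DataFrames.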

Study Format and Assessment

  • There will be suggested assignments or mini-projects, often open-ended, that you can choose from. These will be posed during the lab-lectures and will generally involve undertaking a tutorial, auditing parts of other online courses, and, above all, programming and data analysis in Apache Spark.
    • The assessment format is open, as it allows for self-directed learning to some extent: different individuals may be at different stages of knowledge of Apache Spark and the Hadoop ecosystem in general. There are, of course, minimal expectations that everyone needs to satisfy.
  • You will be expected to present your chosen assignments/mini-projects to your peers the following week. This will create an environment of learning from one another, especially when different assignment options are chosen by individuals or by teams of 2-4 individuals.
  • A certificate of successful completion, which you can add to your LinkedIn profile (if you want), is based on attendance, course participation, successful completion of the suggested programming assignments and, most importantly, a final peer-reviewed project.
    • Part of your suggested assignments/mini-projects (those using publicly available datasets and code) must be published in a public repository as part of a portfolio that provides evidence of your abilities (upon completing the course).
    • However, the larger project, involving teams of 2 to 4 individuals, need not be made publicly available, and you are expected to continue working on it after the course ends.

Instructions to Prepare for SDS-2.x

Follow these instructions to prepare for the course.

Outline of Topics

Uploading Course Content into Databricks Community Edition

Course 1: Data Engineering with Apache Spark

Course 1 involved 32 hours of face-to-face training and 32 hours of homework.

  1. Introduction: What is Data Science, Data Engineering and the Data Engineering Science Process?
  2. Apache Spark and Big Data
  3. Map-Reduce, Transformations and Actions with Resilient Distributed Datasets
  4. Ingest, Extract, Transform, Load and Explore with NoSQL
  5. Distributed Vertex Programming, ETL and Graph Querying with GraphX and GraphFrames
  6. Spark Streaming with Discrete Resilient Distributed Datasets
  7. ETL of GDELT Dataset and XML-structured Dataset
  8. ETL, Exploration and Export of Structured, Semi-Structured and Unstructured Data and Models
  9. Spark Structured Streaming
  10. Sketching for Anomaly Detection in Streams
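As a taste of topic 3, Spark's RDD transformations and actions have the same shape as operations on ordinary Scala collections. Below is a minimal word-count sketch using local collections in place of an RDD, so it only illustrates the programming model, not distributed execution; the comments note the RDD operations each step mirrors.

```scala
// Word count over local Scala collections, mirroring the RDD API.
// In Spark, flatMap/map are lazy transformations on an RDD and
// results are materialized by an action such as collect().
val lines = Seq("to be or not to be", "to see or not to see")

val counts = lines
  .flatMap(_.split("\\s+"))   // like rdd.flatMap: lines -> words
  .map(word => (word, 1))     // like rdd.map: word -> (word, 1)
  .groupBy(_._1)              // stands in for rdd.reduceByKey(_ + _)
  .map { case (w, pairs) => (w, pairs.map(_._2).sum) }

println(counts("to")) // "to" appears four times across the two lines: prints 4
```

The key difference at scale is that `reduceByKey` combines counts per partition before shuffling, whereas `groupBy` here simply collects all pairs in local memory.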

Course 2: Data Science with Apache Spark

Course 2 involved 80 hours of face-to-face training and 16 hours of course project.

  1. Introduction to Data Science: A Computational, Mathematical and Statistical Approach
  2. Introduction to Simulation and Machine Learning
  3. Unsupervised Learning - Clustering
  4. Supervised Learning - Decision Trees
  5. Linear Algebra for Distributed Machine Learning
  6. Supervised Learning - Regression and Random Forests
  7. Unsupervised Learning - Latent Dirichlet Allocation
  8. Collaborative Filtering for Recommendation Systems
  9. Scalable Geospatial Analytics
  10. Natural Language Processing
  11. Neural Networks and Deep Learning
  12. Privacy and GDPR-compliant Machine Learning

Assigned Minimal Exercises

These are complements to the YouTrys in the notebooks (Databricks Community Edition) on your local system (laptop/VMs). Note that these are just the minimal exercises for successful completion of the course. You are expected to be a self-directed learner and to try out more complex exercises on your own by building from the minimal ones.

Assigned Minimal Exercises for Days 1 and 2


  1. Complete the sbt tutorial
  2. Complete the exercises on using sbt to package jars and spark-submit to run them on a YARN-managed, HDFS-based Spark cluster.
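For exercise 2, a minimal `build.sbt` sketch for packaging a Spark application is given below. The name and version numbers are assumptions for illustration; match the Scala and Spark versions to your cluster.

```scala
// Hypothetical minimal build.sbt for an sbt-packaged Spark app.
// Versions are illustrative assumptions -- match your cluster.
name := "sds-exercise"
version := "0.1.0"
scalaVersion := "2.12.18"

// "provided" keeps Spark out of the packaged jar, since the
// cluster supplies Spark at runtime via spark-submit.
libraryDependencies += "org.apache.spark" %% "spark-sql" % "3.5.1" % "provided"
```

After `sbt package`, the jar under `target/scala-2.12/` can be run on the cluster with `spark-submit --master yarn --class <YourMainClass> <your-jar>`.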

More minimal exercises will be assigned once the more generously provisioned learning environment is ready. Please be self-directed and try out more complex exercises on your own, either in the Databricks Community Edition and/or with sbt against the local Hadoop service.


We will be supplementing the lecture notes with reading assignments from original sources.

Here are some resources that may be of further help.

Mathematical Statistical Foundations

  • Avrim Blum, John Hopcroft and Ravindran Kannan. Foundations of Data Science. Intended as a modern theoretical course in computer science and statistical learning. Freely available from:
  • Kevin P. Murphy. Machine Learning: A Probabilistic Perspective. ISBN 0262018020. 2013.
  • Trevor Hastie, Robert Tibshirani and Jerome Friedman. Elements of Statistical Learning, Second Edition. ISBN 0387952845. 2009. Freely available from:

Data Science / Data Mining at Scale

  • Jure Leskovec, Anand Rajaraman and Jeffrey Ullman. Mining of Massive Datasets. v2.1, Cambridge University Press. 2014. Freely available from:
  • Foster Provost and Tom Fawcett. Data Science for Business: What You Need to Know about Data Mining and Data-analytic Thinking. ISBN 1449361323. 2013.
  • Mohammed J. Zaki and Wagner Meira Jr. Data Mining and Analysis: Fundamental Concepts and Algorithms. Cambridge University Press. 2014.
  • Cathy O’Neil and Rachel Schutt. Doing Data Science, Straight Talk From The Frontline. O’Reilly. 2014.

Here are some free online courses if you need quick refreshers or want to go in depth into specific subjects.

Maths/Stats Refreshers

Apache Spark / shell / github / Scala / Python / Tensorflow / R

Computer Science Refreshers