SDS-2.x: Data Engineering and Data Science with Apache Spark
Contents:
- key concepts in distributed fault-tolerant storage and computing, and working knowledge of a data engineering scientist’s toolkit: Shell/Scala/SQL, etc.
- practical understanding of the data science process:
- Data Engineering with Apache Spark: ingest, extract, load, transform and explore (IELTE) structured and unstructured datasets
- Data Science with Apache Spark: model, train/fit, validate/select, tune, test and predict (through an estimator) with a practical understanding of the underlying mathematics, numerics and statistics
- communicate and serve the model’s predictions to the clients
- practical applications of IELTE and various scalable predictive ML/AI models, using case-studies of real datasets
Study format, Assessment.
- There will be suggested assignments or mini-projects, often open-ended, that you can choose from. These will be posed during the lab-lectures and will generally involve undertaking a tutorial or auditing parts of other online courses, with the primary focus on programming and data analysis in Apache Spark.
- The assessment format is open to allow for some self-directed learning, since different individuals may be at different stages of knowledge about Apache Spark and the Hadoop ecosystem in general. There will, of course, be minimal expectations that everyone needs to satisfy.
- You will be expected to present your chosen assignments/mini-projects to your peers the following week. This will create an environment of learning from one another, especially when different assignment options are chosen by individuals or by teams of 2-4 individuals.
- A certificate of successful completion from lamastex.org, which you can add to your LinkedIn profile if you want, is based on attendance, course participation, successful completion of suggested programming assignments and, most importantly, a final peer-reviewed project.
- Part of your suggested assignments/mini-projects (those that use publicly available datasets and code) needs to be published in a public repository as part of your portfolio, providing evidence of your abilities upon completing the course.
- However, the larger project involving teams of 2 to 4 individuals need not be made publicly available, and you are expected to continue working on it after the completion of the course.
Instructions to Prepare for sds-2.x
Follow these instructions to prepare for the course.
Outline of Topics
Uploading Course Content into Databricks Community Edition
Course 1: Data Engineering with Apache Spark
Course 1 involved 32 hours of face-to-face training and 32 hours of homework.
- Introduction: What is Data Science, Data Engineering and the Data Engineering Science Process?
- Apache Spark and Big Data
- Map-Reduce, Transformations and Actions with Resilient Distributed Datasets
- Ingest, Extract, Transform, Load and Explore with NoSQL
- Distributed Vertex Programming, ETL and Graph Querying with GraphX and GraphFrames
- Spark Streaming with Discrete Resilient Distributed Datasets
- ETL of GDELT Dataset and XML-structured Dataset
- ETL, Exploration and Export of Structured, Semi-Structured and Unstructured Data and Models
- Spark Structured Streaming
- Sketching for Anomaly Detection in Streams
Course 2: Data Science with Apache Spark
Course 2 involved 80 hours of face-to-face training and 16 hours of course project.
- Introduction to Data Science: A Computational, Mathematical and Statistical Approach
- Introduction to Simulation and Machine Learning
- Unsupervised Learning - Clustering
- Supervised Learning - Decision Trees
- Linear Algebra for Distributed Machine Learning
- Supervised Learning - Regression and Random Forests
- Unsupervised Learning - Latent Dirichlet Allocation
- Collaborative Filtering for Recommendation Systems
- Scalable Geospatial Analytics
- Natural Language Processing
- Neural networks and Deep Learning
- Intro to Deep Learning
- Outline for DL
- Neural Networks
- Deep Feed-Forward NNs with Keras
- Hello TensorFlow
- Batch TensorFlow with Matrices
- Convolutional Neural Nets
- MNIST: Multi-Layer-Perceptron
- MNIST: Convolutional Neural Net
- CIFAR-10: CNNs
- Recurrent Neural Nets and LSTMs
- LSTM solution
- LSTM spoke Zarathustra
- Generative Networks
- Reinforcement Learning
- DL Operations
- Privacy and GDPR-compliant Machine Learning
Assigned Minimal Exercises
These exercises, to be done on your local system (laptop/VMs), complement the YouTrys in the notebooks (Databricks Community Edition). Note that these are just the minimal exercises for successful completion of the course. You are expected to be a self-directed learner and to try out more complex exercises on your own by building on the minimal ones.
Assigned Minimal Exercises for Days 1 and 2
PREREQUISITES:
- You should have already installed docker and gone through the setup and preparation instructions for TASK 2.
- Successfully complete at least the SKINNY docker-compose steps 1-5 in Quick Start
- Complete the sbt tutorial
- Complete the exercises on using sbt and spark-submit to run packaged jars on a YARN-managed, HDFS-based Spark cluster.
More minimal exercises will be assigned once the more generously provisioned learning environment is ready. Please be self-directed and try out more complex exercises on your own, in the Databricks Community Edition and/or with sbt in the local Hadoop service.
Supplements
We will be supplementing the lecture notes with reading assignments from original sources.
Here are some resources that may be of further help.
Mathematical Statistical Foundations
- Avrim Blum, John Hopcroft and Ravindran Kannan. Foundations of Data Science. Freely available from: https://www.cs.cornell.edu/jeh/book2016June9.pdf. It is intended as a modern theoretical course in computer science and statistical learning.
- Kevin P. Murphy. Machine Learning: A Probabilistic Perspective. ISBN 0262018020. 2013.
- Trevor Hastie, Robert Tibshirani and Jerome Friedman. Elements of Statistical Learning, Second Edition. ISBN 0387952845. 2009. Freely available from: https://statweb.stanford.edu/~tibs/ElemStatLearn/.
Data Science / Data Mining at Scale
- Jure Leskovec, Anand Rajaraman and Jeffrey Ullman. Mining of Massive Datasets. v2.1, Cambridge University Press. 2014. Freely available from: http://www.mmds.org/#ver21.
- Foster Provost and Tom Fawcett. Data Science for Business: What You Need to Know about Data Mining and Data-analytic Thinking. ISBN 1449361323. 2013.
- Mohammed J. Zaki and Wagner Meira Jr. Data Mining and Analysis: Fundamental Concepts and Algorithms. Cambridge University Press. 2014.
- Cathy O’Neil and Rachel Schutt. Doing Data Science, Straight Talk From The Frontline. O’Reilly. 2014.
Here are some free online courses if you need quick refreshers or want to go in depth into specific subjects.
Maths/Stats Refreshers
- Linear Algebra Refresher Course (with Python)
- Intro to Descriptive Statistics
- Intro to Inferential Statistics
Apache Spark / shell / github / Scala / Python / Tensorflow / R
- Learning Spark: Lightning-Fast Data Analytics by Holden Karau, Andy Konwinski, Patrick Wendell, and Matei Zaharia, O’Reilly, 2015.
- Advanced Analytics with Spark: Patterns for Learning from Data at Scale, O’Reilly, 2015.
- Command-line Basics
- How to use Git and GitHub: Version control for code
- Intro to Data Analysis: Using NumPy and Pandas
- Data Analysis with R by Facebook
- Machine Learning Crash Course with TensorFlow APIs by Google Developers
- Data Visualization and D3.js
- Scala Programming
- Scala for Data Science, Pascal Bugnion, Packt Publishing, 416 pages, 2016.