# SDS-3.x: Scalable Data Science and Distributed Machine Learning

The course is developed for the AI-Track of the WASP Graduate School and for the Centre for Interdisciplinary Mathematics at Uppsala University.
It is given in several modules for the *ScaDaMaLe-WASP* and *ScaDaMaLe-UU* course instances.

This site provides course contents for the two instances with multiple deep-dive pathways.
These contents, instances and pathways are packaged into numbered Modules that contain multiple course packs, referred to here by their **sds-3.x** suffixes.

# ScaDaMaLe-UU

This is the instance of the course for students at Uppsala University.

ScaDaMaLe-UU is given in the following four modules, which are worth 3-6 hp each. Students can take just Module 01 or continue with more Modules.

In addition to academic lectures there may be invited guest speakers from industry.

# Module 01 (3 hp)

## Introduction to Data Science: Introduction to fault-tolerant distributed file systems and computing.

**Prerequisites:** Experience in at least one programming language, maturity in mathematical abstractions (or courses in multivariate calculus, linear algebra, probability, etc. at least at the undergraduate level) and/or permission of the instructor.

The whole data science process illustrated with industrial case-studies. Practical introduction to scalable data processing to ingest, extract, load, transform, and explore (un)structured datasets. Scalable machine learning pipelines to model, train/fit, validate, select, tune, test and predict or estimate in an unsupervised and a supervised setting using nonparametric and partitioning methods such as random forests. Introduction to distributed vertex-programming.

Contents: 000_1-sds-3-x, 000_2-sds-3-x-ml

# Module 02 (3 hp)

## Distributed Deep Learning: Introduction to the theory and implementation of distributed deep learning.

**Prerequisites:** Passing Module 01 with active course participation and completion of Assignments.

Classification and regression using generalised linear models, including different learning, regularization, and hyperparameter tuning techniques. The feedforward deep network as a fundamental network, and the advanced techniques to overcome its main challenges, such as overfitting, vanishing/exploding gradients, and training speed. Various deep neural networks for various kinds of data. For example, CNNs for scaling up neural networks to process large images, RNNs to scale up deep neural models to long temporal sequences, and autoencoders and GANs.

Contents: 000_6-sds-3-x-dl, 000_7-sds-3-x-ddl

# Module 03 (4 hp)

## Problem Domains and Projects in Data Science

**Prerequisites:** Module 01 and/or Module 02; or Module 04.

This module allows one to explore different domains to solve specific decision problems (e.g. prediction, A/B testing, anomaly detection, etc.) with various types of data (e.g. time-indexed, space-time-indexed and network-indexed). Privacy-aware decisions with sanitized (cleaned, imputed, anonymised) datasets and datastreams. Practical applications of these algorithms on real-world examples (e.g. mobility, social media, machine sensors and logs). Illustration via industrial use-cases.

As we explore different domains, students are encouraged to form groups to do a group project in an application domain we have explored, or in another domain they can explore based on their preparedness from Modules 01, 02 and 03. Such projects are typically meant to be of direct relevance to a student’s research area.

Contents: 000_3-sds-3-x-st, 000_4-sds-3-x-ss, 000_8-sds-3-x-pri, 000_9-sds-3-x-trends

# Module 04 (6 hp)

## Distributed Algorithms and Optimisation (advanced)

This course is for advanced PhD students who have already taken a previous instance of ScaDaMaLe-WASP or ScaDaMaLe-UU (sds and 360-in-525 series in 2017 or 2018) and have the permission of the instructor.

Theoretical foundations of distributed systems and analysis of their scalable algorithms for sorting, joining, streaming, sketching, optimising and computing in numerical linear algebra with applications in scalable machine learning pipelines for typical decision problems.

This is a reading course that aims to learn from and refine the dao (Distributed Algorithms and Optimization) lecture notes of Reza Zadeh at Stanford University.

Module 04 may be combined with Module 03 if the student wants to dive deeper into a theoretical project.

# Course Content

Upload Course Content as a `.dbc` file into Databricks Community Edition.

- 2021 dbc ARCHIVES - **to be updated by 20220315**

The Databricks notebooks have been made available as the following course modules:

# 000_1-sds-3-x

## Introduction to Scalable Data Science and Distributed Machine Learning

**Topics:** *Apache Spark, Scala, RDD, map-reduce, Ingest, Extract, Load, Transform and Explore with noSQL in SparkSQL.*
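As a small illustration of the map-reduce pattern named in these topics, here is a minimal word-count sketch in plain Python. It is not Spark code and is not taken from the course notebooks; the function names (`map_phase`, `shuffle_phase`, `reduce_phase`, `word_count`) are hypothetical and chosen only for this example.

```python
from collections import defaultdict
from functools import reduce


def map_phase(lines):
    """Map: emit a (word, 1) pair for every word in every line."""
    return [(word.lower(), 1) for line in lines for word in line.split()]


def shuffle_phase(pairs):
    """Shuffle: group the emitted values by their key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups):
    """Reduce: combine each key's values with an associative operation (sum)."""
    return {key: reduce(lambda a, b: a + b, values) for key, values in groups.items()}


def word_count(lines):
    """Chain the three phases on an in-memory list of lines."""
    return reduce_phase(shuffle_phase(map_phase(lines)))


if __name__ == "__main__":
    print(word_count(["to be or not to be", "to scale or not"]))
```

In a distributed engine the map and reduce steps run in parallel over partitions of the data and the shuffle moves records between machines; this single-process sketch only shows the shape of the computation.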

# 000_2-sds-3-x-ml

## Deeper Dive into Distributed Machine Learning

**Topics:** *Distributed Simulation; various un/supervised ML Algorithms; Linear Algebra; Vertex Programming using SparkML, GraphX and piped-RDDs.*

# 000_3-sds-3-x-st

## Introduction to Spark Streaming

**Topics:** *Introduction to Spark Streaming with Discrete RDDs and live Experiments in Twitter with Interactive custom D3 Interactions.*

# 000_4-sds-3-x-ss

## Introduction to Spark Structured Streaming

**Topics:** *Introduction to Spark Structured Streaming and Sketching*
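To give a flavour of what "sketching" a stream means, here is a minimal Count-Min sketch in plain Python. It is one well-known sketching technique, chosen here as an assumption; the course notebooks may cover different sketches. The class name, the default `width` and `depth`, and the SHA-256-based row hashing are all illustrative choices.

```python
import hashlib


class CountMinSketch:
    """Approximate frequency counter using a fixed-size table.

    Estimates never undercount; they may overcount when items collide.
    """

    def __init__(self, width=256, depth=4):
        self.width = width
        self.depth = depth
        self.table = [[0] * width for _ in range(depth)]

    def _index(self, item, row):
        # Derive one hash function per row by salting the item with the row number.
        digest = hashlib.sha256(f"{row}:{item}".encode("utf-8")).hexdigest()
        return int(digest, 16) % self.width

    def add(self, item, count=1):
        # Increment one counter in every row.
        for row in range(self.depth):
            self.table[row][self._index(item, row)] += count

    def estimate(self, item):
        # The minimum over rows is the least-contaminated counter.
        return min(self.table[row][self._index(item, row)] for row in range(self.depth))


if __name__ == "__main__":
    sketch = CountMinSketch()
    for word in ["spark", "stream", "spark", "sketch", "spark"]:
        sketch.add(word)
    print(sketch.estimate("spark"))
```

The memory used is `width * depth` counters regardless of how many distinct items the stream contains, which is the property that makes such summaries useful on unbounded streams.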

# 000_6-sds-3-x-dl

## Introduction to Deep Learning

**Topics:** *Introduction to Deep Learning with Keras, TensorFlow, PyTorch and PySpark. Topics: TensorFlow Basics, Artificial Neural Networks, Multilayer Deep Feed-Forward Neural Networks, Training, Convolutional Neural Networks, Recurrent Neural Networks like LSTM and GRU, Generative Networks or Patterns, Introduction to Reinforcement Learning, Real-world Operations and MLOps with PyTorch and MLflow for image classification.*

# 000_7-sds-3-x-ddl

## Introduction to Distributed Deep Learning

**Topics:** *Introduction to Distributed Deep Learning (DDL) with Horovod over TensorFlow/Keras and PyTorch. DDL of various CNN architectures for image segmentation with Horovod, MLflow and hyper-parameter tuning through SparkTrials.*

# 000_8-sds-3-x-pri

## Privacy-Preserving Data Science

**Topics:** *Introduction to Privacy-aware Scalable Data Science. Topics: Data Sanitization, Algorithmic Sanitization, GDPR law and its implications for AI/ML, Pseudonymization and Differential Privacy, Minimax Optimal Procedures for Locally Private Estimation*

# 000_9-sds-3-x-trends

## Trends in Money Market and Media

**Topics:** *Trends in Financial Stocks and News Events using GDELT mass media metadata and yfinance via a Scalably Streamable Trend Calculus. Exploring Events and Persons of Interest at Time Points of Trend Reversals in Commodity Oil Price.*

# Reference Readings

- dao
- https://learning.oreilly.com/library/view/high-performance-spark/9781491943199/
- https://learning.oreilly.com/library/view/spark-the-definitive/9781491912201/
- https://learning.oreilly.com/library/view/learning-spark-2nd/9781492050032/
- Introduction to Algorithms, Third Edition, by Thomas H. Cormen, Charles E. Leiserson, Ronald L. Rivest, and Clifford Stein.

## Supplements

Several freely available MOOCs, hyperlinks and reference books are used to bolster the learning experience. Please see references to such additional supplementary resources in the above content.

We will be supplementing the lecture notes with reading assignments from original sources.

Here are some resources that may be of further help.

### Recommended Preparations for Data *Engineering* Scientists:

- You should have already installed docker and gone through the setup and preparation instructions for TASK 2.
- Successfully complete at least the SKINNY `docker-compose` steps 1-5 in Quick Start
- Complete the sbt tutorial
- Complete the exercises on using sbt and on submitting packaged jars with spark-submit to a YARN-managed, HDFS-based Spark cluster.

Please be self-directed and try out more complex exercises on your own, either in the Databricks Community Edition or with sbt in the local Hadoop service, or both.

### Mathematical Statistical Foundations

- Avrim Blum, John Hopcroft and Ravindran Kannan. Foundations of Data Science. Freely available from: https://www.cs.cornell.edu/jeh/book2016June9.pdf. It is intended as a modern theoretical course in computer science and statistical learning.
- Kevin P. Murphy. Machine Learning: A Probabilistic Perspective. ISBN 0262018020. 2013.
- Trevor Hastie, Robert Tibshirani and Jerome Friedman. Elements of Statistical Learning, Second Edition. ISBN 0387952845. 2009. Freely available from: https://statweb.stanford.edu/~tibs/ElemStatLearn/.

### Data Science / Data Mining at Scale

- Jure Leskovec, Anand Rajaraman and Jeffrey Ullman. Mining of Massive Datasets. v2.1, Cambridge University Press. 2014. Freely available from: http://www.mmds.org/#ver21.
- Foster Provost and Tom Fawcett. Data Science for Business: What You Need to Know about Data Mining and Data-analytic Thinking. ISBN 1449361323. 2013.
- Mohammed J. Zaki and Wagner Meira Jr. Data Mining and Analysis: Fundamental Concepts and Algorithms. Cambridge University Press. 2014.
- Cathy O’Neil and Rachel Schutt. Doing Data Science, Straight Talk From The Frontline. O’Reilly. 2014.

Here are some free online courses if you need quick refreshers or want to go in depth into specific subjects.

### Apache Spark / shell / github / Scala / Python / Tensorflow / R

- Learning Spark : lightning-fast data analytics by Holden Karau, Andy Konwinski, Patrick Wendell, and Matei Zaharia, O’Reilly, 2015.
- Advanced analytics with Spark : patterns for learning from data at scale, O’Reilly, 2015.
- Command-line Basics
- How to use Git and GitHub: Version control for code
- Intro to Data Analysis: Using NumPy and Pandas
- Data Analysis with R by Facebook
- Machine Learning Crash Course with TensorFlow APIs by Google Developers
- Data Visualization and D3.js
- Scala Programming
- Scala for Data Science, Pascal Bugnion, Packt Publishing, 416 pages, 2016.

### Computer Science Refreshers

## ScaDaMaLe-WASP

This instance for the AI-Track of the WASP Graduate School is divided into the following three modules.

The next instance of ScaDaMaLe-WASP is in Fall 2022.

In addition to academic lectures there will be invited guest speakers from industry.

**Module 01** (2 hp) – Introduction to Data Science: Introduction to fault-tolerant distributed file systems and computing.

Prerequisites: Programming experience in at least one programming language and permission of the instructor.

The whole data science process illustrated with industrial case-studies. Practical introduction to scalable data processing to ingest, extract, load, transform, and explore (un)structured datasets. Scalable machine learning pipelines to model, train/fit, validate, select, tune, test and predict or estimate in an unsupervised and a supervised setting using nonparametric and partitioning methods such as random forests. Introduction to distributed vertex-programming.

**Module 02** (2 hp) – Distributed Deep Learning: Introduction to the theory and implementation of distributed deep learning.

Prerequisites: Passing Module 01 with active course participation.

Classification and regression using generalised linear models, including different learning, regularization, and hyperparameter tuning techniques. The feedforward deep network as a fundamental network, and the advanced techniques to overcome its main challenges, such as overfitting, vanishing/exploding gradients, and training speed. Various deep neural networks for various kinds of data. For example, CNNs for scaling up neural networks to process large images, RNNs to scale up deep neural models to long temporal sequences, and autoencoders and GANs.

**Module 03** (2 hp) – Decision-making with Scalable Algorithms

Theoretical foundations of distributed systems and analysis of their scalable algorithms for sorting, joining, streaming, sketching, optimising and computing in numerical linear algebra with applications in scalable machine learning pipelines for typical decision problems (e.g. prediction, A/B testing, anomaly detection) with various types of data (e.g. time-indexed, space-time-indexed and network-indexed). Privacy-aware decisions with sanitized (cleaned, imputed, anonymised) datasets and datastreams. Practical applications of these algorithms on real-world examples (e.g. mobility, social media, machine sensors and logs). Illustration via industrial use-cases.