
ScaDaMaLe Course site and book

Introduction

  • Course Name: Scalable Data Science and Distributed Machine Learning
  • Course Acronym: ScaDaMaLe or sds-3.x.

The course was designed to be the fifth and final mandatory course in the AI-Track of the WASP Graduate School in 2021. From 2022 ScaDaMaLe is an optional course for WASP students who have successfully completed the mandatory courses. It is given in three modules. In addition to academic lectures there are invited guest speakers from industry.

The course can also be taken by select post-graduate students at Uppsala University as a Special Topics Course from the Department of Mathematics.

This site provides course contents for the three modules. This content is referred to as sds-3.x here.

Module 1 – Introduction to Data Science: Introduction to fault-tolerant distributed file systems and computing.

The whole data science process illustrated with industrial case-studies. Practical introduction to scalable data processing to ingest, extract, load, transform, and explore (un)structured datasets. Scalable machine learning pipelines to model, train/fit, validate, select, tune, test and predict or estimate in an unsupervised and a supervised setting using nonparametric and partitioning methods such as random forests. Introduction to distributed vertex-programming.
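To make the map-shuffle-reduce pattern behind such scalable data processing concrete, here is a minimal word-count sketch in plain Scala. It mimics locally what Spark does per partition across a cluster; all names in it are illustrative and not taken from the course notebooks.

```scala
// A local sketch of the ingest -> transform -> explore steps: the same
// map/shuffle/reduce pattern Spark applies per partition on a cluster.
object WordCountSketch {
  def wordCount(lines: Seq[String]): Map[String, Int] =
    lines
      .flatMap(_.toLowerCase.split("\\s+")) // "map" stage: tokenize each line
      .filter(_.nonEmpty)
      .groupBy(identity)                    // "shuffle" stage: group by key
      .map { case (w, ws) => w -> ws.size } // "reduce" stage: count per key
}
```

On a cluster, the tokenizing runs independently on each partition and only the grouped counts are moved between machines; the local `groupBy` stands in for that shuffle.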

Module 2 – Distributed Deep Learning: Introduction to the theory and implementation of distributed deep learning.

Classification and regression using generalised linear models, including different learning, regularization, and hyperparameter tuning techniques. The feedforward deep network as a fundamental network, and the advanced techniques for overcoming its main challenges, such as overfitting, vanishing/exploding gradients, and training speed. Various deep neural networks for various kinds of data: for example, CNNs for scaling up neural networks to process large images, RNNs for scaling up deep neural models to long temporal sequences, and autoencoders and GANs.

Module 3 – Decision-making with Scalable Algorithms

Theoretical foundations of distributed systems and analysis of their scalable algorithms for sorting, joining, streaming, sketching, optimising and computing in numerical linear algebra, with applications in scalable machine learning pipelines for typical decision problems (e.g. prediction, A/B testing, anomaly detection) with various types of data (e.g. time-indexed, space-time-indexed and network-indexed). Privacy-aware decisions with sanitized (cleaned, imputed, anonymised) datasets and datastreams. Practical applications of these algorithms on real-world examples (e.g. mobility, social media, machine sensors and logs). Illustration via industrial use-cases.
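As a taste of the sketching algorithms covered in this module, here is a minimal Count-Min sketch in plain Scala: it estimates item frequencies in a data stream using fixed memory, independent of stream length, and its estimates never undercount. This is an illustrative sketch written for this page, not code from the course notebooks.

```scala
import scala.util.hashing.MurmurHash3

// Count-Min sketch: approximate frequency counts for a data stream
// in O(width * depth) memory, independent of stream length.
class CountMinSketch(width: Int, depth: Int) {
  private val table = Array.ofDim[Long](depth, width)

  // One hash function per row, derived by seeding MurmurHash3 with the row.
  private def bucket(item: String, row: Int): Int = {
    val h = MurmurHash3.stringHash(item, row) % width
    if (h < 0) h + width else h
  }

  def add(item: String, count: Long = 1L): Unit =
    for (r <- 0 until depth) table(r)(bucket(item, r)) += count

  // Collisions can only inflate counters, so the minimum over rows
  // overestimates (never underestimates) the true count.
  def estimate(item: String): Long =
    (0 until depth).map(r => table(r)(bucket(item, r))).min
}
```

Widening the table reduces collisions (and hence overestimation), while adding rows reduces the probability that every row of an item collides with a heavy hitter.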

Expected Reference Readings

Note that you need to be logged into your library with access to these publishers:

Course Contents

The databricks notebooks will be made available as the course progresses.

Course Assessment

There will be minimal reading and coding exercises, which will not be graded. The main assessment will be based on a peer-reviewed group project. The group project will include notebooks/code along with a video of the project presentation. Each group cannot have more than four members. The project should be seen as an opportunity to do something you are passionate about or interested in, as opposed to completing an auto-gradeable programming assessment in the shortest amount of time.

Detailed instructions will be given in the sequel.

Course Sponsors

The course builds on contents developed since 2016 with support from New Zealand's Data Industry. The 2017-2019 versions were academically sponsored by Uppsala University's Inter-Faculty Course grant, Department of Mathematics and The Centre for Interdisciplinary Mathematics and industrially sponsored by databricks, AWS and Swedish data industry via Combient AB, SEB and Combient Mix AB. This 2021 version is academically sponsored by AI-Track of the WASP Graduate School and Centre for Interdisciplinary Mathematics and industrially sponsored by databricks and AWS via databricks University Alliance and Combient Mix AB via industrial mentorships.

Course Instructor

I, Raazesh Sainudiin or Raaz, will be an instructor for the course.

I have

  • more than 15 years of academic research experience in applied mathematics and statistics and
  • over 3 years of full-time and over 5 years of part-time experience in the data industry.

I currently (2020) have an effective joint appointment as:

Quick links on Raaz's background:

Industrial Case Study

We will see an industrial case-study that will illustrate a concrete data science process in action in the sequel.

What is the Data Science Process?

The Data Science Process in one picture



What is scalable data science and distributed machine learning?

Scalability merely refers to the ability of the data science process to scale to massive datasets (popularly known as big data).

For this we need distributed fault-tolerant computing typically over large clusters of commodity computers -- the core infrastructure in a public cloud today.
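A toy local analogue of why such cluster computing works: when an operation is associative, the data can be partitioned, each partial result computed independently (as a worker node would), and the partials merged cheaply (as the driver would). The names below are ours, for illustration only.

```scala
// Local analogue of distributed aggregation: partition the data,
// reduce each partition independently (as worker nodes would, in
// parallel), then merge the partial results (as the driver would).
// Correct for any associative operation; here, integer addition.
object PartitionedSum {
  def distributedSum(data: Vector[Int], numPartitions: Int): Int = {
    val chunkSize  = math.max(1, math.ceil(data.size.toDouble / numPartitions).toInt)
    val partitions = data.grouped(chunkSize).toVector
    val partials   = partitions.map(_.sum) // independent per-partition work
    partials.sum                           // cheap merge step on the "driver"
  }
}
```

Fault tolerance then comes from the fact that any lost partition's partial result can be recomputed from the underlying (replicated) data without redoing the whole job.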

Distributed Machine Learning allows the models in the data science process to be scalably trained and extract value from big data.

What is Data Science?

It is increasingly accepted that Data Science

is an inter-disciplinary field that uses scientific methods, processes, algorithms and systems to extract knowledge and insights from structured and unstructured data. Data science is related to data mining, machine learning and big data.

Data science is a "concept to unify statistics, data analysis and their related methods" in order to "understand and analyze actual phenomena" with data. It uses techniques and theories drawn from many fields within the context of mathematics, statistics, computer science, domain knowledge and information science. Turing award winner Jim Gray imagined data science as a "fourth paradigm" of science (empirical, theoretical, computational and now data-driven) and asserted that "everything about science is changing because of the impact of information technology" and the data deluge.

Now, let us look at two industrially-informed academic papers that influence the above quote on what is Data Science, but with a view towards the contents and syllabus of this course.

Source: Vasant Dhar, Data Science and Prediction, Communications of the ACM, Vol. 56 (1). p. 64, DOI:10.1145/2500499

key insights in the above paper

  • Data Science is the study of the generalizable extraction of knowledge from data.
  • A common epistemic requirement in assessing whether new knowledge is actionable for decision making is its predictive power, not just its ability to explain the past.
  • A data scientist requires an integrated skill set spanning
    • mathematics,
    • machine learning,
    • artificial intelligence,
    • statistics,
    • databases, and
    • optimization,
    • along with a deep understanding of the craft of problem formulation to engineer effective solutions.

Source: Machine learning: Trends, perspectives, and prospects, M. I. Jordan, T. M. Mitchell, Science 17 Jul 2015: Vol. 349, Issue 6245, pp. 255-260, DOI: 10.1126/science.aaa8415

key insights in the above paper

  • ML is concerned with the building of computers that improve automatically through experience
  • ML lies at the intersection of computer science and statistics and at the core of artificial intelligence and data science
  • Recent progress in ML is due to:
    • development of new algorithms and theory
    • ongoing explosion in the availability of online data
    • availability of low-cost computation (through clusters of commodity hardware in the cloud)
  • The adoption of data science and ML methods is leading to more evidence-based decision-making across:
    • health sciences (neuroscience research)
    • manufacturing
    • robotics (autonomous vehicles)
    • vision, speech processing, natural language processing
    • education
    • financial modeling
    • policing
    • marketing

But what is Data Engineering (including Machine Learning Engineering and Operations) and how does it relate to Data Science?

Data Engineering

There are several views on what a data engineer is supposed to do:

Some views are rather narrow and emphasise division of labour between data engineers and data scientists:

"Ian Buss, principal solutions architect at Cloudera, notes that data scientists focus on finding new insights from a data set, while data engineers are concerned with the production readiness of that data and all that comes with it: formats, scaling, resilience, security, and more."

What skills do data engineers need? Those “10-30 different big data technologies” Anderson references in “Data engineers vs. data scientists” can fall under numerous areas, such as file formats, ingestion engines, stream processing, batch processing, batch SQL, data storage, cluster management, transaction databases, web frameworks, data visualizations, and machine learning. And that’s just the tip of the iceberg.

Buss says data engineers should have the following skills and knowledge:

  • They need to know Linux and they should be comfortable using the command line.
  • They should have experience programming in at least Python or Scala/Java.
  • They need to know SQL.
  • They need some understanding of distributed systems in general and how they are different from traditional storage and processing systems.
  • They need a deep understanding of the ecosystem, including ingestion (e.g. Kafka, Kinesis), processing frameworks (e.g. Spark, Flink) and storage engines (e.g. S3, HDFS, HBase, Kudu). They should know the strengths and weaknesses of each tool and what it's best used for.
  • They need to know how to access and process data.

Let's dive deeper into such highly compartmentalised views of data engineers, data scientists, and the so-called "machine learning engineers", according to the following view:



The Data Engineering Scientist as "The Middle Way"

Here are some basic axioms that should be self-evident.

  • Yes, there are differences in skillsets across humans:
    • some humans, by nature and nurture, are better at and more inclined towards engineering, while others lean towards pure mathematics
    • one human can only very rarely be a master of everything needed to innovate a new data-based product or service
  • Skills can be gained by any human who wants to learn, to the extent that s/he is able to expend the time, energy, etc.

For the Scalable Data Engineering Science Process: towards Production-Ready and Productisable Prototyping for the Data-based Factory, we need to allow each data engineer to become more of a data scientist, and each data scientist to become more of a data engineer, up to each individual's comfort zone in the technical, mathematical/conceptual, and time-availability planes, but with some minimal expectations of mutual appreciation.

This course is designed to help you take the first minimal steps towards such a data engineering science.

In the sequel it will become apparent why a team of data engineering scientists with skills across the conventional (2021) spectrum of data engineer versus data scientist is crucial for Production-Ready and Productisable Prototyping for the Data-based Factory, whose outputs include standard AI products today.

Standing on shoulders of giants!

This course was originally structured from two other edX courses from 2015. Unfortunately, these courses and their content, including video lectures and slides, are no longer openly available.

  • BerkeleyX/CS100-1x, Introduction to Big Data Using Apache Spark by Anthony A Joseph, Chancellor's Professor, UC Berkeley
  • BerkeleyX/CS190-1x, Scalable Machine Learning by Ameet Talwalkar, Asst. Prof., UC Los Angeles

This course will be an expanded and up-to-date Scala version, with an emphasis on an individualized course project as opposed to completing labs that test syntactic skills and are auto-gradeable.

We will also be borrowing more theoretical aspects from the following course:

Note the Expected Reference Readings above for this course.