Introduction
- Course Name: Scalable Data Science and Distributed Machine Learning
- Course Acronym: ScaDaMaLe or sds-3.x.
The course was designed to be the fifth and final mandatory course in the AI-Track of the WASP Graduate School in 2021. From 2022, ScaDaMaLe is an optional course for WASP students who have successfully completed the mandatory courses. It is given in three modules. In addition to academic lectures, there are invited guest speakers from industry.
The course can also be taken by select post-graduate students at Uppsala University as a Special Topics Course from the Department of Mathematics.
This site provides course contents for the three modules. This content is referred to as sds-3.x here.
Module 1 – Introduction to Data Science: Introduction to fault-tolerant distributed file systems and computing.
The whole data science process illustrated with industrial case-studies. Practical introduction to scalable data processing to ingest, extract, load, transform, and explore (un)structured datasets. Scalable machine learning pipelines to model, train/fit, validate, select, tune, test and predict or estimate in an unsupervised and a supervised setting using nonparametric and partitioning methods such as random forests. Introduction to distributed vertex-programming.
Module 2 – Distributed Deep Learning: Introduction to the theory and implementation of distributed deep learning.
Classification and regression using generalised linear models, including different learning, regularization, and hyperparameter-tuning techniques. The feedforward deep network as the fundamental network, and advanced techniques to overcome its main challenges, such as overfitting, vanishing/exploding gradients, and training speed. Various deep neural networks for various kinds of data: for example, CNNs for scaling up neural networks to process large images, RNNs for scaling deep models to long temporal sequences, and autoencoders and GANs.
Module 3 – Decision-making with Scalable Algorithms
Theoretical foundations of distributed systems and analysis of their scalable algorithms for sorting, joining, streaming, sketching, optimising and computing in numerical linear algebra, with applications in scalable machine learning pipelines for typical decision problems (e.g., prediction, A/B testing, anomaly detection) with various types of data (e.g., time-indexed, space-time-indexed and network-indexed). Privacy-aware decisions with sanitised (cleaned, imputed, anonymised) datasets and datastreams. Practical applications of these algorithms on real-world examples (e.g., mobility, social media, machine sensors and logs). Illustration via industrial use-cases.
Expected Reference Readings
Note that you need to be logged into your library with access to these publishers:
- https://learning.oreilly.com/library/view/high-performance-spark/9781491943199/
- https://learning.oreilly.com/library/view/spark-the-definitive/9781491912201/
- https://learning.oreilly.com/library/view/learning-spark-2nd/9781492050032/
- Introduction to Algorithms, Third Edition, by Thomas H. Cormen, Charles E. Leiserson, Ronald L. Rivest, and Clifford Stein
- Reading Materials Provided
Course Contents
The databricks notebooks will be made available as the course progresses at the:
- course [site](https://lamastex.github.io/scalable-data-science/sds/3/x/) and
- course [book](https://lamastex.github.io/ScaDaMaLe/index.html).
- You may upload Course Content into Databricks Community Edition from:
Course Assessment
There will be minimal reading and coding exercises that will not be graded. The main assessment will be based on a peer-reviewed group project. The group project will include notebooks/code along with a video of the project presentation. Each group cannot have more than four members. The project should be seen as an opportunity to do something you are passionate about or interested in, as opposed to completing an auto-gradeable programming assessment in the shortest amount of time.
Detailed instructions will be given in the sequel.
Course Sponsors
The course builds on contents developed since 2016 with support from New Zealand's Data Industry. The 2017-2019 versions were academically sponsored by Uppsala University's Inter-Faculty Course grant, Department of Mathematics and The Centre for Interdisciplinary Mathematics and industrially sponsored by databricks, AWS and Swedish data industry via Combient AB, SEB and Combient Mix AB. This 2021 version is academically sponsored by AI-Track of the WASP Graduate School and Centre for Interdisciplinary Mathematics and industrially sponsored by databricks and AWS via databricks University Alliance and Combient Mix AB via industrial mentorships.
Course Instructor
I, Raazesh Sainudiin or Raaz, will be an instructor for the course.
I have
- more than 15 years of academic research experience in applied mathematics and statistics and
- over 3 years of full-time and 5 years of part-time experience in the data industry.
I currently (2020) have an effective joint appointment as:
- Associate Professor of Mathematics with specialisation in Data Science at Department of Mathematics, Uppsala University, Uppsala, Sweden and
- Director, Technical Strategy and Research at Combient Mix AB, Stockholm, Sweden
Quick links on Raaz's background:
Industrial Case Study
We will see an industrial case-study that will illustrate a concrete data science process in action in the sequel.
What is the Data Science Process?
The Data Science Process in one picture
What is scalable data science and distributed machine learning?
Scalability merely refers to the ability of the data science process to scale to massive datasets (popularly known as big data).
For this we need distributed fault-tolerant computing typically over large clusters of commodity computers -- the core infrastructure in a public cloud today.
Distributed Machine Learning allows the models in the data science process to be scalably trained and extract value from big data.
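As a small, concrete illustration (a hedged sketch, not course material): in Apache Spark, the engine used throughout this course, a computation scales because the data is partitioned across the machines of a cluster and each partition is processed in parallel. The input path below is a placeholder assumption; on Databricks the `spark` session is already provided and the builder call simply reuses it.

```scala
import org.apache.spark.sql.SparkSession

// On Databricks a SparkSession named `spark` already exists; elsewhere this builds one.
val spark = SparkSession.builder.appName("scalability-sketch").getOrCreate()

// The dataset is read as a distributed collection of lines: each partition lives on,
// and is processed by, a different executor in the cluster.
val lines = spark.read.textFile("/path/to/a/large/dataset.txt") // placeholder path

// A simple distributed aggregation: every executor filters and counts its own
// partitions in parallel; only the small per-partition counts travel back to the driver.
val hits = lines.filter(line => line.contains("data")).count()
println(s"lines mentioning 'data': $hits")
```

Adding more worker nodes lets the same few lines of code handle proportionally more data, which is what scalability means here.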
What is Data Science?
It is increasingly accepted that Data Science
is an inter-disciplinary field that uses scientific methods, processes, algorithms and systems to extract knowledge and insights from structured and unstructured data. Data science is related to data mining, machine learning and big data.
Data science is a "concept to unify statistics, data analysis and their related methods" in order to "understand and analyze actual phenomena" with data. It uses techniques and theories drawn from many fields within the context of mathematics, statistics, computer science, domain knowledge and information science. Turing award winner Jim Gray imagined data science as a "fourth paradigm" of science (empirical, theoretical, computational and now data-driven) and asserted that "everything about science is changing because of the impact of information technology" and the data deluge.
Now, let us look at two industrially-informed academic papers that influence the above quote on what is Data Science, but with a view towards the contents and syllabus of this course.
Key insights in the above paper
- Data Science is the study of the generalizable extraction of knowledge from data.
- A common epistemic requirement in assessing whether new knowledge is actionable for decision making is its predictive power, not just its ability to explain the past.
- A data scientist requires an integrated skill set spanning
- mathematics,
- machine learning,
- artificial intelligence,
- statistics,
- databases, and
- optimization,
- along with a deep understanding of the craft of problem formulation to engineer effective solutions.
Key insights in the above paper
- ML is concerned with the building of computers that improve automatically through experience
- ML lies at the intersection of computer science and statistics and at the core of artificial intelligence and data science
- Recent progress in ML is due to:
- development of new algorithms and theory
- ongoing explosion in the availability of online data
- availability of low-cost computation (through clusters of commodity hardware in the *cloud*)
- The adoption of data science and ML methods is leading to more evidence-based decision-making across:
- health sciences (neuroscience research)
- manufacturing
- robotics (autonomous vehicles)
- vision, speech processing, natural language processing
- education
- financial modeling
- policing
- marketing
But what is Data Engineering (including Machine Learning Engineering and Operations) and how does it relate to Data Science?
Data Engineering
There are several views on what a data engineer is supposed to do:
Some views are rather narrow and emphasise division of labour between data engineers and data scientists:
- https://www.oreilly.com/ideas/data-engineering-a-quick-and-simple-definition
- Let's check out what skills a data engineer is expected to have according to the link above.
"Ian Buss, principal solutions architect at Cloudera, notes that data scientists focus on finding new insights from a data set, while data engineers are concerned with the production readiness of that data and all that comes with it: formats, scaling, resilience, security, and more."
What skills do data engineers need? Those “10-30 different big data technologies” Anderson references in “Data engineers vs. data scientists” can fall under numerous areas, such as file formats, ingestion engines, stream processing, batch processing, batch SQL, data storage, cluster management, transaction databases, web frameworks, data visualizations, and machine learning. And that’s just the tip of the iceberg.
Buss says data engineers should have the following skills and knowledge:
- They need to know Linux and they should be comfortable using the command line.
- They should have experience programming in at least Python or Scala/Java.
- They need to know SQL.
- They need some understanding of distributed systems in general and how they are different from traditional storage and processing systems.
- They need a deep understanding of the ecosystem, including ingestion (e.g. Kafka, Kinesis), processing frameworks (e.g. Spark, Flink) and storage engines (e.g. S3, HDFS, HBase, Kudu). They should know the strengths and weaknesses of each tool and what it's best used for.
- They need to know how to access and process data.
Let's dive deeper into such highly compartmentalised views of data engineers and data scientists and the so-called "machine learning engineers", according to the following view:
- https://www.oreilly.com/ideas/data-engineers-vs-data-scientists
The Data Engineering Scientist as "The Middle Way"
Here are some basic axioms that should be self-evident.
- Yes, there are differences in skillsets across humans
- some humans will be better and have inclinations for engineering and others for pure mathematics by nature and nurture
- one human cannot easily be a master of everything needed to innovate a new data-based product or service (though, very rarely, this does happen)
- Skills can be gained by any human who wants to learn to the extent s/he is able to expend time, energy, etc.
For the Scalable Data Engineering Science Process -- towards Production-Ready and Productisable Prototyping for the Data-based Factory -- we need to allow each data engineer to be more of a data scientist and each data scientist to be more of a data engineer, up to each individual's comfort zone in the technical, mathematical/conceptual and time-availability planes, but with some minimal expectations of mutual appreciation.
This course is designed to help you take the first minimal steps towards such a data engineering science.
In the sequel it will become apparent why a team of data engineering scientists with skills across the conventional (2021) spectrum of data engineer versus data scientist is crucial for Production-Ready and Productisable Prototyping for the Data-based Factory, whose outputs include standard AI products today.
Standing on shoulders of giants!
This course was originally structured from two other edX courses from 2015. Unfortunately, these courses and their content, including video lectures and slides, are no longer openly available.
- BerkeleyX/CS100-1x, Introduction to Big Data Using Apache Spark by Anthony A Joseph, Chancellor's Professor, UC Berkeley
- BerkeleyX/CS190-1x, Scalable Machine Learning by Ameet Talwalkar, Asst. Prof., UC Los Angeles
This course will be an expanded and up-to-date Scala version with an emphasis on an individualized course project, as opposed to completing labs that test auto-gradeable syntactic skills.
We will also be borrowing more theoretical aspects from the following course:
Note the Expected Reference Readings above for this course.
A Brief Tour of Data Science
History of Data Analysis and Where Does "Big Data" Come From?
- A Brief History and Timeline of Data Analysis and Big Data: https://whatis.techtarget.com/feature/A-history-and-timeline-of-big-data

Where does Data Come From?

Some of the sources of big data:
- online click-streams (a lot of it is recorded but a tiny amount is analyzed):
- record every click
- every ad you view
- every billing event,
- every transaction, every network message, and every fault.
- User-generated content (on web and mobile devices):
- every post that you make on Facebook
- every picture sent on Instagram
- every review you write for Yelp or TripAdvisor
- every tweet you send on Twitter
- every video that you post to YouTube.
- Science (for scientific computing):
- data from various repositories for natural language processing:
- Wikipedia,
- the Library of Congress,
- the Twitter firehose, Google n-grams and digital archives,
- data from scientific instruments/sensors/computers:
- the Large Hadron Collider (more data in a year than all the other data sources combined!)
- genome sequencing data (sequencing cost is dropping much faster than Moore's Law!)
- output of high-performance computers (super-computers) for data fusion, estimation/prediction and exploratory data analysis
- Graphs are also an interesting source of big data (network science).
- social networks (collaborations, followers, fb-friends or other relationships),
- telecommunication networks,
- computer networks,
- road networks
- machine logs:
- by servers around the internet (hundreds of millions of machines out there!)
- internet of things.
Data Science with Cloud Computing and What's Hard about it?
- See Cloud Computing to understand the work-horse for analysing big data at data centers
Cloud computing is the on-demand availability of computer system resources, especially data storage (cloud storage) and computing power, without direct active management by the user. Large clouds often have functions distributed over multiple locations, each location being a data center. Cloud computing relies on sharing of resources to achieve coherence and economies of scale, typically using a "pay-as-you-go" model which can help in reducing capital expenses but may also lead to unexpected operating expenses for unaware users.
In fact, if you are logged into https://*.databricks.com/* you are computing in the cloud! So the computations are actually running on an instance of the hardware available at a data center like the following.

Here is a data center used by CERN in 2010.
What's hard about scalable data science in the cloud?
- To analyse datasets that are big, say more than a few TBs, we need to split the data and put it on several computers that are networked -- *a typical cloud*.
- However, as the number of computer nodes in such a network increases, the probability of a hardware failure or fault (say, a hard-disk, memory, CPU or switch breaking down) also increases, and such faults can happen while the computation is being performed.
- Therefore, for scalable data science, i.e., data science that can scale with the size of the input data by adding more computer nodes, we need a fault-tolerant computing and storage framework at the software level to ensure the computations finish even if there are hardware faults (see the sketch below).
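Here is a minimal sketch of what that software-level fault tolerance looks like in Spark's Scala API. It assumes a running `SparkContext` named `sc` (provided automatically on Databricks); the numbers and the partition count are arbitrary illustrations.

```scala
// A resilient distributed dataset (RDD) records its lineage -- the transformations
// that produced it -- so lost partitions can be recomputed on healthy nodes.
val nums = sc.parallelize(1L to 1000000L, numSlices = 64) // 64 partitions spread over the cluster

// Transformations are lazy: this only records the recipe "square every element".
val squares = nums.map(x => x * x)

// An action triggers distributed execution; if an executor dies mid-way, only its
// partitions are rebuilt from the lineage and the job still completes.
val total = squares.reduce(_ + _)
println(s"sum of squares: $total")
```

The key point is that fault tolerance comes from the framework's bookkeeping (the lineage), not from anything the user's code has to do.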
Here is a recommended light reading on What is "Big Data" -- Understanding the History (18 minutes): - https://towardsdatascience.com/what-is-big-data-understanding-the-history-32078f3b53ce
What should you be able to do at the end of this course?
By following these online interactions in the form of lab/lectures, asking questions, engaging in discussions, doing HOMEWORK assignments and completing the group project, you should be able to:
- Understand the principles of fault-tolerant scalable computing in Spark
- in-memory and generic DAG extensions of Map-reduce
- resilient distributed datasets for fault-tolerance
- skills to process today's big data using state-of-the-art techniques in Apache Spark 3.0, in terms of:
- hands-on coding with realistic datasets
- an intuitive understanding of the ideas behind the technology and methods
- pointers to academic papers in the literature, technical blogs and video streams for you to further your theoretical understanding.
- More concretely, you will be able to:
- Extract, Transform, Load, Interact, Explore and Analyze Data
- Build Scalable Machine Learning Pipelines (or help build them) using Distributed Algorithms and Optimization (see the pipeline sketch at the end of this section)
- How to keep up?
- This is a fast-changing world.
- Recent videos around Apache Spark are archived here (these videos are a great way to learn the latest happenings in industrial R&D today!):
- What is mathematically stable in the world of 'big data'?
- There is a growing body of work on the analysis of parallel and distributed algorithms, the work-horse of big data and AI.
- We will see some of this in a theoretical module later, but the immediate focus here is on how to write programs and analyze data.
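To close this introduction with a small, concrete taste of "how to write programs and analyze data" at scale, here is a hedged sketch of the kind of scalable machine learning pipeline you will learn to build: a toy text classifier assembled from reusable Spark ML stages (Scala, Spark 3.x). The toy data, column names and parameter values are illustrative assumptions only, not part of any assignment.

```scala
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.feature.{HashingTF, Tokenizer}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder.appName("pipeline-sketch").getOrCreate()
import spark.implicits._

// A tiny toy training set; in the course this would be a large distributed dataset.
val training = Seq(
  (0L, "spark is scalable and fault tolerant", 1.0),
  (1L, "I had porridge for breakfast", 0.0),
  (2L, "distributed machine learning with spark", 1.0),
  (3L, "the weather was pleasant today", 0.0)
).toDF("id", "text", "label")

// Pipeline stages: split text into words, hash words into feature vectors, fit a classifier.
val tokenizer = new Tokenizer().setInputCol("text").setOutputCol("words")
val hashingTF = new HashingTF().setInputCol("words").setOutputCol("features").setNumFeatures(1000)
val lr = new LogisticRegression().setMaxIter(10).setRegParam(0.01)
val pipeline = new Pipeline().setStages(Array(tokenizer, hashingTF, lr))

// Fitting and prediction are distributed over the cluster by Spark.
val model = pipeline.fit(training)
model.transform(training).select("text", "prediction").show(false)
```

The particular classifier is beside the point: what matters is that every stage is a distributed computation, so the same pipeline scales from a toy example to big data by adding nodes.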