000_scalableDataEngineeringScience(Scala)

A bit about your instructor:

I, Raazesh Sainudiin or Raaz, will be your instructor for the course in data science. I have

  • more than 14 years of academic research experience in applied mathematics and statistics, and
  • nearly two to four years of part-time and full-time experience in the data industry.

I currently (2019) have an effective joint appointment as:

Quick links on Raaz's background:

What is Scalable Data Science in one picture?

what is sds?


The Scalable Data Engineering Science Process:

Towards Production-Ready and Productisable Prototyping for the Content Factory

This can be summarised in Andrew Morgan's image of the Content Factory:

Andrew Morgan's Content Factory

Source: Vasant Dhar, Data Science and Prediction, Communications of the ACM, Vol. 56 (1). p. 64, DOI:10.1145/2500499

key insights

  • Data Science is the study of the generalizable extraction of knowledge from data.
  • A common epistemic requirement in assessing whether new knowledge is actionable for decision making is its predictive power, not just its ability to explain the past.
  • A data scientist requires an integrated skill set spanning
    • mathematics,
    • machine learning,
    • artificial intelligence,
    • statistics,
    • databases, and
    • optimization,
    • along with a deep understanding of the craft of problem formulation to engineer effective solutions.

Source: Machine learning: Trends, perspectives, and prospects, M. I. Jordan, T. M. Mitchell, Science 17 Jul 2015: Vol. 349, Issue 6245, pp. 255-260, DOI: 10.1126/science.aaa8415

key insights

  • ML is concerned with the building of computers that improve automatically through experience
  • ML lies at the intersection of computer science and statistics and at the core of artificial intelligence and data science
  • Recent progress in ML is due to:
    • development of new algorithms and theory
    • ongoing explosion in the availability of online data
    • availability of low-cost computation (through clusters of commodity hardware in the cloud)
  • The adoption of data science and ML methods is leading to more evidence-based decision-making across:
    • health sciences (e.g., neuroscience research)
    • manufacturing
    • robotics (e.g., autonomous vehicles)
    • vision, speech processing, natural language processing
    • education
    • financial modeling
    • policing
    • marketing

Data Engineering

There are several views on what a data engineer is supposed to do:

Some views are rather narrow and emphasise division of labour between data engineers and data scientists:

"Ian Buss, principal solutions architect at Cloudera, notes that data scientists focus on finding new insights from a data set, while data engineers are concerned with the production readiness of that data and all that comes with it: formats, scaling, resilience, security, and more."

What skills do data engineers need? Those “10-30 different big data technologies” Anderson references in “Data engineers vs. data scientists” can fall under numerous areas, such as file formats, ingestion engines, stream processing, batch processing, batch SQL, data storage, cluster management, transaction databases, web frameworks, data visualizations, and machine learning. And that’s just the tip of the iceberg.

Buss says data engineers should have the following skills and knowledge:

  • They need to know Linux and they should be comfortable using the command line.
  • They should have experience programming in at least Python or Scala/Java.
  • They need to know SQL.
  • They need some understanding of distributed systems in general and how they are different from traditional storage and processing systems.
  • They need a deep understanding of the ecosystem, including ingestion (e.g. Kafka, Kinesis), processing frameworks (e.g. Spark, Flink) and storage engines (e.g. S3, HDFS, HBase, Kudu). They should know the strengths and weaknesses of each tool and what it's best used for.
  • They need to know how to access and process data.

Let's dive deeper into such highly compartmentalised views of data engineers, data scientists and the so-called "machine learning engineers", according to the view embedded below.


The Data Engineering Scientist as "The Middle Way"

Here are some basic axioms that should be self-evident.

  • Yes, there are differences in skillsets across humans:
    • by nature and nurture, some humans are better at and more inclined towards engineering, while others lean towards pure mathematics
    • only very rarely can one human master everything needed to innovate a new data-based product or service
  • Skills can be gained by any human who wants to learn, to the extent that s/he is able to expend the time, energy, etc.

For the Scalable Data Engineering Science Process (towards Production-Ready and Productisable Prototyping for the Data Factory), we need to allow each data engineer to become more of a data scientist, and each data scientist to become more of a data engineer, up to each individual's comfort zone in the technical, mathematical/conceptual and time-availability planes, but with some minimal expectations of mutual appreciation.

This course is designed to help you take the first minimal steps towards data engineering science.

In the sequel it will become apparent why a team of data engineering scientists with skills across the conventional (2019) spectrum of data engineer versus data scientist is crucial for Production-Ready and Productisable Prototyping for the Data Factory.

Standing on shoulders of giants!

This course will build on two other edX courses where needed.

We encourage you to take these courses if you have more time. For those of you (including the course coordinator) who have taken these courses formally in 2015, this course will be an expanded Scala version with an emphasis on an individualized course project, as opposed to completing labs that test syntactic skills.

We will also be borrowing more theoretical aspects from the following course:

The first two recommended readings below are already somewhat outdated, the third one is advanced but current, and the fourth one is in progress:

  • Learning Spark: Lightning-Fast Data Analytics, Holden Karau, Andy Konwinski, Patrick Wendell, and Matei Zaharia, O'Reilly, 2015.
  • Advanced Analytics with Spark: Patterns for Learning from Data at Scale, Sandy Ryza, Uri Laserson, Sean Owen, and Josh Wills, O'Reilly, 2015.
  • High Performance Spark: Best Practices for Scaling and Optimizing Apache Spark, Holden Karau and Rachel Warren, O'Reilly, 2017.
  • Mastering Spark for Data Science, Andrew Morgan, Antoine Amend, David George, and Matthew Hallett, Packt Publishing, 2017.
  • Spark: The Definitive Guide: Big Data Processing Made Simple, Bill Chambers and Matei Zaharia, O'Reilly, 2018.

How will you be assessed?

There will be minimal exercises and an open mini-project that you can do and present briefly to us.

You will also be working in a small team on a bigger project that this training will prepare you for. Hopefully, this project can be turned into an asset.

A Brief History of Data Analysis and Where Does "Big Data" Come From?

by Anthony Joseph in BerkeleyX/CS100.1x

  • (watch now 1:53): A Brief History of Data Analysis
    • A Brief History of Data Analysis by Anthony Joseph in BerkeleyX/CS100.1x
  • (watch now 5:05): Where does Data Come From?
    • Where Does Data Come From by Anthony Joseph in BerkeleyX/CS100.1x
    • SUMMARY of some of the sources of big data:
      • online click-streams (a lot of it is recorded but a tiny amount is analyzed):
        • record every click
        • every ad you view
        • every billing event,
        • every transaction, every network message, and every fault.
      • User-generated content (on web and mobile devices):
        • every post that you make on Facebook
        • every picture sent on Instagram
        • every review you write for Yelp or TripAdvisor
        • every tweet you send on Twitter
        • every video that you post to YouTube.
      • Science (for scientific computing):
        • data from various repositories for natural language processing:
          • Wikipedia,
          • the Library of Congress,
          • the Twitter firehose, Google n-grams and digital archives,
        • data from scientific instruments/sensors/computers:
          • the Large Hadron Collider (more data in a year than all the other data sources combined!)
          • genome sequencing data (sequencing cost is dropping much faster than Moore's Law!)
          • output of high-performance computers (super-computers) for data fusion, estimation/prediction and exploratory data analysis
      • Graphs are also an interesting source of big data (network science):
        • social networks (collaborations, followers, fb-friends or other relationships),
        • telecommunication networks,
        • computer networks,
        • road networks
      • machine logs:
        • by servers around the internet (hundreds of millions of machines out there!)
        • internet of things.

Data Science Defined, Cloud Computing and What's Hard About Data Science?

by Anthony Joseph in BerkeleyX/CS100.1x

  • (watch now 2:03): Data Science Defined
    • Data Science Defined by Anthony Joseph in BerkeleyX/CS100.1x
  • (watch now 1:11): Cloud Computing
    • Cloud Computing by Anthony Joseph in BerkeleyX/CS100.1x
    • In fact, if you are logged into https://*.databricks.com/* you are computing in the cloud!
    • The Scalable Data Science course is supported by Databricks Academic Partners Program and the AWS Educate Grant to University of Canterbury (applied for by Raaz Sainudiin in 2015).
  • (watch now 3:31): What's hard about data science
    • What's hard about data science by Anthony Joseph in BerkeleyX/CS100.1x

(watch later 0:52): What is Data Science? According to a Udacity Course.

What is Data Science? Udacity Course

What should you be able to do at the end of this course?

  • by following these sessions and doing some HOMEWORK assignments.

Understand the principles of fault-tolerant scalable computing in Spark

  • in-memory and generic DAG extensions of MapReduce
  • resilient distributed datasets for fault-tolerance (see the minimal sketch after this list)
  • skills to process today's big data using state-of-the-art techniques in Apache Spark 2.2, in terms of:
    • hands-on coding with real datasets
    • an intuitive (non-mathematical) understanding of the ideas behind the technology and methods
    • pointers to academic papers in the literature, technical blogs and video streams for you to further your theoretical understanding.
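
To make these principles concrete, below is a minimal sketch in Scala, assuming a Spark 2.x SparkSession named `spark` (as provided in Databricks notebooks): transformations lazily build a DAG, an action executes it, and the recorded lineage is what makes the computation fault-tolerant.

```scala
// A minimal sketch of Spark's lazy, DAG-based, fault-tolerant model.
// Assumes a SparkSession named `spark` (as provided in Databricks notebooks).
val rdd = spark.sparkContext.parallelize(1 to 1000000) // a resilient distributed dataset (RDD)

// Transformations only extend the DAG of the computation; nothing runs yet.
val squaresOfEvens = rdd.filter(_ % 2 == 0).map(x => x.toLong * x)

// Caching keeps the dataset in cluster memory for fast reuse; if an executor
// is lost, the lineage recorded in the DAG lets Spark recompute only the
// missing partitions.
squaresOfEvens.cache()

// An action triggers execution of the whole DAG across the cluster.
val total = squaresOfEvens.reduce(_ + _)
println(s"sum of squares of evens = $total")
println(squaresOfEvens.toDebugString) // prints the lineage (the DAG)
```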

More concretely, you will be able to:

Part 1: Days 1-4 of training (focused on data engineering)

1.1 Extract, Transform, Load, Interact, Explore and Analyze Data

(watch later) Exploring Apache Web Logs (semi-structured data)

Databricks jump start

(watch later) Exploring Wikipedia Click Streams (structured data)

Michael Armbrust Spark Summit East
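
To give a flavour of 1.1, here is a minimal hypothetical ETL-and-explore sketch in Scala; the file path and the `status` column are made up for illustration, and a SparkSession named `spark` is assumed:

```scala
import org.apache.spark.sql.functions._

// Extract: read a (hypothetical) CSV of web-log-like records.
val logsDF = spark.read
  .option("header", "true")
  .option("inferSchema", "true")
  .csv("/tmp/weblogs.csv") // hypothetical path and schema

// Transform: clean rows and derive a new column with the DataFrame API.
val cleanDF = logsDF
  .filter(col("status").isNotNull)
  .withColumn("isError", col("status") >= 400)

// Interact/Explore: register a temporary view and query it with SQL.
cleanDF.createOrReplaceTempView("logs")
spark.sql("SELECT status, COUNT(*) AS hits FROM logs GROUP BY status ORDER BY hits DESC").show()
```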

1.2 ETL and SQL on Graphs or network data
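
One convenient way to do ETL and SQL-style queries on graph or network data in Spark is the GraphFrames library. Below is a minimal sketch with a made-up toy social graph; it assumes the graphframes package is attached to the cluster and a SparkSession named `spark`:

```scala
import org.apache.spark.sql.functions.desc
import org.graphframes.GraphFrame

// Vertices need an "id" column; edges need "src" and "dst" columns.
val v = spark.createDataFrame(Seq(
  ("a", "Alice"), ("b", "Bob"), ("c", "Carol")
)).toDF("id", "name")
val e = spark.createDataFrame(Seq(
  ("a", "b", "follows"), ("b", "c", "follows"), ("c", "a", "follows")
)).toDF("src", "dst", "relationship")

val g = GraphFrame(v, e)

// Graph queries compose with ordinary DataFrame/SQL operations:
g.inDegrees.orderBy(desc("inDegree")).show() // who is followed the most?
println(g.edges.filter("relationship = 'follows'").count())
```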

1.3 Working with Structured Streaming Data
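
As a first taste of structured streaming, here is a minimal self-contained sketch in Scala using Spark's built-in rate source, so no external data feed is needed; the console sink and the timeout are chosen only for illustration:

```scala
import org.apache.spark.sql.functions._

// A self-contained sketch using Spark's built-in rate source, which
// generates (timestamp, value) rows at a fixed rate.
val streamDF = spark.readStream
  .format("rate")
  .option("rowsPerSecond", "10")
  .load()

// A streaming aggregation: count events per 10-second (tumbling) window.
val counts = streamDF
  .groupBy(window(col("timestamp"), "10 seconds"))
  .count()

// Write the running counts to the console sink (for demos only).
val query = counts.writeStream
  .outputMode("complete")
  .format("console")
  .start()

query.awaitTermination(30000) // let the sketch run for ~30 seconds
```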

Part 2: Days 5-6 of training

Computational, Mathematical and Statistical Foundations for Data Scientists and Engineers

Here we will use SageMath to get engineers and scientists on the same mathematical page starting from set theory, axiomatic probability theory, statistical decision theory, pseudorandom number generators from first principles, simulation of random variables and random structures including graphs, convergence of random variables, weak law of large numbers, central limit theorem, estimators, and hypothesis tests (parametric and nonparametric) and the principles of statistical learning theory (the mathematics behind machine learning).
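
For orientation, two of the classical convergence results listed above can be stated precisely as follows (standard statements, written in LaTeX):

```latex
% Weak law of large numbers: for i.i.d. X_1, X_2, ... with E[X_1] = \mu,
% the sample mean converges in probability to \mu:
\bar{X}_n := \frac{1}{n}\sum_{i=1}^{n} X_i, \qquad
\lim_{n \to \infty} \Pr\!\left( \left| \bar{X}_n - \mu \right| > \epsilon \right) = 0
\quad \text{for every } \epsilon > 0.

% Central limit theorem: if additionally Var(X_1) = \sigma^2 < \infty, then
% the standardised sample mean converges in distribution to a standard normal:
\sqrt{n}\, \frac{\bar{X}_n - \mu}{\sigma} \xrightarrow{\;d\;} N(0, 1).
```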

Part 3: Days 7-12 of training

3. Build Scalable Machine Learning Pipelines (or help build them)

Apply standard learning methods via scalably servable end-to-end industrial ML pipelines

ETL, model, validate, test, re-ETL (feature re-engineering), model, validate, test, ..., serve model to clients (a minimal pipeline sketch follows the list below)

(we will choose from this list for training days 7-12)
  • Supervised Learning Methods: Regression /Classification
  • Unsupervised Learning Methods: Clustering
  • Recommendation systems
  • Streaming
  • Graph processing
  • Geospatial data-processing
  • Topic modeling
  • Deep Learning
  • ...

Part 4: Day 13 is Open-Surgery of Production-ready Prototypes/Projects



(watch later) Spark Summit 2015 demo: Creating an end-to-end machine learning data pipeline with Databricks (Live Sentiment Analysis)

Ali G's Live Sentiment Analysis

(watch later) Spark Summit 2017 - Expanding Apache Spark Use Cases in 2.x and Beyond - Matei Zaharia, Tim Hunter & Michael Armbrust - Deep Learning and Structured Streaming

Expanding Apache Spark Use Cases in 2.2 and Beyond - Matei Zaharia, Tim Hunter & Michael Armbrust - Spark Summit 2017 - Deep Learning and Structured Streaming

Recent videos are archived here (these videos are a great way to spend lunch with your mates!):

Navigate to the bottom of the next embed and click on the video archives link.



65 minutes of 90 minutes are up!

EXTRA: Databases Versus Data Science

by Anthony Joseph in BerkeleyX/CS100.1x

  • (watch later 2:31): Why all the excitement about Big Data Analytics? (using Google search to now-cast Google Flu Trends)
    • A Brief History of Data Analysis by Anthony Joseph in BerkeleyX/CS100.1x
  • other interesting big data examples: recommender systems and the Netflix Prize?

  • (watch later 10:41): Contrasting data science with traditional databases, ML, Scientific computing

    • Data Science Database Contrast by Anthony Joseph in BerkeleyX/CS100.1x
    • SUMMARY:
      • traditional databases versus data science
        • preciousness versus cheapness of the data
        • ACID and eventual consistency, CAP theorem, ...
        • interactive querying: SQL versus noSQL
        • querying the past versus querying/predicting the future
      • traditional scientific computing versus data science
        • science-based or mechanistic models versus data-driven black-box (deep-learning) statistical models (of course both schools co-exist)
        • super-computers in traditional science-based models versus cluster of commodity computers
      • traditional ML versus data science
        • smaller amounts of clean data in traditional ML versus massive amounts of dirty data in data science
        • traditional ML researchers try to publish academic papers versus data scientists try to produce actionable intelligent systems
  • (watch later 1:49): Three Approaches to Data Science
    • Approaches to Data Science by Anthony Joseph in BerkeleyX/CS100.1x
  • (watch later 4:29): Performing Data Science and Preparing Data, Data Acquisition and Preparation, ETL, ...
    • Data Science Database Contrast by Anthony Joseph in BerkeleyX/CS100.1x
  • (watch later 2:01): Four Examples of Data Science Roles
    • Data Science Roles by Anthony Joseph in BerkeleyX/CS100.1x
    • SUMMARY of Data Science Roles.
      • individual roles:
        1. business person
        2. programmer
      • organizational roles:
        1. enterprise
        2. web company
    • Each role has its own unique set of:
      • data sources
      • Extract-Transform-Load (ETL) process
      • business intelligence and analytics tools
    • Most Maths/Stats/Computing programs cater to the programmer role
      • NumPy and Matplotlib, R, MATLAB, and Octave.