000_scalableDataEngineeringScience(Scala)

A bit about your instructor:

I, Raazesh Sainudiin or Raaz, will be your instructor for the course in data science. I have

  • more than 14 years of academic research experience in applied mathematics and statistics, and
  • nearly two to four years of part-time and full-time experience in the data industry.

I currently (2019) have an effective joint appointment as:

Quick links on Raaz's background:

What is Scalable Data Science in one picture?

what is sds?


The Scalable Data Engineering Science Process:

Towards Production-Ready and Productisable Prototyping for the Content Factory

This can be summarised in Andrew Morgan's image of the Content Factory:

Andrew Morgan's Content Factory

Source: Vasant Dhar, Data Science and Prediction, Communications of the ACM, Vol. 56 (1). p. 64, DOI:10.1145/2500499

key insights

  • Data Science is the study of the generalizable extraction of knowledge from data.
  • A common epistemic requirement in assessing whether new knowledge is actionable for decision making is its predictive power, not just its ability to explain the past.
  • A data scientist requires an integrated skill set spanning
    • mathematics,
    • machine learning,
    • artificial intelligence,
    • statistics,
    • databases, and
    • optimization,
    • along with a deep understanding of the craft of problem formulation to engineer effective solutions.

Source: Machine learning: Trends, perspectives, and prospects, M. I. Jordan, T. M. Mitchell, Science 17 Jul 2015: Vol. 349, Issue 6245, pp. 255-260, DOI: 10.1126/science.aaa8415

key insights

  • ML is concerned with the building of computers that improve automatically through experience
  • ML lies at the intersection of computer science and statistics and at the core of artificial intelligence and data science
  • Recent progress in ML is due to:
    • development of new algorithms and theory
    • ongoing explosion in the availability of online data
    • availability of low-cost computation (through clusters of commodity hardware in the cloud)
  • The adoption of data science and ML methods is leading to more evidence-based decision-making across:
    • health sciences (e.g., neuroscience research)
    • manufacturing
    • robotics (e.g., autonomous vehicles)
    • vision, speech processing, natural language processing
    • education
    • financial modeling
    • policing
    • marketing

Data Engineering

There are several views on what a data engineer is supposed to do:

Some views are rather narrow and emphasise division of labour between data engineers and data scientists:

"Ian Buss, principal solutions architect at Cloudera, notes that data scientists focus on finding new insights from a data set, while data engineers are concerned with the production readiness of that data and all that comes with it: formats, scaling, resilience, security, and more."

What skills do data engineers need? Those “10-30 different big data technologies” Anderson references in “Data engineers vs. data scientists” can fall under numerous areas, such as file formats, ingestion engines, stream processing, batch processing, batch SQL, data storage, cluster management, transaction databases, web frameworks, data visualizations, and machine learning. And that’s just the tip of the iceberg.

Buss says data engineers should have the following skills and knowledge:

  • They need to know Linux and they should be comfortable using the command line.
  • They should have experience programming in at least Python or Scala/Java.
  • They need to know SQL.
  • They need some understanding of distributed systems in general and how they are different from traditional storage and processing systems.
  • They need a deep understanding of the ecosystem, including ingestion (e.g. Kafka, Kinesis), processing frameworks (e.g. Spark, Flink) and storage engines (e.g. S3, HDFS, HBase, Kudu). They should know the strengths and weaknesses of each tool and what it's best used for.
  • They need to know how to access and process data.

Let's dive deeper into such highly compartmentalised views of data engineers, data scientists and the so-called "machine learning engineers", according to the view embedded below.


The Data Engineering Scientist as "The Middle Way"

Here are some basic axioms that should be self-evident.

  • Yes, there are differences in skillsets across humans:
    • by nature and nurture, some humans are better at and more inclined towards engineering, while others lean towards pure mathematics
    • only very rarely can one human master everything needed to innovate a new data-based product or service
  • Skills can be gained by any human who wants to learn, to the extent that s/he is able to expend the time, energy, etc.

For the Scalable Data Engineering Science Process (towards Production-Ready and Productisable Prototyping for the Data Factory), we need to allow each data engineer to become more of a data scientist, and each data scientist to become more of a data engineer, up to each individual's comfort zone in the technical, mathematical/conceptual and time-availability planes, but with some minimal expectations of mutual appreciation.

This course is designed to help you take the first minimal steps towards data engineering science.

In the sequel it will become apparent why a team of data engineering scientists with skills across the conventional (2019) spectrum of data engineer versus data scientist is crucial for Production-Ready and Productisable Prototyping for the Data Factory.

Standing on shoulders of giants!

This course will build on two other edX courses where needed.

We encourage you to take these courses if you have more time. For those of you (including the course coordinator) who have taken these courses formally in 2015, this course will be an expanded Scala version with an emphasis on an individualized course project, as opposed to completing labs that test syntactic skills.

We will also be borrowing more theoretical aspects from the following course:

The first two recommended readings below are already somewhat outdated, the third one is advanced but current, and the fourth one is in progress:

  • Learning Spark: Lightning-Fast Data Analytics, Holden Karau, Andy Konwinski, Patrick Wendell, and Matei Zaharia, O'Reilly, 2015.
  • Advanced Analytics with Spark: Patterns for Learning from Data at Scale, Sandy Ryza, Uri Laserson, Sean Owen, and Josh Wills, O'Reilly, 2015.
  • High Performance Spark: Best Practices for Scaling and Optimizing Apache Spark, Holden Karau and Rachel Warren, O'Reilly, 2017.
  • Mastering Spark for Data Science, Andrew Morgan, Antoine Amend, David George, and Matthew Hallett, Packt Publishing, 2017.
  • Spark: The Definitive Guide: Big Data Processing Made Simple, Bill Chambers and Matei Zaharia, O'Reilly, 2018.

How will you be assessed?

There will be minimal exercises and an open mini-project that you can do and present briefly to us.

You will also be working in a small team on a bigger project that this training will prepare you for. Hopefully, this project can be turned into an asset.

A Brief History of Data Analysis and Where Does "Big Data" Come From?

by Anthony Joseph in BerkeleyX/CS100.1x

  • (watch now 1:53): A Brief History of Data Analysis
    • A Brief History of Data Analysis by Anthony Joseph in BerkeleyX/CS100.1x
  • (watch now 5:05): Where does Data Come From?
    • Where Does Data Come From by Anthony Joseph in BerkeleyX/CS100.1x
    • SUMMARY of some of the sources of big data:
      • online click-streams (a lot of it is recorded but a tiny amount is analyzed):
        • record every click
        • every ad you view
        • every billing event,
        • every transaction, every network message, and every fault.
      • User-generated content (on web and mobile devices):
        • every post that you make on Facebook
        • every picture sent on Instagram
        • every review you write for Yelp or TripAdvisor
        • every tweet you send on Twitter
        • every video that you post to YouTube.
      • Science (for scientific computing):
        • data from various repositories for natural language processing:
          • Wikipedia,
          • the Library of Congress,
          • the Twitter firehose, Google n-grams and digital archives,
        • data from scientific instruments/sensors/computers:
          • the Large Hadron Collider (more data in a year than all the other data sources combined!)
          • genome sequencing data (sequencing cost is dropping much faster than Moore's Law!)
          • output of high-performance computers (super-computers) for data fusion, estimation/prediction and exploratory data analysis
      • Graphs are also an interesting source of big data (network science):
        • social networks (collaborations, followers, fb-friends or other relationships),
        • telecommunication networks,
        • computer networks,
        • road networks
      • machine logs:
        • by servers around the internet (hundreds of millions of machines out there!)
        • internet of things.

Data Science Defined, Cloud Computing and What's Hard About Data Science?

by Anthony Joseph in BerkeleyX/CS100.1x

  • (watch now 2:03): Data Science Defined
    • Data Science Defined by Anthony Joseph in BerkeleyX/CS100.1x
  • (watch now 1:11): Cloud Computing
    • Cloud Computing by Anthony Joseph in BerkeleyX/CS100.1x
    • In fact, if you are logged into https://*.databricks.com/* you are computing in the cloud!
    • The Scalable Data Science course is supported by Databricks Academic Partners Program and the AWS Educate Grant to University of Canterbury (applied for by Raaz Sainudiin in 2015).
  • (watch now 3:31): What's hard about data science
    • What's hard about data science by Anthony Joseph in BerkeleyX/CS100.1x

(watch later 0:52): What is Data Science? According to a Udacity Course.

What is Data Science? Udacity Course

What should you be able to do at the end of this course?

  • by following these sessions and doing some HOMEWORK assignments.

Understand the principles of fault-tolerant scalable computing in Spark

  • in-memory and generic DAG extensions of MapReduce
  • resilient distributed datasets for fault-tolerance (see the minimal sketch after this list)
  • skills to process today's big data using state-of-the-art techniques in Apache Spark 2.2, in terms of:
    • hands-on coding with real datasets
    • an intuitive (non-mathematical) understanding of the ideas behind the technology and methods
    • pointers to academic papers in the literature, technical blogs and video streams for you to further your theoretical understanding.
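
To make these principles concrete, below is a minimal sketch in Scala, assuming a Spark 2.x SparkSession named `spark` (as provided in Databricks notebooks): transformations lazily build a DAG, an action executes it, and the recorded lineage is what makes the computation fault-tolerant.

```scala
// A minimal sketch of Spark's lazy, DAG-based, fault-tolerant model.
// Assumes a SparkSession named `spark` (as provided in Databricks notebooks).
val rdd = spark.sparkContext.parallelize(1 to 1000000) // a resilient distributed dataset (RDD)

// Transformations only extend the DAG of the computation; nothing runs yet.
val squaresOfEvens = rdd.filter(_ % 2 == 0).map(x => x.toLong * x)

// Caching keeps the dataset in cluster memory for fast reuse; if an executor
// is lost, the lineage recorded in the DAG lets Spark recompute only the
// missing partitions.
squaresOfEvens.cache()

// An action triggers execution of the whole DAG across the cluster.
val total = squaresOfEvens.reduce(_ + _)
println(s"sum of squares of evens = $total")
println(squaresOfEvens.toDebugString) // prints the lineage (the DAG)
```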

More concretely, you will be able to:

Part 1: Days 1-4 of training (focused on data engineering)

1.1 Extract, Transform, Load, Interact, Explore and Analyze Data

(watch later) Exploring Apache Web Logs (semi-structured data)

Databricks jump start

(watch later) Exploring Wikipedia Click Streams (structured data)

Michael Armbrust Spark Summit East
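
To give a flavour of 1.1, here is a minimal hypothetical ETL-and-explore sketch in Scala; the file path and the `status` column are made up for illustration, and a SparkSession named `spark` is assumed:

```scala
import org.apache.spark.sql.functions._

// Extract: read a (hypothetical) CSV of web-log-like records.
val logsDF = spark.read
  .option("header", "true")
  .option("inferSchema", "true")
  .csv("/tmp/weblogs.csv") // hypothetical path and schema

// Transform: clean rows and derive a new column with the DataFrame API.
val cleanDF = logsDF
  .filter(col("status").isNotNull)
  .withColumn("isError", col("status") >= 400)

// Interact/Explore: register a temporary view and query it with SQL.
cleanDF.createOrReplaceTempView("logs")
spark.sql("SELECT status, COUNT(*) AS hits FROM logs GROUP BY status ORDER BY hits DESC").show()
```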

1.2 ETL and SQL on Graphs or network data
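
One convenient way to do ETL and SQL-style queries on graph or network data in Spark is the GraphFrames library. Below is a minimal sketch with a made-up toy social graph; it assumes the graphframes package is attached to the cluster and a SparkSession named `spark`:

```scala
import org.apache.spark.sql.functions.desc
import org.graphframes.GraphFrame

// Vertices need an "id" column; edges need "src" and "dst" columns.
val v = spark.createDataFrame(Seq(
  ("a", "Alice"), ("b", "Bob"), ("c", "Carol")
)).toDF("id", "name")
val e = spark.createDataFrame(Seq(
  ("a", "b", "follows"), ("b", "c", "follows"), ("c", "a", "follows")
)).toDF("src", "dst", "relationship")

val g = GraphFrame(v, e)

// Graph queries compose with ordinary DataFrame/SQL operations:
g.inDegrees.orderBy(desc("inDegree")).show() // who is followed the most?
println(g.edges.filter("relationship = 'follows'").count())
```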

1.3 Working with Structured Streaming Data
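
As a first taste of structured streaming, here is a minimal self-contained sketch in Scala using Spark's built-in rate source, so no external data feed is needed; the console sink and the timeout are chosen only for illustration:

```scala
import org.apache.spark.sql.functions._

// A self-contained sketch using Spark's built-in rate source, which
// generates (timestamp, value) rows at a fixed rate.
val streamDF = spark.readStream
  .format("rate")
  .option("rowsPerSecond", "10")
  .load()

// A streaming aggregation: count events per 10-second (tumbling) window.
val counts = streamDF
  .groupBy(window(col("timestamp"), "10 seconds"))
  .count()

// Write the running counts to the console sink (for demos only).
val query = counts.writeStream
  .outputMode("complete")
  .format("console")
  .start()

query.awaitTermination(30000) // let the sketch run for ~30 seconds
```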

Part 2: Days 5-6 of training

Computational, Mathematical and Statistical Foundations for Data Scientists and Engineers

Here we will use SageMath to get engineers and scientists on the same mathematical page starting from set theory, axiomatic probability theory, statistical decision theory, pseudorandom number generators from first principles, simulation of random variables and random structures including graphs, convergence of random variables, weak law of large numbers, central limit theorem, estimators, and hypothesis tests (parametric and nonparametric) and the principles of statistical learning theory (the mathematics behind machine learning).
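
For orientation, two of the classical convergence results listed above can be stated precisely as follows (standard statements, written in LaTeX):

```latex
% Weak law of large numbers: for i.i.d. X_1, X_2, ... with E[X_1] = \mu,
% the sample mean converges in probability to \mu:
\bar{X}_n := \frac{1}{n}\sum_{i=1}^{n} X_i, \qquad
\lim_{n \to \infty} \Pr\!\left( \left| \bar{X}_n - \mu \right| > \epsilon \right) = 0
\quad \text{for every } \epsilon > 0.

% Central limit theorem: if additionally Var(X_1) = \sigma^2 < \infty, then
% the standardised sample mean converges in distribution to a standard normal:
\sqrt{n}\, \frac{\bar{X}_n - \mu}{\sigma} \xrightarrow{\;d\;} N(0, 1).
```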

Part 3: Days 7-12 of training

3. Build Scalable Machine Learning Pipelines (or help build them)

Apply standard learning methods via scalably servable end-to-end industrial ML pipelines

ETL, model, validate, test, re-ETL (feature re-engineering), model, validate, test, ..., serve model to clients (a minimal pipeline sketch follows the list below)

(we will choose from this list for training days 7-12)
  • Supervised Learning Methods: Regression /Classification
  • Unsupervised Learning Methods: Clustering
  • Recommendation systems
  • Streaming
  • Graph processing
  • Geospatial data-processing
  • Topic modeling
  • Deep Learning
  • ...

Part 4: Day 13 is Open-Surgery of Production-ready Prototypes/Projects



(watch later) Spark Summit 2015 demo: Creating an end-to-end machine learning data pipeline with Databricks (Live Sentiment Analysis)

Ali G's Live Sentiment Analysis

(watch later) Spark Summit 2017 - Expanding Apache Spark Use Cases in 2.x and Beyond - Matei Zaharia, Tim Hunter & Michael Armbrust - Deep Learning and Structured Streaming

Expanding Apache Spark Use Cases in 2.2 and Beyond - Matei Zaharia, Tim Hunter & Michael Armbrust - Spark Summit 2017 - Deep Learning and Structured Streaming

Recent videos are archived here (these videos are a great way to spend lunch with your mates!):

Navigate to the bottom of the next embed and click on the video archives link.



65 minutes of 90 minutes are up!

EXTRA: Databases Versus Data Science

by Anthony Joseph in BerkeleyX/CS100.1x

  • (watch later 2:31): Why all the excitement about Big Data Analytics? (using Google search to now-cast Google Flu Trends)
    • A Brief History of Data Analysis by Anthony Joseph in BerkeleyX/CS100.1x
  • other interesting big data examples: recommender systems and the Netflix Prize?

  • (watch later 10:41): Contrasting data science with traditional databases, ML, Scientific computing

    • Data Science Database Contrast by Anthony Joseph in BerkeleyX/CS100.1x
    • SUMMARY:
      • traditional databases versus data science
        • preciousness versus cheapness of the data
        • ACID and eventual consistency, CAP theorem, ...
        • interactive querying: SQL versus noSQL
        • querying the past versus querying/predicting the future
      • traditional scientific computing versus data science
        • science-based or mechanistic models versus data-driven black-box (deep-learning) statistical models (of course both schools co-exist)
        • super-computers in traditional science-based models versus cluster of commodity computers
      • traditional ML versus data science
        • smaller amounts of clean data in traditional ML versus massive amounts of dirty data in data science
        • traditional ML researchers try to publish academic papers versus data scientists try to produce actionable intelligent systems
  • (watch later 1:49): Three Approaches to Data Science
    • Approaches to Data Science by Anthony Joseph in BerkeleyX/CS100.1x
  • (watch later 4:29): Performing Data Science and Preparing Data, Data Acquisition and Preparation, ETL, ...
    • Data Science Database Contrast by Anthony Joseph in BerkeleyX/CS100.1x
  • (watch later 2:01): Four Examples of Data Science Roles
    • Data Science Roles by Anthony Joseph in BerkeleyX/CS100.1x
    • SUMMARY of Data Science Roles.
      • individual roles:
        1. business person
        2. programmer
      • organizational roles:
        1. enterprise
        2. web company
    • Each role has its own unique set of:
      • data sources
      • Extract-Transform-Load (ETL) process
      • business intelligence and analytics tools
    • Most Maths/Stats/Computing programs cater to the programmer role
      • NumPy and Matplotlib, R, MATLAB, and Octave.