%md
### A bit about your instructor:

I, Raazesh Sainudiin or **Raaz**, will be your instructor for the course in data science. I have

* more than 14 years of academic research experience in applied mathematics and statistics and
* nearly two to four years of full-time and part-time experience in the data industry.

I currently (2019) have an effective joint appointment as:

* [Associate Professor of Mathematics with specialisation in Data Science](http://katalog.uu.se/profile/?id=N17-214) at [Department of Mathematics](http://www.math.uu.se/), [Uppsala University](http://www.uu.se/), Uppsala, Sweden and
* Principal Data Scientist at [Combient AB](https://combient.com/), Stockholm, Sweden

Quick links on Raaz's background:

* [https://nz.linkedin.com/in/raazesh-sainudiin-45955845](https://nz.linkedin.com/in/raazesh-sainudiin-45955845)
* [Raaz's academic CV](https://lamastex.github.io/cv/)
%md
# What is Scalable [Data Science](https://en.wikipedia.org/wiki/Data_science) in one picture?

---

# The Scalable Data Engineering Science Process:
## Towards Production-Ready and Productisable Prototyping for the Content Factory

This can be summarised in [Andrew Morgan](https://www.linkedin.com/in/andrew-morgan-8590b22/)'s image of the **Content Factory**:
%md
Source: [Vasant Dhar, Data Science and Prediction, Communications of the ACM, Vol. 56 (1). p. 64, DOI:10.1145/2500499](http://dl.acm.org/citation.cfm?id=2500499)

### key insights
* Data Science is the study of *the generalizable extraction of knowledge from data*.
* A common epistemic requirement in assessing whether new knowledge is actionable for decision making is its predictive power, not just its ability to explain the past.
* A *data scientist requires an integrated skill set spanning*
  * mathematics,
  * machine learning,
  * artificial intelligence,
  * statistics,
  * databases, and
  * optimization,
  * along with a deep understanding of the craft of problem formulation to engineer effective solutions.

Source: [Machine learning: Trends, perspectives, and prospects, M. I. Jordan, T. M. Mitchell, Science 17 Jul 2015: Vol. 349, Issue 6245, pp. 255-260, DOI: 10.1126/science.aaa8415](http://science.sciencemag.org/content/349/6245/255.full-text.pdf+html)

### key insights
* ML is concerned with the building of computers that improve automatically through experience
* ML lies at the intersection of computer science and statistics and at the core of artificial intelligence and data science
* Recent progress in ML is due to:
  * development of new algorithms and theory
  * ongoing explosion in the availability of online data
  * availability of low-cost computation (through clusters of commodity hardware in the *cloud*)
* The adoption of data science and ML methods is leading to more evidence-based decision-making across:
  * health sciences (e.g. neuroscience research)
  * manufacturing
  * robotics (autonomous vehicles)
  * vision, speech processing, natural language processing
  * education
  * financial modeling
  * policing
  * marketing
//This allows easy embedding of publicly available information into any other notebook
//Example usage:
// displayHTML(frameIt("https://en.wikipedia.org/wiki/Latent_Dirichlet_allocation#Topics_in_LDA",250))
def frameIt( u:String, h:Int ) : String = {
  """<iframe
 src=""""+ u+""""
 width="95%" height="""" + h + """"
 sandbox>
  <p>
    <a href="http://spark.apache.org/docs/latest/index.html">
      Fallback link for browsers that, unlikely, don't support frames
    </a>
  </p>
</iframe>"""
}
displayHTML(frameIt("https://en.wikipedia.org/wiki/Data_science",500))
%md
# Data Engineering

There are several views on what a data engineer is supposed to do:

Some views are rather narrow and emphasise division of labour between data engineers and data scientists:

- https://www.oreilly.com/ideas/data-engineering-a-quick-and-simple-definition
- Let's check out what skills a data engineer is expected to have according to the link above.

> "Ian Buss, principal solutions architect at Cloudera, notes that data scientists focus on finding new insights from a data set, while data engineers are concerned with the production readiness of that data and all that comes with it: formats, scaling, resilience, security, and more."

> What skills do data engineers need?

> Those “10-30 different big data technologies” Anderson references in “Data engineers vs. data scientists” can fall under numerous areas, such as file formats, ingestion engines, stream processing, batch processing, batch SQL, data storage, cluster management, transaction databases, web frameworks, data visualizations, and machine learning. And that’s just the tip of the iceberg.

> Buss says data engineers should have the following skills and knowledge:

> - They need to know Linux and they should be comfortable using the command line.
> - They should have experience programming in at least Python or Scala/Java.
> - They need to know SQL.
> - They need some understanding of distributed systems in general and how they are different from traditional storage and processing systems.
> - They need a deep understanding of the ecosystem, including ingestion (e.g. Kafka, Kinesis), processing frameworks (e.g. Spark, Flink) and storage engines (e.g. S3, HDFS, HBase, Kudu). They should know the strengths and weaknesses of each tool and what it's best used for.
> - They need to know how to access and process data.

Let's dive deeper into such highly compartmentalised views of data engineers and data scientists and the so-called "machine learning engineers" according to the following view:

- https://www.oreilly.com/ideas/data-engineers-vs-data-scientists

embedded below.
displayHTML(frameIt("https://www.oreilly.com/ideas/data-engineers-vs-data-scientists",500))
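Several of the skills listed above (SQL, Scala, distributed processing frameworks such as Spark) come together in just a few lines of Spark SQL. Here is a minimal sketch, assuming a Databricks session with the predefined `SparkSession` `spark`; the file path and the `userId` column are hypothetical placeholders.

```scala
// ingest: infer a schema from semi-structured, newline-delimited JSON (hypothetical path)
val events = spark.read.json("/tmp/events.json")
events.createOrReplaceTempView("events")   // expose the DataFrame to SQL

// process: a classic top-k aggregation over a hypothetical userId column
spark.sql("""
  SELECT userId, COUNT(*) AS clicks
  FROM events
  GROUP BY userId
  ORDER BY clicks DESC
  LIMIT 10
""").show()
```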
%md
# The Data Engineering Scientist as "The Middle Way"

Here are some basic axioms that should be self-evident.

- Yes, there are differences in skillsets across humans
  - some humans will be better and have inclinations for engineering and others for pure mathematics by nature and nurture
  - one human cannot easily be a master of everything needed for innovating a new data-based product or service (very, very rarely)
- Skills can be gained by any human who wants to learn to the extent s/he is able to expend time, energy, etc.

For the **Scalable Data Engineering Science Process:** *towards Production-Ready and Productisable Prototyping for the Data Factory* we need to allow each data engineer to be more of a data scientist and each data scientist to be more of a data engineer, up to each individual's *comfort zones* in technical and mathematical/conceptual and time-availability planes, but with some **minimal expectations** of mutual appreciation.

This course is designed to help you take the first minimal steps towards **data engineering science**. In the sequel it will become apparent **why a team of data engineering scientists** with skills across the conventional (2019) spectrum of data engineer versus data scientist **is crucial** for **Production-Ready and Productisable Prototyping for the Data Factory**.
%md
## Standing on shoulders of giants!

This course will build on two other edX courses where needed.

* [BerkeleyX/CS100-1x, Introduction to Big Data Using Apache Spark by Anthony A Joseph, Chancellor's Professor, UC Berkeley](https://www.edx.org/course/introduction-big-data-apache-spark-uc-berkeleyx-cs100-1x)
* [BerkeleyX/CS190-1x, Scalable Machine Learning by Ameet Talwalkar, Asst. Prof., UC Los Angeles](https://www.edx.org/course/scalable-machine-learning-uc-berkeleyx-cs190-1x)

We encourage you to take these courses if you have more time. For those of you (including the course coordinator) who have taken these courses formally in 2015, this course will be an *expanded Scala version* with an emphasis on an *individualized course project* as opposed to completing labs that test syntactic skills.

We will also be borrowing more theoretical aspects from the following course:

* [Stanford/CME323, Distributed Algorithms and Optimization by Reza Zadeh, Asst. Prof., Institute for Computational and Mathematical Engineering, Stanford Univ.](http://stanford.edu/~rezab/dao/)

The first two recommended readings below are already somewhat outdated, the third is advanced but current, and the last two are more recent:

* Learning Spark: Lightning-Fast Data Analytics by Holden Karau, Andy Konwinski, Patrick Wendell, and Matei Zaharia, O'Reilly, 2015.
* Advanced Analytics with Spark: Patterns for Learning from Data at Scale by Sandy Ryza, Uri Laserson, Sean Owen and Josh Wills, O'Reilly, 2015.
* High Performance Spark: Best Practices for Scaling and Optimizing Apache Spark by Holden Karau and Rachel Warren, O'Reilly, 2017.
* Mastering Spark for Data Science by Andrew Morgan, Antoine Amend, David George and Matthew Hallett, Packt Publishing, 2017.
* Spark: The Definitive Guide, Big Data Processing Made Simple by Matei Zaharia and Bill Chambers, O'Reilly Media, 2018.

<img src="http://www.syndetics.com/index.aspx?type=xw12&isbn=9781449358624/LC.GIF&client=ucanterburyl&upc=&oclc=" height="150" width="150">
<img src="http://t3.gstatic.com/images?q=tbn:ANd9GcSQs35NvHVozz77dhXYc2Ce8lKyJkR3oVwaxyA5Ub4W7Kvtvf9i" height="150" width="150">
<img src="http://t2.gstatic.com/images?q=tbn:ANd9GcS7XN41_u0B8XehDmtXLJeuEPgnuULz16oFMRoANYz2e1-Vog3D" height="150" width="150">
<img src="https://images-na.ssl-images-amazon.com/images/I/51wFajfF5oL._SX258_BO1,204,203,200_.jpg" height="150" width="150">
<img src="https://covers.oreillystatic.com/images/0636920034957/lrg.jpg" height="150" width="150">
%md
# How will you be assessed?

There will be minimal exercises and an open mini-project you can do and present briefly to us.

You will also be working on a bigger project in a small team that this training will prepare you for. Hopefully, this project can be turned into an asset.
%md
## A Brief History of Data Analysis and Where Does "Big Data" Come From?
#### by Anthony Joseph in BerkeleyX/CS100.1x

* **(watch now 1:53):** A Brief History of Data Analysis
  * [](https://www.youtube.com/watch?v=5fSSvYlDkag)
* **(watch now 5:05)**: Where does Data Come From?
  * [](https://www.youtube.com/watch?v=eEJFlHE7Gt4?rel=0&autoplay=1&modestbranding=1)
* SUMMARY of some of the sources of big data:
  * online click-streams (a lot of it is recorded but a tiny amount is analyzed):
    * record every click
    * every ad you view
    * every billing event,
    * every transaction, every network message, and every fault.
  * user-generated content (on web and mobile devices):
    * every post that you make on Facebook
    * every picture sent on Instagram
    * every review you write for Yelp or TripAdvisor
    * every tweet you send on Twitter
    * every video that you post to YouTube.
  * science (for scientific computing):
    * data from various repositories for natural language processing:
      * Wikipedia,
      * the Library of Congress,
      * the Twitter firehose, Google n-grams and digital archives,
    * data from scientific instruments/sensors/computers:
      * the Large Hadron Collider (more data in a year than all the other data sources combined!)
      * genome sequencing data (sequencing cost is dropping much faster than Moore's Law!)
      * output of high-performance computers (super-computers) for data fusion, estimation/prediction and exploratory data analysis
  * graphs are also an interesting source of big data (*network science*):
    * social networks (collaborations, followers, fb-friends or other relationships),
    * telecommunication networks,
    * computer networks,
    * road networks
  * machine logs:
    * by servers around the internet (hundreds of millions of machines out there!)
    * internet of things.
%md
## Data Science Defined, Cloud Computing and What's Hard About Data Science?
#### by Anthony Joseph in BerkeleyX/CS100.1x

* **(watch now 2:03)**: Data Science Defined
  * [](https://www.youtube.com/watch?v=g4ujW1m2QNc?rel=0&modestbranding=1)
* **(watch now 1:11)**: Cloud Computing
  * [](https://www.youtube.com/watch?v=TAZvh0WmOHM?rel=0&modestbranding=1)
  * In fact, if you are logged into `https://*.databricks.com/*` you are computing in the cloud!
  * The Scalable Data Science course is supported by the Databricks Academic Partners Program and an AWS Educate Grant to the University of Canterbury (applied for by Raaz Sainudiin in 2015).
* **(watch now 3:31)**: What's hard about data science
  * [](https://www.youtube.com/watch?v=MIqbwJ6AbIY?rel=0&modestbranding=1)
%md
# What should *you* be able to do at the end of this course?

* by following these sessions and doing some HOMEWORK assignments.

## Understand the principles of fault-tolerant scalable computing in Spark

* in-memory and generic DAG extensions of MapReduce
* resilient distributed datasets for fault-tolerance (see the first sketch after this cell)
* skills to process today's big data using state-of-the-art techniques in Apache Spark 2.2, in terms of:
  * hands-on coding with real datasets
  * an intuitive (non-mathematical) understanding of the ideas behind the technology and methods
  * pointers to academic papers in the literature, technical blogs and video streams for *you to further your theoretical understanding*.

# More concretely, you will be able to:

## Part 1: Days 1-4 of training (focused on data engineering)

### 1.1 Extract, Transform, Load, Interact, Explore and Analyze Data

#### (watch later) Exploring Apache Web Logs (semi-structured data)

[](https://vimeo.com/137874931)

#### (watch later) Exploring Wikipedia Click Streams (structured data)

[](https://www.youtube.com/watch?v=35Y-rqSMCCA)

### 1.2 ETL and SQL on Graphs or network data

### 1.3 Working with Structured Streaming Data

(a minimal structured streaming sketch appears after this cell)

## Part 2: Days 5-6 of training

### Computational, Mathematical and Statistical Foundations for Data Scientists and Engineers

Here we will use [SageMath](http://www.sagemath.org/) to get engineers and scientists on the same mathematical page, starting from set theory, axiomatic probability theory, statistical decision theory, pseudorandom number generators from first principles, simulation of random variables and random structures including graphs, convergence of random variables, the weak law of large numbers (stated precisely after this cell), the central limit theorem, estimators, hypothesis tests (parametric and nonparametric) and the principles of statistical learning theory (the mathematics behind machine learning).

## Part 3: Days 7-12 of training

### 3. Build Scalable Machine Learning Pipelines (or help build them)

### Apply standard learning methods via scalably servable *end-to-end industrial ML pipelines*

#### ETL, Model, Validate, Test, reETL (feature re-engineer), model, validate, test, ..., serve model to clients

##### (we will choose from this list for training days 7-12; a minimal pipeline sketch follows this cell)

* Supervised Learning Methods: Regression / Classification
* Unsupervised Learning Methods: Clustering
* Recommendation systems
* Streaming
* Graph processing
* Geospatial data-processing
* Topic modeling
* Deep Learning
* ...

## Part 4: Day 13 is Open-Surgery of Production-ready Prototypes/Projects

---
---

#### (watch later) Spark Summit 2015 demo: Creating an end-to-end machine learning data pipeline with Databricks (Live Sentiment Analysis)

[](https://www.youtube.com/watch?v=NR1MYg_7oSg)

#### (watch later) Spark Summit 2017 - Expanding Apache Spark Use Cases in 2.x and Beyond - Matei Zaharia, Tim Hunter & Michael Armbrust - Deep Learning and Structured Streaming

[](https://www.youtube.com/watch?v=qAZ5XUz32yM)

Recent videos are archived here (these videos are a great way to have lunch over with your mates!):

- https://databricks.com/sparkaisummit
- https://databricks.com/sparkaisummit/sessions

Navigate to the bottom of the embed further below and click on the video archives link.
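To make the first learning goal concrete, here is a minimal sketch of a fault-tolerant, in-memory computation on a resilient distributed dataset. It assumes a Databricks or `spark-shell` session where the `SparkContext` `sc` is predefined; the numbers are illustrative only.

```scala
// a minimal sketch, assuming a session where sc (the SparkContext) is predefined
val rdd = sc.parallelize(1 to 1000000)       // distribute a local range across the cluster
val evens = rdd.filter(_ % 2 == 0).cache()   // a lazy transformation; cache() keeps the result in memory
val s = evens.map(x => x.toLong * x)         // square each value (as Long, to avoid Int overflow)
             .reduce(_ + _)                  // an action that triggers the DAG of transformations
println(s"sum of squares of the even numbers up to a million: $s")
```

If a node holding cached partitions of `evens` fails, Spark recomputes just those partitions from the lineage of transformations; this is the fault-tolerance the RDD abstraction buys you.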
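For Part 1.3, here is a minimal sketch of structured streaming: a stream of arriving files is treated as an unbounded table and an aggregation over it is kept up to date. The input path and the schema are hypothetical placeholders, and `spark` is the predefined `SparkSession`.

```scala
import org.apache.spark.sql.types._

// hypothetical schema for newline-delimited JSON files arriving in a directory
val schema = new StructType()
  .add("time", TimestampType)
  .add("value", DoubleType)

// treat the directory as an unbounded table of rows
val streamDF = spark.readStream.schema(schema).json("/tmp/streaming-input")

// an aggregation that Spark incrementally maintains as new files arrive
val counts = streamDF.groupBy("value").count()

val query = counts.writeStream
  .outputMode("complete")   // emit the full updated result table on each trigger
  .format("console")
  .start()
```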
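As a taste of Part 2, one of the results we will build up to from first principles is the weak law of large numbers, stated here in LaTeX for IID random variables with finite common mean:

```latex
% X_1, X_2, \ldots are IID with E[X_i] = \mu; the sample mean converges in probability to \mu
\bar{X}_n := \frac{1}{n}\sum_{i=1}^{n} X_i,
\qquad
\lim_{n \to \infty} P\left( \left| \bar{X}_n - \mu \right| > \epsilon \right) = 0
\quad \text{for every } \epsilon > 0.
```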
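Finally, as a preview of the end-to-end industrial ML pipelines of Part 3, here is a minimal sketch of a Spark ML pipeline chaining feature engineering and model fitting. The tiny inline DataFrame and its columns (`text`, `label`) are hypothetical placeholders for real training data.

```scala
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.feature.{HashingTF, Tokenizer}

// hypothetical toy training data with columns "id", "text" and "label"
val trainingDF = spark.createDataFrame(Seq(
  (0L, "spark is fast", 1.0),
  (1L, "disks are slow", 0.0)
)).toDF("id", "text", "label")

val tokenizer = new Tokenizer().setInputCol("text").setOutputCol("words")
val hashingTF = new HashingTF().setInputCol("words").setOutputCol("features")
val lr = new LogisticRegression().setMaxIter(10)

// ETL (featurisation) and modelling chained into one servable pipeline
val pipeline = new Pipeline().setStages(Array(tokenizer, hashingTF, lr))
val model = pipeline.fit(trainingDF)   // fit all stages on the training data
model.transform(trainingDF).select("text", "probability", "prediction").show()
```

The same fitted `model` can be validated on held-out data, re-fit after feature re-engineering and eventually served to clients, which is exactly the reETL loop sketched in the heading above.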
displayHTML(frameIt("https://databricks.com/sparkaisummit/sessions",500))