SDS-3.x: Scalable Data Science and Distributed Machine Learning

000_2-sds-3-x-ml: Deeper Dive into Distributed Machine Learning

Topics: Distributed Simulation; various supervised and unsupervised ML Algorithms; Linear Algebra; Vertex Programming using SparkML, GraphX and piped-RDDs.

1. Creating packages within notebooks

2. Introduction to Distributed Simulation and Machine Learning

3. Unsupervised Learning - Clustering, K-Means of 1 Million Songs

4. Supervised Learning - Classification, Decision Trees and Hand-written Digit Recognition

5. Linear Algebra for Distributed Machine Learning

6. Supervised Learning - Regression and Random Forests

7. Distributed Vertex Programming, ETL and Graph Querying with GraphX and GraphFrames

8. Old Bailey Online - ETL of XML

9. Piped RDDs - Rigorous Bayesian AB Testing on Old Bailey Online Data

10. Latent Dirichlet Allocation of NewsGroups and Cornell Movie Dialogs

11. Collaborative Filtering for Recommendation Systems

12. Extending built-in functions in GraphX

13. Fraud Detection with Decision Trees