SDS-2.2, Scalable Data Science from Atlantis, is a technical course in the area of Big Data, aimed at the needs of Stockholm’s data industry. It is an updated version of SDS-1.6, Scalable Data Science from Middle Earth, which was aimed at the needs of New Zealand’s data industry.

SDS-2.2 uses Apache Spark 2.2, a fast and general engine for large-scale data processing, accessed via Databricks, to compute with datasets that won’t fit on a single computer. SDS-1.6 used Spark version 1.6.

The course will introduce Spark’s core concepts via hands-on coding, including resilient distributed datasets and map-reduce algorithms, DataFrames and Spark SQL on Catalyst, scalable machine-learning pipelines in MLlib, and vertex programs using the distributed graph-processing framework GraphX. We will solve instances of real-world big data decision problems from various scientific domains.
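
To give a flavour of the map-reduce style mentioned above, here is a minimal sketch in plain Python of a word count split into map, shuffle and reduce phases. It is only an illustration of the pattern and does not use Spark's API or course material; the function names (`map_phase`, `shuffle_phase`, `reduce_phase`, `word_count`) and the sample sentences are assumptions made up for this example. In Spark, the same three phases would run in parallel across the partitions of a distributed dataset rather than in a single process.

```python
from collections import defaultdict
from functools import reduce


def map_phase(lines):
    """Emit a (word, 1) pair for every whitespace-separated word in every line."""
    return [(word.lower(), 1) for line in lines for word in line.split()]


def shuffle_phase(pairs):
    """Group the emitted values by key, as a cluster does between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups):
    """Combine each key's values with an associative operation (here, addition)."""
    return {key: reduce(lambda a, b: a + b, values) for key, values in groups.items()}


def word_count(lines):
    """Hypothetical helper chaining the three phases on an in-memory list of lines."""
    return reduce_phase(shuffle_phase(map_phase(lines)))


# Made-up sample input, small enough to check by hand.
counts = word_count(["big data is big", "data science at scale"])
```

Because the reduce step uses an associative operation, partial counts computed on separate machines can be merged in any order, which is what lets the pattern scale to datasets that do not fit on one computer.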

This course is being prepared by Raazesh Sainudiin with assistance from Tilo Wiklund and Dan Strangberg.