ScaDaMaLe Course site and book

Why Apache Spark?

  • Apache Spark: A Unified Engine for Big Data Processing, by Matei Zaharia, Reynold S. Xin, Patrick Wendell, Tathagata Das, Michael Armbrust, Ankur Dave, Xiangrui Meng, Josh Rosen, Shivaram Venkataraman, Michael J. Franklin, Ali Ghodsi, Joseph Gonzalez, Scott Shenker, and Ion Stoica. Communications of the ACM, Vol. 59, No. 11, pp. 56-65. DOI: 10.1145/2934664

Apache Spark ACM Video

Right-click the image-link above, open it in a new tab, and watch the video (4 minutes), or read about it in the Communications of the ACM in the frame below or from the link above.

**Key Insights from Apache Spark: A Unified Engine for Big Data Processing**

  • A simple programming model can capture streaming, batch, and interactive workloads and enable new applications that combine them.
  • Apache Spark applications range from finance to scientific data processing and combine libraries for SQL, machine learning, and graphs.
  • In six years, Apache Spark has grown to 1,000 contributors and thousands of deployments.


Spark 3.0 is the latest version at the time of writing (2020-09-18), and it should be seen as the latest step in the evolution of tools in the big data ecosystem, as summarized in https://towardsdatascience.com/what-is-big-data-understanding-the-history-32078f3b53ce:

Spark in context

Alternatives to Apache Spark

There are several alternatives to Apache Spark, but as of 2021 none of them match Spark's penetration and community.

For real-time streaming operations, Apache Flink is competitive; see Apache Flink vs Spark – Will one overtake the other? for a July 2021 comparison. Most scalable data science and engineering problems faced by several major industries in Sweden today are routinely solved using tools from the ecosystem around Apache Spark. We will therefore focus on Apache Spark, which also still holds the world sort record for 10 TB (10,000 GB), set by Alibaba Cloud on 2020-06-17.

The big data problem

Hardware, distributing work, handling failed and slow machines

Let us recall and appreciate the following:

  • The Big Data Problem
    • Many routine problems today involve dealing with "big data": operationally, a dataset that is larger than a few TB and thus won't fit into a single commodity computer, such as a powerful desktop or laptop.
  • Hardware for Big Data
    • Even the best single commodity computer cannot handle big data, as it has limited disk and memory.
    • Thus, we need to spread the data across many commodity computers that are networked together via cables to communicate instructions and data between them; such a cluster can be thought of as a cloud (see the sketch after this list).
  • How to distribute work across a cluster of commodity machines?
    • We need a software-level framework for this.
  • How to deal with failures or slow machines?
    • We also need a software-level framework for this.
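To make the need for such a framework concrete, here is a minimal sketch, assuming a Spark shell or Databricks notebook where the SparkContext `sc` is already in scope, of how Spark splits a dataset into partitions that can live on, and be processed by, different machines of a cluster:

```scala
// Minimal sketch; assumes a Spark shell / Databricks notebook
// where the SparkContext `sc` is already provided.

// Pretend this range is our "big data": Spark splits it into 8 partitions,
// each of which can be stored and processed on a different machine.
val data = sc.parallelize(1 to 1000000, numSlices = 8)
println(data.getNumPartitions) // 8

// The computation runs in parallel on the executors holding the partitions;
// the framework, not the programmer, handles scheduling and failed machines.
val total = data.map(_.toLong).reduce(_ + _)
println(total) // 500000500000
```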

Key Papers

MapReduce and Apache Spark.

MapReduce, as we will shortly see in action, is a framework for distributed, fault-tolerant computing over a fault-tolerant distributed file system, such as the Google File System or the open-source Hadoop Distributed File System (HDFS), for storage.
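To fix ideas before we see it in action, here is a minimal word-count sketch of the MapReduce programming model, written with plain Scala collections on a single machine; frameworks like Hadoop or Spark run the same two phases, but shuffle the mapped records across machines between them:

```scala
val documents = Seq("spark is fast", "mapreduce is fault tolerant", "spark is in memory")

// Map phase: emit a (key, value) pair for every word.
val mapped: Seq[(String, Int)] =
  documents.flatMap(_.split(" ")).map(word => (word, 1))

// Shuffle: group all pairs sharing a key together
// (on a real cluster this moves data over the network and, in Hadoop, through disk).
val shuffled: Map[String, Seq[Int]] =
  mapped.groupBy(_._1).mapValues(_.map(_._2)).toMap

// Reduce phase: combine the values for each key.
val counts: Map[String, Int] = shuffled.map { case (word, ones) => (word, ones.sum) }

counts.foreach(println) // (spark,2), (is,3), ...
```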

  • Unfortunately, MapReduce is bounded by disk I/O and can be slow
    • especially when a sequence of MapReduce operations requires multiple disk I/O operations
  • Apache Spark can use memory instead of disk to speed up MapReduce operations
    • Spark versus MapReduce: the speed-up is orders of magnitude
  • SUMMARY (a minimal RDD sketch follows this list)
    • Spark uses memory instead of disk alone and is thus faster than Hadoop MapReduce
    • Spark's resilience abstraction is the RDD (resilient distributed dataset)
    • RDDs can be recovered upon failure from their lineage graphs, the recipes to remake them starting from raw data
    • Spark supports a lot more than MapReduce, including streaming, interactive in-memory querying, etc.
    • Spark demonstrated an unprecedented sort of 1 petabyte (1,000 terabytes) of data in 234 minutes running on 190 Amazon EC2 instances (in 2015)
    • Spark expertise corresponds to the highest median salary in the US (~ USD 150K)
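As a minimal sketch of these summary points, again assuming a Spark shell or Databricks notebook with `sc` in scope, here is the same word count expressed on an RDD, kept in cluster memory with `cache()` and carrying a lineage graph that Spark can replay to recover lost partitions:

```scala
// Transformations are lazy: they only record the lineage (the recipe).
val lines = sc.parallelize(Seq("spark is fast", "mapreduce is fault tolerant", "spark is in memory"))
val counts = lines
  .flatMap(_.split(" "))
  .map(word => (word, 1))
  .reduceByKey(_ + _) // done in memory, without intermediate disk I/O
  .cache()            // keep the result in cluster memory for reuse

counts.collect().foreach(println)

// The lineage graph: if a machine holding a partition fails,
// Spark recomputes just that partition from this recipe.
println(counts.toDebugString)
```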


Next, let us get everyone to log in to Databricks (or another Spark platform) to get our hands dirty with some Spark code!