ScaDaMaLe Course site and book

Sketch Origins

READ: Philippe Flajolet and Nigel Martin (1985), Probabilistic Counting Algorithms for Data Base Applicaitons, http://db.cs.berkeley.edu/cs286/papers/flajoletmartin-jcss1985.pdf.

Apache Sketches

Some general recent talks/blogs on various sketches:

  • https://databricks.com/session_na20/high-performance-analytics-with-probabilistic-data-structures-the-power-of-hyperloglog
  • https://databricks.com/blog/2016/05/19/approximate-algorithms-in-apache-spark-hyperloglog-and-quantiles.html
    • the above has a databricks notebook you can try to self-study

We will next focus on a specific sketch called T-Digest for approximating extreme quantiles: - https://databricks.com/session/one-pass-data-science-in-apache-spark-with-generative-t-digests - https://databricks.com/session/sketching-data-with-t-digest-in-apache-spark