- To get an overview of sketching in under 30 minutes see this **Spark-AI 2020 Talk on Apache Sketches by Lee RHodes*:
- Other great recent talks:
- Yeshwant Vijaykumar of Adobe Inc at Spark+AI Summit 2020:
- Simeon Simeonov of Swoop at Spark+AI Summit 2020:
Sketch Origins
READ: Philippe Flajolet and Nigel Martin (1985), Probabilistic Counting Algorithms for Data Base Applicaitons, http://db.cs.berkeley.edu/cs286/papers/flajoletmartin-jcss1985.pdf.
Apache Sketches
-
Deep dive into the following three links at https://datasketches.apache.org/:
- Architecture And Design,
- Sketch Families and
- System Integrations
- Community
- Research
-
Sketcheing in Apache Spark: https://datasketches.apache.org/docs/Theta/ThetaSparkExample.html
-
See docs:
Some general recent talks/blogs on various sketches:
- https://databricks.com/session_na20/high-performance-analytics-with-probabilistic-data-structures-the-power-of-hyperloglog
- https://databricks.com/blog/2016/05/19/approximate-algorithms-in-apache-spark-hyperloglog-and-quantiles.html
- the above has a databricks notebook you can try to self-study
We will next focus on a specific sketch called T-Digest for approximating extreme quantiles: - https://databricks.com/session/one-pass-data-science-in-apache-spark-with-generative-t-digests - https://databricks.com/session/sketching-data-with-t-digest-in-apache-spark