010_wikipediaClickStream_01ETLEDA(Scala)


ScaDaMaLe Course site and book

Wiki Clickstream Analysis

Dataset: 3.2 billion requests collected during the month of February 2015 grouped by (src, dest)

Source: https://datahub.io/dataset/wikipedia-clickstream/

NY clickstream image

This notebook requires Spark 1.6+.

This notebook was originally a data analysis workflow developed with Databricks Community Edition, a free version of Databricks designed for learning Apache Spark.

Here we elucidate the original Python notebook used in the talk by Michael Armbrust at Spark Summit East in February 2016, shared from https://twitter.com/michaelarmbrust/status/699969850475737088 (you can watch the talk later).

Michael Armbrust Spark Summit East

Data set

Wikipedia Logo

The data we are exploring in this lab is the February 2015 English Wikipedia Clickstream data, and it is available here: http://datahub.io/dataset/wikipedia-clickstream/resource/be85cc68-d1e6-4134-804a-fd36b94dbb82.

According to Wikimedia:

"The data contains counts of (referer, resource) pairs extracted from the request logs of English Wikipedia. When a client requests a resource by following a link or performing a search, the URI of the webpage that linked to the resource is included with the request in an HTTP header called the "referer". This data captures 22 million (referer, resource) pairs from a total of 3.2 billion requests collected during the month of February 2015."

The data is approximately 1.2GB and it is hosted in the following Databricks file: /databricks-datasets/wikipedia-datasets/data-001/clickstream/raw-uncompressed

display(dbutils.fs.ls("/databricks-datasets/wikipedia-datasets/"))
 
path                                                     name       size
dbfs:/databricks-datasets/wikipedia-datasets/data-001/   data-001/  0

Let us first understand this Wikimedia data set a bit more

Let's read the datahub-hosted link https://datahub.io/dataset/wikipedia-clickstream in the embedding below. Also click the blog by Ellery Wulczyn, Data Scientist at The Wikimedia Foundation, to better understand how the data was generated (remember to Right-Click and use -> and <- if navigating within the embedded html frame below).


Run the next two cells for some housekeeping.

if (org.apache.spark.BuildInfo.sparkBranch < "1.6") sys.error("Attach this notebook to a cluster running Spark 1.6+")
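The version check above compares version strings lexicographically, so a hypothetical branch such as "1.10" would sort before "1.6". The following is a minimal sketch of a numeric comparison instead. The helper names `majorMinor` and `atLeast` are illustrative and not part of the notebook, and the version strings are passed in as plain arguments rather than read from the cluster.

```scala
// Hypothetical helpers (not from the notebook): compare "major.minor" versions numerically.
def majorMinor(version: String): (Int, Int) = {
  // Keep only the leading digits of each dot-separated part, e.g. "0-SNAPSHOT" becomes "0".
  val nums = version
    .split("\\.")
    .toList
    .map(_.takeWhile(_.isDigit))
    .filter(_.nonEmpty)
    .map(_.toInt)
  (nums.headOption.getOrElse(0), nums.lift(1).getOrElse(0))
}

// True when `version` is at least `required`, comparing major first and then minor.
def atLeast(version: String, required: String): Boolean =
  Ordering[(Int, Int)].gteq(majorMinor(version), majorMinor(required))

// Example: fail fast when a given version string is too old.
val exampleVersion = "2.4" // assumed value, for illustration only
if (!atLeast(exampleVersion, "1.6"))
  sys.error("Attach this notebook to a cluster running Spark 1.6+")
```

For the Spark releases that actually exist, the string comparison and the numeric comparison agree; the numeric form only matters once a minor version reaches two digits.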

Loading and Exploring the data

val data = sc.textFile("dbfs:///databricks-datasets/wikipedia-datasets/data-001/clickstream/raw-uncompressed")
data: org.apache.spark.rdd.RDD[String] = dbfs:///databricks-datasets/wikipedia-datasets/data-001/clickstream/raw-uncompressed MapPartitionsRDD[240] at textFile at command-685894176423189:1
Looking at the first few lines of the data
data.take(5).foreach(println) 
prev_id curr_id n prev_title curr_title type
 3632887 121 other-google !! other
 3632887 93 other-wikipedia !! other
 3632887 46 other-empty !! other
 3632887 10 other-other !! other
data.take(2)
res4: Array[String] = Array(prev_id curr_id n prev_title curr_title type, " 3632887 121 other-google !! other")
  • The first line looks like a header
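  • The second line (separated from the first by ",") contains data organized according to the header. Note that the header has six columns but the line shows only five values and starts with whitespace, so prev_id is empty here, curr_id = 3632887, n = 121, and so on.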
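To make the header-to-value mapping concrete, here is a small sketch that pairs each header name with the corresponding field of the first data line. It assumes the file is tab-separated and that the first data line begins with an empty field, which is what the leading whitespace in the `data.take(2)` output suggests. The two sample strings are typed in by hand to mirror that output; they are not read from the dataset.

```scala
// Assumption: fields are separated by tabs; these strings mirror the data.take(2) output.
val header   = "prev_id\tcurr_id\tn\tprev_title\tcurr_title\ttype"
val firstRow = "\t3632887\t121\tother-google\t!!\tother"

val columns = header.split("\t")
// A negative limit makes split keep empty fields, including trailing ones.
val values  = firstRow.split("\t", -1)

// Pair each column name with its value, in header order.
val pairs = columns.zip(values)
pairs.foreach { case (name, value) => println(s"$name = '$value'") }

// The same pairs as a lookup table keyed by column name.
val record: Map[String, String] = pairs.toMap
```

Under these assumptions `record("prev_id")` is the empty string, while `record("curr_id")` and `record("n")` hold the first two numbers on the line.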

Actually, here is the meaning of each column:

  • prev_id: if the referer does not correspond to an article in the main namespace of English Wikipedia, this value will be empty. Otherwise, it contains the unique MediaWiki page ID of the article corresponding to the referer, i.e., the previous article the client was on

  • curr_id: the MediaWiki unique page ID of the article the client requested

  • prev_title: the result of mapping the referer URL to the fixed set of values described below

  • curr_title: the title of the article the client requested

  • n: the number of occurrences of the (referer, resource) pair

  • type

    • "link" if the referer and request are both articles and the referer links to the request
    • "redlink" if the referer is an article and links to the request, but the request is not in the production enwiki.page table
    • "other" if the referer and request are both articles but the referer does not link to the request. This can happen when clients search or spoof their referer.
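The column descriptions above can be captured in a typed record. The sketch below is one possible way to do it, not the notebook's own code: the names `Click` and `parseClick` are hypothetical, the fields follow the header order (prev_id, curr_id, n, prev_title, curr_title, type), and the input is assumed to be tab-separated. An empty prev_id becomes `None`, and lines that do not parse (such as the header) are dropped.

```scala
import scala.util.Try

// Hypothetical record type (not from the notebook), with fields in header order.
final case class Click(
  prevId: Option[Long], // None when the referer is not a main-namespace article
  currId: Long,         // MediaWiki page ID of the requested article
  n: Long,              // number of occurrences of the (referer, resource) pair
  prevTitle: String,
  currTitle: String,
  linkType: String      // "link", "redlink" or "other"
)

// Parse one tab-separated line; returns None for the header or any malformed line.
def parseClick(line: String): Option[Click] =
  line.split("\t", -1) match {
    case Array(prev, curr, n, prevTitle, currTitle, tpe) =>
      Try(
        Click(
          if (prev.trim.isEmpty) None else Some(prev.trim.toLong),
          curr.trim.toLong,
          n.trim.toLong,
          prevTitle,
          currTitle,
          tpe
        )
      ).toOption
    case _ => None
  }

// Usage on hand-typed sample lines (assumed to mirror the file's layout).
val sampleLines = Seq(
  "prev_id\tcurr_id\tn\tprev_title\tcurr_title\ttype",
  "\t3632887\t121\tother-google\t!!\tother"
)
val clicks = sampleLines.flatMap(parseClick)
```

In a notebook, the same function could be applied to the RDD of lines with `flatMap`, which would drop the header row as a side effect of the failed numeric conversion.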