010_wikipediaClickStream_01ETLEDA(Scala)

Wiki Clickstream Analysis

Dataset: 3.2 billion requests collected during the month of February 2015 grouped by (src, dest)

Source: https://datahub.io/dataset/wikipedia-clickstream/

NY clickstream image

This notebook requires Spark 1.6+.

This notebook was originally a data analysis workflow developed with Databricks Community Edition, a free version of Databricks designed for learning Apache Spark.

Here we elucidate the original python notebook (also linked here) used in the talk by Michael Armbrust at Spark Summit East February 2016 shared from https://twitter.com/michaelarmbrust/status/699969850475737088 (watch later)

Michael Armbrust Spark Summit East

Data set

Wikipedia Logo

The data we are exploring in this lab is the February 2015 English Wikipedia Clickstream data, and it is available here: http://datahub.io/dataset/wikipedia-clickstream/resource/be85cc68-d1e6-4134-804a-fd36b94dbb82.

According to Wikimedia:

"The data contains counts of (referer, resource) pairs extracted from the request logs of English Wikipedia. When a client requests a resource by following a link or performing a search, the URI of the webpage that linked to the resource is included with the request in an HTTP header called the "referer". This data captures 22 million (referer, resource) pairs from a total of 3.2 billion requests collected during the month of February 2015."

The data is approximately 1.2GB and it is hosted in the following Databricks file: /databricks-datasets/wikipedia-datasets/data-001/clickstream/raw-uncompressed

Let us first understand this Wikimedia data set a bit more

Let's read the datahub-hosted link https://datahub.io/dataset/wikipedia-clickstream in the embedding below. Also click the blog by Ellery Wulczyn, Data Scientist at The Wikimedia Foundation, to better understand how the data was generated (remember to Right-Click and use -> and <- if navigating within the embedded html frame below).

Run the next two cells for some housekeeping.

if (org.apache.spark.BuildInfo.sparkBranch < "1.6") sys.error("Attach this notebook to a cluster running Spark 1.6+")

Loading and Exploring the data

val data = sc.textFile("dbfs:///databricks-datasets/wikipedia-datasets/data-001/clickstream/raw-uncompressed")

data: org.apache.spark.rdd.RDD[String] = dbfs:///databricks-datasets/wikipedia-datasets/data-001/clickstream/raw-uncompressed MapPartitionsRDD[1] at textFile at <console>:34
Looking at the first few lines of the data
data.take(5).foreach(println) 
prev_id curr_id n prev_title curr_title type 3632887 121 other-google !! other 3632887 93 other-wikipedia !! other 3632887 46 other-empty !! other 3632887 10 other-other !! other
data.take(2)
res3: Array[String] = Array(prev_id curr_id n prev_title curr_title type, " 3632887 121 other-google !! other")
  • The first line looks like a header
  • The second line (separated from the first by ",") contains data organized according to the header, i.e., prev_id = 3632887, curr_id = 121", and so on.

Actually, here is the meaning of each column:

  • prev_id: if the referer does not correspond to an article in the main namespace of English Wikipedia, this value will be empty. Otherwise, it contains the unique MediaWiki page ID of the article corresponding to the referer i.e. the previous article the client was on

  • curr_id: the MediaWiki unique page ID of the article the client requested

  • prev_title: the result of mapping the referer URL to the fixed set of values described below

  • curr_title: the title of the article the client requested

  • n: the number of occurrences of the (referer, resource) pair

  • type

    • "link" if the referer and request are both articles and the referer links to the request
    • "redlink" if the referer is an article and links to the request, but the request is not in the production enwiki.page table
    • "other" if the referer and request are both articles but the referer does not link to the request. This can happen when clients search or spoof their refer

Referers were mapped to a fixed set of values corresponding to internal traffic or external traffic from one of the top 5 global traffic sources to English Wikipedia, based on this scheme:

  • an article in the main namespace of English Wikipedia -> the article title
  • any Wikipedia page that is not in the main namespace of English Wikipedia -> other-wikipedia
  • an empty referer -> other-empty
  • a page from any other Wikimedia project -> other-internal
  • Google -> other-google
  • Yahoo -> other-yahoo
  • Bing -> other-bing
  • Facebook -> other-facebook
  • Twitter -> other-twitter
  • anything else -> other-other

In the second line of the file above, we can see there were 121 clicks from Google to the Wikipedia page on "!!" (double exclamation marks). People search for everything!

  • prev_id = (nothing)
  • curr_id = 3632887 --> (Wikipedia page ID)
  • n = 121 (People clicked from Google to this page 121 times in this month.)
  • prev_title = other-google (This data record is for referals from Google.)
  • curr_title = !! (This Wikipedia page is about a double exclamation mark.)
  • type = other

Create a DataFrame from this CSV

  • From the next Spark release - 2.0, CSV as a datasource will be part of Spark's standard release. But, we are using Spark 1.6
// Load the raw dataset stored as a CSV file
val clickstream = sqlContext.
    read.
    format("com.databricks.spark.csv").
    options(Map("header" -> "true", "delimiter" -> "\t", "mode" -> "PERMISSIVE", "inferSchema" -> "true")).
    load("dbfs:///databricks-datasets/wikipedia-datasets/data-001/clickstream/raw-uncompressed")
  
clickstream: org.apache.spark.sql.DataFrame = [prev_id: int, curr_id: int ... 4 more fields]
Print the schema
clickstream.printSchema
root |-- prev_id: integer (nullable = true) |-- curr_id: integer (nullable = true) |-- n: integer (nullable = true) |-- prev_title: string (nullable = true) |-- curr_title: string (nullable = true) |-- type: string (nullable = true)

Display some sample data

display(clickstream)
null3632887121other-google!!other
null363288793other-wikipedia!!other
null363288746other-empty!!other
null363288710other-other!!other
64486363288711!_(disambiguation)!!other
2061699255696219Louden_Up_Now!!!_(album)link
null255696225other-empty!!!_(album)other
null255696216other-google!!!_(album)other
null255696244other-wikipedia!!!_(album)other
64486255696215!_(disambiguation)!!!_(album)link
6007442556962297!!!!!!_(album)link
null689331011other-empty!Hero_(album)other
1921683689331026!Hero!Hero_(album)link
null689331016other-wikipedia!Hero_(album)other
null689331023other-google!Hero_(album)other
81273042260247316Jericho_Rosales!Oka_Tokatlink
359788742260247320List_of_telenovelas_of_ABS-CBN!Oka_Tokatlink
null2260247357other-google!Oka_Tokatother
null2260247312other-wikipedia!Oka_Tokatother
null2260247323other-empty!Oka_Tokatother
73606872260247310Rica_Peralejo!Oka_Tokatlink
371045822260247311Jeepney_TV!Oka_Tokatlink
343765902260247322Oka_Tokat_(2012_TV_series)!Oka_Tokatlink
null681076820other-wikipedia!T.O.O.H.!other
null681076881other-google!T.O.O.H.!other
31976181681076851List_of_death_metal_bands,_!–K!T.O.O.H.!link
null681076835other-empty!T.O.O.H.!other
null324304721other-empty!_(album)other
13374753243047208The_Dismemberment_Plan!_(album)link
3284285324304778The_Dismemberment_Plan_Is_Terrified!_(album)link
null324304728other-wikipedia!_(album)other
209829289948058United_States_military_award_devices"A"_Devicelink
19484489948015USS_Yorktown_(CV-5)"A"_Devicelink
null89948017other-google"A"_Deviceother
null89948013other-empty"A"_Deviceother
null89948029other-wikipedia"A"_Deviceother
87824689948011American_Defense_Service_Medal"A"_Devicelink
85590189948024Overseas_Service_Ribbon"A"_Deviceother
20642789948033USS_Ranger_(CV-4)"A"_Devicelink
77369189948047Antarctica_Service_Medal"A"_Devicelink
2301720128299643Kinsey_Millhone"A"_Is_for_Alibilink
null128299645other-empty"A"_Is_for_Alibiother
null128299610other-yahoo"A"_Is_for_Alibiother
4700061282996207Sue_Grafton"A"_Is_for_Alibilink
null128299618other-other"A"_Is_for_Alibiother
null128299631other-wikipedia"A"_Is_for_Alibiother
null1282996272other-google"A"_Is_for_Alibiother
39606873128299610"W"_Is_for_Wasted"A"_Is_for_Alibilink
26181056900366617And"And"_theory_of_conservatismlink
null9003666109other-wikipedia"And"_theory_of_conservatismother
null900366618other-google"And"_theory_of_conservatismother
null3907252949other-google"Bassy"_Bob_Brockmannother
null3907252910other-other"Bassy"_Bob_Brockmannother
11273993null15Colt_1851_Navy_Revolver"Bigfoot"_Wallaceredlink
125711332503397912"V"_Is_for_Vagina"C"_is_for_(Please_Insert_Sophomoric_Genitalia_Reference_HERE)link
1134682503397924The_Mission"C"_is_for_(Please_Insert_Sophomoric_Genitalia_Reference_HERE)link
140960782503397915Trent_Reznor_discography"C"_is_for_(Please_Insert_Sophomoric_Genitalia_Reference_HERE)other
null2503397942other-empty"C"_is_for_(Please_Insert_Sophomoric_Genitalia_Reference_HERE)other
13756142503397915Tapeworm_(band)"C"_is_for_(Please_Insert_Sophomoric_Genitalia_Reference_HERE)other
1595472503397925Milla_Jovovich"C"_is_for_(Please_Insert_Sophomoric_Genitalia_Reference_HERE)other
286393972503397973Sound_into_Blood_into_Wine"C"_is_for_(Please_Insert_Sophomoric_Genitalia_Reference_HERE)link
18934652503397930Carina_Round"C"_is_for_(Please_Insert_Sophomoric_Genitalia_Reference_HERE)other
336228872503397910Conditions_of_My_Parole"C"_is_for_(Please_Insert_Sophomoric_Genitalia_Reference_HERE)link
1476922503397925Tim_Alexander"C"_is_for_(Please_Insert_Sophomoric_Genitalia_Reference_HERE)other
461979025033979593Puscifer"C"_is_for_(Please_Insert_Sophomoric_Genitalia_Reference_HERE)link
null2503397936other-wikipedia"C"_is_for_(Please_Insert_Sophomoric_Genitalia_Reference_HERE)other
null2503397993other-google"C"_is_for_(Please_Insert_Sophomoric_Genitalia_Reference_HERE)other
69161null51Tết"Chúc_Mừng_Năm_Mới"_or_best_wishes_for_the_new_year.redlink
1438509null14List_of_Old_West_gunfighters"Cool_Hand_Conor"_O'Neillredlink
null3315866820other-google"Crocodile"_Dundeeother
null33158620other-twitter"Crocodile"_Dundeeother
null331586781other-wikipedia"Crocodile"_Dundeeother
48903333158659List_of_Academy_Awards_ceremonies"Crocodile"_Dundeelink
1004060633158638List_of_Australian_films"Crocodile"_Dundeeother
2564144331586154Crocodile_Dundee_in_Los_Angeles"Crocodile"_Dundeelink
612792833158614Bobby_Alto"Crocodile"_Dundeeother
15217133158613Baz_Luhrmann"Crocodile"_Dundeelink
8078282331586348Australia_(2008_film)"Crocodile"_Dundeelink
37386608331586662015_in_film"Crocodile"_Dundeelink
34557331586121980s"Crocodile"_Dundeeother
1118809331586297"Crocodile"_Dundee_II"Crocodile"_Dundeelink
703333158652Caitlin_Clarke"Crocodile"_Dundeeother
7276633158631Dundee_(disambiguation)"Crocodile"_Dundeeother
1716123315862211986_in_film"Crocodile"_Dundeelink
237645233158634Australian_New_Wave"Crocodile"_Dundeeother
124807433158660David_Gulpilil"Crocodile"_Dundeelink
86524133158610Crocodile_Hunter"Crocodile"_Dundeeother
19602033158612Crocodilia"Crocodile"_Dundeelink
64364933158685List_of_most_watched_television_broadcasts"Crocodile"_Dundeelink
830652133158613Anne_Carlisle"Crocodile"_Dundeeother
144896933158618Bart_vs._Australia"Crocodile"_Dundeeother
70209331586153Cinema_of_Australia"Crocodile"_Dundeelink
40081733315861859th_Academy_Awards"Crocodile"_Dundeelink
33146033158617Bowie_knife"Crocodile"_Dundeelink
3788233158621Crocodile"Crocodile"_Dundeeother
447899343315861283Deaths_in_2015"Crocodile"_Dundeelink
2234457933158630Academy_Award_for_Best_Original_Screenplay"Crocodile"_Dundeelink
187250233158610Boy-Scoutz_'n_the_Hood"Crocodile"_Dundeeother
564433158613Comedy_film"Crocodile"_Dundeelink
45834033158610List_of_films_set_in_New_York_City"Crocodile"_Dundeeother

Showing the first 1000 rows.

Display is a utility provided by Databricks. If you are programming directly in Spark, use the show(numRows: Int) function of DataFrame

clickstream.show(5)
+-------+-------+---+------------------+----------+-----+ |prev_id|curr_id| n| prev_title|curr_title| type| +-------+-------+---+------------------+----------+-----+ | null|3632887|121| other-google| !!|other| | null|3632887| 93| other-wikipedia| !!|other| | null|3632887| 46| other-empty| !!|other| | null|3632887| 10| other-other| !!|other| | 64486|3632887| 11|!_(disambiguation)| !!|other| +-------+-------+---+------------------+----------+-----+ only showing top 5 rows

Reading from disk vs memory

The 1.2 GB Clickstream file is currently on S3, which means each time you scan through it, your Spark cluster has to read the 1.2 GB of data remotely over the network.

Call the count() action to check how many rows are in the DataFrame and to see how long it takes to read the DataFrame from S3.

clickstream.cache().count()
res7: Long = 22509897
  • It took about several minutes to read the 1.2 GB file into your Spark cluster. The file has 22.5 million rows/lines.
  • Although we have called cache, remember that it is evaluated (cached) only when an action(count) is called

Now call count again to see how much faster it is to read from memory

clickstream.count()
res8: Long = 22509897
  • Orders of magnitude faster!
  • If you are going to be using the same data source multiple times, it is better to cache it in memory

What are the top 10 articles requested?

To do this we also need to order by the sum of column n, in descending order.

//Type in your answer here...
display(clickstream
  .select(clickstream("curr_title"), clickstream("n"))
  .groupBy("curr_title")
  .sum()
  .orderBy($"sum(n)".desc)
  .limit(10))
Main_Page127500620
87th_Academy_Awards2559794
Fifty_Shades_of_Grey2326175
Alive2244781
Chris_Kyle1709341
Fifty_Shades_of_Grey_(film)1683892
Deaths_in_20151614577
Birdman_(film)1545842
Islamic_State_of_Iraq_and_the_Levant1406530
Stephen_Hawking1384193

Who sent the most traffic to Wikipedia in Feb 2015?

In other words, who were the top referers to Wikipedia?

display(clickstream
  .select(clickstream("prev_title"), clickstream("n"))
  .groupBy("prev_title")
  .sum()
  .orderBy($"sum(n)".desc)
  .limit(10))
other-google1496209976
other-empty347693595
other-wikipedia129772279
other-other77569671
other-bing65962792
other-yahoo48501171
Main_Page29923502
other-twitter19241298
other-facebook2314026
87th_Academy_Awards1680675

As expected, the top referer by a large margin is Google. Next comes refererless traffic (usually clients using HTTPS). The third largest sender of traffic to English Wikipedia are Wikipedia pages that are not in the main namespace (ns = 0) of English Wikipedia. Learn about the Wikipedia namespaces here: https://en.wikipedia.org/wiki/Wikipedia:Project_namespace

Also, note that Twitter sends 10x more requests to Wikipedia than Facebook.

What were the top 5 trending articles people from Twitter were looking up in Wikipedia?

//Type in your answer here...
display(clickstream
  .select(clickstream("curr_title"), clickstream("prev_title"), clickstream("n"))
  .filter("prev_title = 'other-twitter'")
  .groupBy("curr_title")
  .sum()
  .orderBy($"sum(n)".desc)
  .limit(5))
Johnny_Knoxville198908
Peter_Woodcock126259
2002_Tampa_plane_crash119906
Sơn_Đoòng_Cave116012
The_boy_Jones114401

What percentage of page visits in Wikipedia are from other pages in Wikipedia itself?

val allClicks = clickstream.selectExpr("sum(n)").first.getLong(0)
val referals = clickstream.
                filter(clickstream("prev_id").isNotNull).
                selectExpr("sum(n)").first.getLong(0)
(referals * 100.0) / allClicks
allClicks: Long = 3283067885 referals: Long = 1095462001 res12: Double = 33.36702253416853

Register the DataFrame to perform more complex queries

clickstream.createOrReplaceTempView("clicks")

Which Wikipedia pages have the most referrals to the Donald Trump page?

%sql   
SELECT *
FROM clicks
WHERE 
  curr_title = 'Donald_Trump' AND
  prev_id IS NOT NULL AND prev_title != 'Main_Page'
ORDER BY n DESC
LIMIT 20
186144148482724658Ivanka_TrumpDonald_Trumplink
484827248482722212Donald_TrumpDonald_Trumplink
120907548482721855Melania_TrumpDonald_Trumplink
105788748482721760Ivana_TrumpDonald_Trumplink
567911948482721074Donald_Trump_Jr.Donald_Trumplink
213772514848272918United_States_presidential_election,_2016Donald_Trumplink
80955894848272728Eric_TrumpDonald_Trumplink
4738064848272652Marla_MaplesDonald_Trumplink
25651364848272651The_Trump_OrganizationDonald_Trumplink
99176934848272599The_Celebrity_ApprenticeDonald_Trumplink
92894804848272597The_Apprentice_(U.S._TV_series)Donald_Trumplink
2903274848272596German_AmericanDonald_Trumplink
126434974848272585Comedy_Central_RoastDonald_Trumplink
376439994848272549Republican_Party_presidential_candidates,_2016Donald_Trumplink
4175594848272543Alan_SugarDonald_Trumplink
12033164848272489Fred_TrumpDonald_Trumplink
3039514848272426Vince_McMahonDonald_Trumplink
61910534848272413Jared_KushnerDonald_Trumplink
12952164848272412Trump_Tower_(New_York_City)Donald_Trumplink
65092784848272402TrumpDonald_Trumplink

Top referrers to all presidential candidate pages

%sql   
-- FIXME (broke query, will get back to it later)
SELECT *
FROM clicks
WHERE 
  prev_id IS NOT NULL
ORDER BY n DESC
LIMIT 20
1558037444789934769616Main_PageDeaths_in_2015link
3516685040218034368694Fifty_Shades_of_GreyFifty_Shades_of_Grey_(film)link
402180347000810284352Fifty_Shades_of_Grey_(film)Dakota_Johnsonlink
3579370637371793253460Arrow_(TV_series)List_of_Arrow_episodeslink
3516685043180929249155Fifty_Shades_of_GreyFifty_Shades_Darkerlink
402180346138391228742Fifty_Shades_of_Grey_(film)Jamie_Dornanlink
4318092935910161220788Fifty_Shades_DarkerFifty_Shades_Freedlink
2767661640265175192321The_Walking_Dead_(TV_series)The_Walking_Dead_(season_5)link
61383911076962185700Jamie_DornanAmelia_Warnerlink
1937614844375105185449Stephen_HawkingJane_Wilde_Hawkinglink
2767661628074027161407The_Walking_Dead_(TV_series)List_of_The_Walking_Dead_episodeslink
3414912341844524161081List_of_The_Flash_episodesThe_Flash_(2014_TV_series)other
1126960513542396156313The_Big_Bang_TheoryList_of_The_Big_Bang_Theory_episodeslink
3946243134271398152892American_Sniper_(film)Chris_Kylelink
155803741738148148820Main_PageLimpetother
1558037445298077140335Main_PageTransAsia_Airways_Flight_235other
7000810484101139682Dakota_JohnsonMelanie_Griffithlink
4511931042567340138179Take_Me_to_ChurchTake_Me_to_Church_(Hozier_song)link
3896278741126542136236The_Blacklist_(TV_series)List_of_The_Blacklist_episodeslink
3226276745305174135900Better_Call_SaulUno_(Better_Call_Saul)link

Load a visualization library

This code is copied after doing a live google search (by Michael Armbrust at Spark Summit East February 2016 shared from https://twitter.com/michaelarmbrust/status/699969850475737088). The d3ivan package is an updated version of the original package used by Michael Armbrust as it needed some TLC for Spark 2.2 on newer databricks notebook. These changes were kindly made by Ivan Sadikov from Middle Earth.

Warning: classes defined within packages cannot be redefined without a cluster restart. Compilation successful.
d3ivan.graphs.help()
d3ivan.graphs.force(
  height = 800,
  width = 1000,
  clicks = sql("""
    SELECT 
      prev_title AS src,
      curr_title AS dest,
      n AS count FROM clicks
    WHERE 
      curr_title IN ('Donald_Trump', 'Bernie_Sanders', 'Hillary_Rodham_Clinton', 'Ted_Cruz') AND
      prev_id IS NOT NULL AND prev_title != 'Main_Page'
    ORDER BY n DESC
    LIMIT 20""").as[d3ivan.Edge])

Convert raw data to parquet

Apache Parquet is a columnar storage format available to any project in the Hadoop ecosystem, regardless of the choice of data processing framework, data model or programming language. It is a more efficient way to store data frames.

// Convert the DatFrame to a more efficent format to speed up our analysis
clickstream.
  write.
  mode(SaveMode.Overwrite).
  parquet("/datasets/wiki-clickstream") // warnings are harmless

Load parquet file efficiently and quickly into a DataFrame

Now we can simply load from this parquet file next time instead of creating the RDD from the text file (much slower).

Also using parquet files to store DataFrames allows us to go between languages quickly in a a scalable manner.

val clicks = sqlContext.read.parquet("/datasets/wiki-clickstream")
clicks: org.apache.spark.sql.DataFrame = [prev_id: int, curr_id: int ... 4 more fields]
clicks.printSchema
root |-- prev_id: integer (nullable = true) |-- curr_id: integer (nullable = true) |-- n: integer (nullable = true) |-- prev_title: string (nullable = true) |-- curr_title: string (nullable = true) |-- type: string (nullable = true)
display(clicks)  // let's display this DataFrame
700988116400321MayallJohn_Mayalllink
47678616400386Mick_TaylorJohn_Mayalllink
1973554716400310Peter_Green_discographyJohn_Mayalllink
24413616400310MacclesfieldJohn_Mayalllink
3310575516400313The_YardbirdsJohn_Mayalllink
891043016400334The_Turning_Point_(John_Mayall_album)John_Mayalllink
32987816400310Steve_MarriottJohn_Mayalllink
null164003652other-emptyJohn_Mayallother
null147396134other-bingJohn_Mayall_&_the_Bluesbreakersother
1786548414739613Timeline_of_heavy_metal_and_hard_rock_musicJohn_Mayall_&_the_Bluesbreakersother
1558037414739694Main_PageJohn_Mayall_&_the_Bluesbreakersother
16825414739623Paul_ButterfieldJohn_Mayall_&_the_Bluesbreakerslink
322138147396283Peter_Green_(musician)John_Mayall_&_the_Bluesbreakerslink
null14739679other-otherJohn_Mayall_&_the_Bluesbreakersother
1215492614739613Marshall_BluesbreakerJohn_Mayall_&_the_Bluesbreakerslink
22391014739612Robben_FordJohn_Mayall_&_the_Bluesbreakersother
1443363714739610Parchman_Farm_(song)John_Mayall_&_the_Bluesbreakerslink
476786147396213Mick_TaylorJohn_Mayall_&_the_Bluesbreakerslink
1895228214739613Ric_GrechJohn_Mayall_&_the_Bluesbreakersother
411374114739650Rolling_Stone's_500_Greatest_Albums_of_All_TimeJohn_Mayall_&_the_Bluesbreakerslink
3666814739664Mick_FleetwoodJohn_Mayall_&_the_Bluesbreakerslink
null147396328other-emptyJohn_Mayall_&_the_Bluesbreakersother
16670514739610Thin_LizzyJohn_Mayall_&_the_Bluesbreakerslink
33105755147396115The_YardbirdsJohn_Mayall_&_the_Bluesbreakerslink
607139214739645Walter_TroutJohn_Mayall_&_the_Bluesbreakersother
null147396269other-wikipediaJohn_Mayall_&_the_Bluesbreakersother
null14739621other-twitterJohn_Mayall_&_the_Bluesbreakersother
null1473961632other-googleJohn_Mayall_&_the_Bluesbreakersother
null14739684other-yahooJohn_Mayall_&_the_Bluesbreakersother
27719751473961770th_Birthday_ConcertJohn_Mayall_&_the_Bluesbreakerslink
315352614739621Crossroads_(Eric_Clapton_album)John_Mayall_&_the_Bluesbreakerslink
156875314739638Aynsley_DunbarJohn_Mayall_&_the_Bluesbreakerslink
205616714739610Andy_FraserJohn_Mayall_&_the_Bluesbreakerslink
8145614739614Brian_JonesJohn_Mayall_&_the_Bluesbreakersother
592789914739664Crusade_(album)John_Mayall_&_the_Bluesbreakerslink
277197814739651A_Hard_RoadJohn_Mayall_&_the_Bluesbreakerslink
589882814739632Coco_MontoyaJohn_Mayall_&_the_Bluesbreakerslink
16767414739635Canned_HeatJohn_Mayall_&_the_Bluesbreakerslink
164003147396303John_MayallJohn_Mayall_&_the_Bluesbreakerslink
314567714739628Joe_BonamassaJohn_Mayall_&_the_Bluesbreakerslink
1181775814739628For_Your_LoveJohn_Mayall_&_the_Bluesbreakerslink
537530147396106John_McVieJohn_Mayall_&_the_Bluesbreakerslink
4282623114739614List_of_1960s_musical_artistsJohn_Mayall_&_the_Bluesbreakersother
1356751114739614I'm_Your_WitchdoctorJohn_Mayall_&_the_Bluesbreakerslink
28063614739626Gary_MooreJohn_Mayall_&_the_Bluesbreakersother
1235871614739614List_of_blues_rock_musiciansJohn_Mayall_&_the_Bluesbreakerslink
125935714739612Don_"Sugarcane"_HarrisJohn_Mayall_&_the_Bluesbreakerslink
2410549114739629Bare_WiresJohn_Mayall_&_the_Bluesbreakerslink
465823314739610Crawling_up_a_HillJohn_Mayall_&_the_Bluesbreakerslink
632994147396101Blues_rockJohn_Mayall_&_the_Bluesbreakerslink
17235614739610British_bluesJohn_Mayall_&_the_Bluesbreakerslink
53414147396114Cream_(band)John_Mayall_&_the_Bluesbreakerslink
335214739610BluesJohn_Mayall_&_the_Bluesbreakerslink
431414739643Black_SabbathJohn_Mayall_&_the_Bluesbreakerslink
11787147396286Fleetwood_MacJohn_Mayall_&_the_Bluesbreakerslink
3776567114739620Greg_RzabJohn_Mayall_&_the_Bluesbreakerslink
1621114739618John_MartynJohn_Mayall_&_the_Bluesbreakersother
359846614739675Eric_Clapton_discographyJohn_Mayall_&_the_Bluesbreakerslink
10049147396481Eric_ClaptonJohn_Mayall_&_the_Bluesbreakerslink
1944012814739631John_Mayall_Plays_John_MayallJohn_Mayall_&_the_Bluesbreakerslink
17880214739660Jack_BruceJohn_Mayall_&_the_Bluesbreakerslink
47960014739613Fresh_CreamJohn_Mayall_&_the_Bluesbreakerslink
null2291648243other-emptyJohn_Mayberry,_Jr.other
308472291648215Toronto_Blue_JaysJohn_Mayberry,_Jr.link
2172822916482107New_York_MetsJohn_Mayberry,_Jr.link
26747682291648252John_MayberryJohn_Mayberry,_Jr.link
1309480622916482242002_Major_League_Baseball_DraftJohn_Mayberry,_Jr.link
4377567422916482542015_New_York_Mets_seasonJohn_Mayberry,_Jr.link
null2291648225other-otherJohn_Mayberry,_Jr.other
null2291648223other-bingJohn_Mayberry,_Jr.other
220183002291648218List_of_current_Major_League_Baseball_team_rostersJohn_Mayberry,_Jr.link
19292532291648219List_of_second-generation_Major_League_Baseball_playersJohn_Mayberry,_Jr.link
974962722916482112005_Major_League_Baseball_DraftJohn_Mayberry,_Jr.link
4066415022916482122014_Toronto_Blue_Jays_seasonJohn_Mayberry,_Jr.link
null2291648224other-wikipediaJohn_Mayberry,_Jr.other
null22916482379other-googleJohn_Mayberry,_Jr.other
null2291648223other-yahooJohn_Mayberry,_Jr.other
null2121872410other-yahooJohn_Mayerother
560068621218724Cars_(soundtrack)John_Mayerlink
1186010721218716As_I_AmJohn_Mayerlink
12339121218736Bozeman,_MontanaJohn_Mayerlink
3237035321218710+_(Ed_Sheeran_album)John_Mayerother
26127858212187130Bucky_Larson:_Born_to_Be_a_StarJohn_Mayerlink
128251021218718As/IsJohn_Mayerlink
1252086021218710Demi_LovatoJohn_Mayerlink
681420921218710Contact_granulomaJohn_Mayerlink
848854521218710Call_Me_the_BreezeJohn_Mayerlink
63299421218778Blues_rockJohn_Mayerlink
1343824721218732Andy_Cohen_(television_personality)John_Mayerother
242641842121871853rd_Annual_Grammy_AwardsJohn_Mayerlink
1241670921218710Colbie_CaillatJohn_Mayerlink
675407221218710Daniel_Levy_(TV_personality)John_Mayerlink
23703056212187141Battle_Studies_(album)John_Mayerlink
1304116321218726AdeleJohn_Mayerlink
777232621218717Aware_RecordsJohn_Mayerlink
1316910121218726Comedy_CellarJohn_Mayerlink
157337921218764Dear_John_letterJohn_Mayerlink
3938406421218739657th_Annual_Grammy_AwardsJohn_Mayerlink
1260737212187120Crossroads_Guitar_FestivalJohn_Mayerlink
630466321218777Daughters_(John_Mayer_song)John_Mayerlink

Showing the first 1000 rows.

DataFrame in python
%py
clicksPy = sqlContext.read.parquet("/datasets/wiki-clickstream")
%py
# in Python you need to put the object int its own line like this to get the type information
clicksPy 
Out[2]: DataFrame[prev_id: int, curr_id: int, n: int, prev_title: string, curr_title: string, type: string]
%py
clicksPy.show()
+--------+-------+---+--------------------+--------------------+-----+ | prev_id|curr_id| n| prev_title| curr_title| type| +--------+-------+---+--------------------+--------------------+-----+ | 7009881| 164003| 21| Mayall| John_Mayall| link| | 476786| 164003| 86| Mick_Taylor| John_Mayall| link| |19735547| 164003| 10|Peter_Green_disco...| John_Mayall| link| | 244136| 164003| 10| Macclesfield| John_Mayall| link| |33105755| 164003| 13| The_Yardbirds| John_Mayall| link| | 8910430| 164003| 34|The_Turning_Point...| John_Mayall| link| | 329878| 164003| 10| Steve_Marriott| John_Mayall| link| | null| 164003|652| other-empty| John_Mayall|other| | null| 147396|134| other-bing|John_Mayall_&_the...|other| |17865484| 147396| 13|Timeline_of_heavy...|John_Mayall_&_the...|other| |15580374| 147396| 94| Main_Page|John_Mayall_&_the...|other| | 168254| 147396| 23| Paul_Butterfield|John_Mayall_&_the...| link| | 322138| 147396|283|Peter_Green_(musi...|John_Mayall_&_the...| link| | null| 147396| 79| other-other|John_Mayall_&_the...|other| |12154926| 147396| 13|Marshall_Bluesbre...|John_Mayall_&_the...| link| | 223910| 147396| 12| Robben_Ford|John_Mayall_&_the...|other| |14433637| 147396| 10|Parchman_Farm_(song)|John_Mayall_&_the...| link| | 476786| 147396|213| Mick_Taylor|John_Mayall_&_the...| link| |18952282| 147396| 13| Ric_Grech|John_Mayall_&_the...|other| | 4113741| 147396| 50|Rolling_Stone's_5...|John_Mayall_&_the...| link| +--------+-------+---+--------------------+--------------------+-----+ only showing top 20 rows

Now you can continue from the original python notebook tweeted by Michael.

Recall from the beginning of this notebook that this python databricks notebook was used in the talk by Michael Armbrust at Spark Summit East February 2016 shared from https://twitter.com/michaelarmbrust/status/699969850475737088

(watch now, if you haven't already!)

Michael Armbrust Spark Summit East

You Try!

Try to laoad a DataFrame in R from the parquet file just as we did for python. Read the docs in databricks guide first:

And see the R example in the Programming Guide:

%r
library(SparkR)

# just a quick test
df <- createDataFrame(faithful)
head(df)

Attaching package: ‘SparkR’

The following objects are masked from ‘package:stats’:

    cov, filter, lag, na.omit, predict, sd, var, window

The following objects are masked from ‘package:base’:

    as.data.frame, colnames, colnames<-, drop, intersect, rank, rbind,
    sample, subset, summary, transform, union

  eruptions waiting
1     3.600      79
2     1.800      54
3     3.333      74
4     2.283      62
5     4.533      85
6     2.883      55
%r
# Read in the Parquet file created above. Parquet files are self-describing so the schema is preserved.
# The result of loading a parquet file is also a DataFrame.
clicksR <- read.df("/datasets/wiki-clickstream", source = "parquet")
clicksR # in R you need to put the object int its own line like this to get the type information
SparkDataFrame[prev_id:int, curr_id:int, n:int, prev_title:string, curr_title:string, type:string]
%r
head(clicksR)
   prev_id curr_id  n                            prev_title  curr_title type
1  7009881  164003 21                                Mayall John_Mayall link
2   476786  164003 86                           Mick_Taylor John_Mayall link
3 19735547  164003 10               Peter_Green_discography John_Mayall link
4   244136  164003 10                          Macclesfield John_Mayall link
5 33105755  164003 13                         The_Yardbirds John_Mayall link
6  8910430  164003 34 The_Turning_Point_(John_Mayall_album) John_Mayall link
%r
display(clicksR)
700988116400321MayallJohn_Mayalllink
47678616400386Mick_TaylorJohn_Mayalllink
1973554716400310Peter_Green_discographyJohn_Mayalllink
24413616400310MacclesfieldJohn_Mayalllink
3310575516400313The_YardbirdsJohn_Mayalllink
891043016400334The_Turning_Point_(John_Mayall_album)John_Mayalllink
32987816400310Steve_MarriottJohn_Mayalllink
null164003652other-emptyJohn_Mayallother
null147396134other-bingJohn_Mayall_&_the_Bluesbreakersother
1786548414739613Timeline_of_heavy_metal_and_hard_rock_musicJohn_Mayall_&_the_Bluesbreakersother
1558037414739694Main_PageJohn_Mayall_&_the_Bluesbreakersother
16825414739623Paul_ButterfieldJohn_Mayall_&_the_Bluesbreakerslink
322138147396283Peter_Green_(musician)John_Mayall_&_the_Bluesbreakerslink
null14739679other-otherJohn_Mayall_&_the_Bluesbreakersother
1215492614739613Marshall_BluesbreakerJohn_Mayall_&_the_Bluesbreakerslink
22391014739612Robben_FordJohn_Mayall_&_the_Bluesbreakersother
1443363714739610Parchman_Farm_(song)John_Mayall_&_the_Bluesbreakerslink
476786147396213Mick_TaylorJohn_Mayall_&_the_Bluesbreakerslink
1895228214739613Ric_GrechJohn_Mayall_&_the_Bluesbreakersother
411374114739650Rolling_Stone's_500_Greatest_Albums_of_All_TimeJohn_Mayall_&_the_Bluesbreakerslink
3666814739664Mick_FleetwoodJohn_Mayall_&_the_Bluesbreakerslink
null147396328other-emptyJohn_Mayall_&_the_Bluesbreakersother
16670514739610Thin_LizzyJohn_Mayall_&_the_Bluesbreakerslink
33105755147396115The_YardbirdsJohn_Mayall_&_the_Bluesbreakerslink
607139214739645Walter_TroutJohn_Mayall_&_the_Bluesbreakersother
null147396269other-wikipediaJohn_Mayall_&_the_Bluesbreakersother
null14739621other-twitterJohn_Mayall_&_the_Bluesbreakersother
null1473961632other-googleJohn_Mayall_&_the_Bluesbreakersother
null14739684other-yahooJohn_Mayall_&_the_Bluesbreakersother
27719751473961770th_Birthday_ConcertJohn_Mayall_&_the_Bluesbreakerslink
315352614739621Crossroads_(Eric_Clapton_album)John_Mayall_&_the_Bluesbreakerslink
156875314739638Aynsley_DunbarJohn_Mayall_&_the_Bluesbreakerslink
205616714739610Andy_FraserJohn_Mayall_&_the_Bluesbreakerslink
8145614739614Brian_JonesJohn_Mayall_&_the_Bluesbreakersother
592789914739664Crusade_(album)John_Mayall_&_the_Bluesbreakerslink
277197814739651A_Hard_RoadJohn_Mayall_&_the_Bluesbreakerslink
589882814739632Coco_MontoyaJohn_Mayall_&_the_Bluesbreakerslink
16767414739635Canned_HeatJohn_Mayall_&_the_Bluesbreakerslink
164003147396303John_MayallJohn_Mayall_&_the_Bluesbreakerslink
314567714739628Joe_BonamassaJohn_Mayall_&_the_Bluesbreakerslink
1181775814739628For_Your_LoveJohn_Mayall_&_the_Bluesbreakerslink
537530147396106John_McVieJohn_Mayall_&_the_Bluesbreakerslink
4282623114739614List_of_1960s_musical_artistsJohn_Mayall_&_the_Bluesbreakersother
1356751114739614I'm_Your_WitchdoctorJohn_Mayall_&_the_Bluesbreakerslink
28063614739626Gary_MooreJohn_Mayall_&_the_Bluesbreakersother
1235871614739614List_of_blues_rock_musiciansJohn_Mayall_&_the_Bluesbreakerslink
125935714739612Don_"Sugarcane"_HarrisJohn_Mayall_&_the_Bluesbreakerslink
2410549114739629Bare_WiresJohn_Mayall_&_the_Bluesbreakerslink
465823314739610Crawling_up_a_HillJohn_Mayall_&_the_Bluesbreakerslink
632994147396101Blues_rockJohn_Mayall_&_the_Bluesbreakerslink
17235614739610British_bluesJohn_Mayall_&_the_Bluesbreakerslink
53414147396114Cream_(band)John_Mayall_&_the_Bluesbreakerslink
335214739610BluesJohn_Mayall_&_the_Bluesbreakerslink
431414739643Black_SabbathJohn_Mayall_&_the_Bluesbreakerslink
11787147396286Fleetwood_MacJohn_Mayall_&_the_Bluesbreakerslink
3776567114739620Greg_RzabJohn_Mayall_&_the_Bluesbreakerslink
1621114739618John_MartynJohn_Mayall_&_the_Bluesbreakersother
359846614739675Eric_Clapton_discographyJohn_Mayall_&_the_Bluesbreakerslink
10049147396481Eric_ClaptonJohn_Mayall_&_the_Bluesbreakerslink
1944012814739631John_Mayall_Plays_John_MayallJohn_Mayall_&_the_Bluesbreakerslink
17880214739660Jack_BruceJohn_Mayall_&_the_Bluesbreakerslink
47960014739613Fresh_CreamJohn_Mayall_&_the_Bluesbreakerslink
null2291648243other-emptyJohn_Mayberry,_Jr.other
308472291648215Toronto_Blue_JaysJohn_Mayberry,_Jr.link
2172822916482107New_York_MetsJohn_Mayberry,_Jr.link
26747682291648252John_MayberryJohn_Mayberry,_Jr.link
1309480622916482242002_Major_League_Baseball_DraftJohn_Mayberry,_Jr.link
4377567422916482542015_New_York_Mets_seasonJohn_Mayberry,_Jr.link
null2291648225other-otherJohn_Mayberry,_Jr.other
null2291648223other-bingJohn_Mayberry,_Jr.other
220183002291648218List_of_current_Major_League_Baseball_team_rostersJohn_Mayberry,_Jr.link
19292532291648219List_of_second-generation_Major_League_Baseball_playersJohn_Mayberry,_Jr.link
974962722916482112005_Major_League_Baseball_DraftJohn_Mayberry,_Jr.link
4066415022916482122014_Toronto_Blue_Jays_seasonJohn_Mayberry,_Jr.link
null2291648224other-wikipediaJohn_Mayberry,_Jr.other
null22916482379other-googleJohn_Mayberry,_Jr.other
null2291648223other-yahooJohn_Mayberry,_Jr.other
null2121872410other-yahooJohn_Mayerother
560068621218724Cars_(soundtrack)John_Mayerlink
1186010721218716As_I_AmJohn_Mayerlink
12339121218736Bozeman,_MontanaJohn_Mayerlink
3237035321218710+_(Ed_Sheeran_album)John_Mayerother
26127858212187130Bucky_Larson:_Born_to_Be_a_StarJohn_Mayerlink
128251021218718As/IsJohn_Mayerlink
1252086021218710Demi_LovatoJohn_Mayerlink
681420921218710Contact_granulomaJohn_Mayerlink
848854521218710Call_Me_the_BreezeJohn_Mayerlink
63299421218778Blues_rockJohn_Mayerlink
1343824721218732Andy_Cohen_(television_personality)John_Mayerother
242641842121871853rd_Annual_Grammy_AwardsJohn_Mayerlink
1241670921218710Colbie_CaillatJohn_Mayerlink
675407221218710Daniel_Levy_(TV_personality)John_Mayerlink
23703056212187141Battle_Studies_(album)John_Mayerlink
1304116321218726AdeleJohn_Mayerlink
777232621218717Aware_RecordsJohn_Mayerlink
1316910121218726Comedy_CellarJohn_Mayerlink
157337921218764Dear_John_letterJohn_Mayerlink
3938406421218739657th_Annual_Grammy_AwardsJohn_Mayerlink
1260737212187120Crossroads_Guitar_FestivalJohn_Mayerlink
630466321218777Daughters_(John_Mayer_song)John_Mayerlink

Showing the first 1000 rows.