010_wikipediaClickStream_01ETLEDA(Scala)

Wiki Clickstream Analysis

Dataset: 3.2 billion requests collected during the month of February 2015 grouped by (src, dest)

Source: https://datahub.io/dataset/wikipedia-clickstream/

NY clickstream image

This notebook requires Spark 1.6+.

This notebook was originally a data analysis workflow developed with Databricks Community Edition, a free version of Databricks designed for learning Apache Spark.

Here we elucidate the original python notebook (also linked here) used in the talk by Michael Armbrust at Spark Summit East February 2016 shared from https://twitter.com/michaelarmbrust/status/699969850475737088 (watch later)

Michael Armbrust Spark Summit East

Data set

Wikipedia Logo

The data we are exploring in this lab is the February 2015 English Wikipedia Clickstream data, and it is available here: http://datahub.io/dataset/wikipedia-clickstream/resource/be85cc68-d1e6-4134-804a-fd36b94dbb82.

According to Wikimedia:

"The data contains counts of (referer, resource) pairs extracted from the request logs of English Wikipedia. When a client requests a resource by following a link or performing a search, the URI of the webpage that linked to the resource is included with the request in an HTTP header called the "referer". This data captures 22 million (referer, resource) pairs from a total of 3.2 billion requests collected during the month of February 2015."

The data is approximately 1.2GB and it is hosted in the following Databricks file: /databricks-datasets/wikipedia-datasets/data-001/clickstream/raw-uncompressed

display(dbutils.fs.ls("/databricks-datasets/wikipedia-datasets/"))
dbfs:/databricks-datasets/wikipedia-datasets/data-001/data-001/0

Let us first understand this Wikimedia data set a bit more

Let's read the datahub-hosted link https://datahub.io/dataset/wikipedia-clickstream in the embedding below. Also click the blog by Ellery Wulczyn, Data Scientist at The Wikimedia Foundation, to better understand how the data was generated (remember to Right-Click and use -> and <- if navigating within the embedded html frame below).

Show code

Run the next two cells for some housekeeping.

if (org.apache.spark.BuildInfo.sparkBranch < "1.6") sys.error("Attach this notebook to a cluster running Spark 1.6+")

Loading and Exploring the data

val data = sc.textFile("dbfs:///databricks-datasets/wikipedia-datasets/data-001/clickstream/raw-uncompressed")
data: org.apache.spark.rdd.RDD[String] = dbfs:///databricks-datasets/wikipedia-datasets/data-001/clickstream/raw-uncompressed MapPartitionsRDD[582] at textFile at command-112937334110573:1
Looking at the first few lines of the data
data.take(5).foreach(println) 
prev_id curr_id n prev_title curr_title type 3632887 121 other-google !! other 3632887 93 other-wikipedia !! other 3632887 46 other-empty !! other 3632887 10 other-other !! other
data.take(2)
res4: Array[String] = Array(prev_id curr_id n prev_title curr_title type, " 3632887 121 other-google !! other")
  • The first line looks like a header
  • The second line (separated from the first by ",") contains data organized according to the header, i.e., prev_id = 3632887, curr_id = 121", and so on.

Actually, here is the meaning of each column:

  • prev_id: if the referer does not correspond to an article in the main namespace of English Wikipedia, this value will be empty. Otherwise, it contains the unique MediaWiki page ID of the article corresponding to the referer i.e. the previous article the client was on

  • curr_id: the MediaWiki unique page ID of the article the client requested

  • prev_title: the result of mapping the referer URL to the fixed set of values described below

  • curr_title: the title of the article the client requested

  • n: the number of occurrences of the (referer, resource) pair

  • type

    • "link" if the referer and request are both articles and the referer links to the request
    • "redlink" if the referer is an article and links to the request, but the request is not in the production enwiki.page table
    • "other" if the referer and request are both articles but the referer does not link to the request. This can happen when clients search or spoof their refer

Referers were mapped to a fixed set of values corresponding to internal traffic or external traffic from one of the top 5 global traffic sources to English Wikipedia, based on this scheme:

  • an article in the main namespace of English Wikipedia -> the article title
  • any Wikipedia page that is not in the main namespace of English Wikipedia -> other-wikipedia
  • an empty referer -> other-empty
  • a page from any other Wikimedia project -> other-internal
  • Google -> other-google
  • Yahoo -> other-yahoo
  • Bing -> other-bing
  • Facebook -> other-facebook
  • Twitter -> other-twitter
  • anything else -> other-other

In the second line of the file above, we can see there were 121 clicks from Google to the Wikipedia page on "!!" (double exclamation marks). People search for everything!

  • prev_id = (nothing)
  • curr_id = 3632887 --> (Wikipedia page ID)
  • n = 121 (People clicked from Google to this page 121 times in this month.)
  • prev_title = other-google (This data record is for referals from Google.)
  • curr_title = !! (This Wikipedia page is about a double exclamation mark.)
  • type = other

Create a DataFrame from this CSV

  • From the next Spark release - 2.0, CSV as a datasource will be part of Spark's standard release. But, we are using Spark 1.6
// Load the raw dataset stored as a CSV file
val clickstream = sqlContext
    .read
    .format("com.databricks.spark.csv")
    .options(Map("header" -> "true", "delimiter" -> "\t", "mode" -> "PERMISSIVE", "inferSchema" -> "true"))
    .load("dbfs:///databricks-datasets/wikipedia-datasets/data-001/clickstream/raw-uncompressed")
  
clickstream: org.apache.spark.sql.DataFrame = [prev_id: int, curr_id: int ... 4 more fields]
Print the schema
clickstream.printSchema
root |-- prev_id: integer (nullable = true) |-- curr_id: integer (nullable = true) |-- n: integer (nullable = true) |-- prev_title: string (nullable = true) |-- curr_title: string (nullable = true) |-- type: string (nullable = true)

Display some sample data

display(clickstream)
null3632887121other-google!!other
null363288793other-wikipedia!!other
null363288746other-empty!!other
null363288710other-other!!other
64486363288711!_(disambiguation)!!other
2061699255696219Louden_Up_Now!!!_(album)link
null255696225other-empty!!!_(album)other
null255696216other-google!!!_(album)other
null255696244other-wikipedia!!!_(album)other
64486255696215!_(disambiguation)!!!_(album)link
6007442556962297!!!!!!_(album)link
null689331011other-empty!Hero_(album)other
1921683689331026!Hero!Hero_(album)link
null689331016other-wikipedia!Hero_(album)other
null689331023other-google!Hero_(album)other
81273042260247316Jericho_Rosales!Oka_Tokatlink
359788742260247320List_of_telenovelas_of_ABS-CBN!Oka_Tokatlink
null2260247357other-google!Oka_Tokatother
null2260247312other-wikipedia!Oka_Tokatother
null2260247323other-empty!Oka_Tokatother
73606872260247310Rica_Peralejo!Oka_Tokatlink
371045822260247311Jeepney_TV!Oka_Tokatlink
343765902260247322Oka_Tokat_(2012_TV_series)!Oka_Tokatlink
null681076820other-wikipedia!T.O.O.H.!other
null681076881other-google!T.O.O.H.!other
31976181681076851List_of_death_metal_bands,_!–K!T.O.O.H.!link
null681076835other-empty!T.O.O.H.!other
null324304721other-empty!_(album)other
13374753243047208The_Dismemberment_Plan!_(album)link
3284285324304778The_Dismemberment_Plan_Is_Terrified!_(album)link
null324304728other-wikipedia!_(album)other
209829289948058United_States_military_award_devices"A"_Devicelink
19484489948015USS_Yorktown_(CV-5)"A"_Devicelink
null89948017other-google"A"_Deviceother
null89948013other-empty"A"_Deviceother
null89948029other-wikipedia"A"_Deviceother
87824689948011American_Defense_Service_Medal"A"_Devicelink
85590189948024Overseas_Service_Ribbon"A"_Deviceother
20642789948033USS_Ranger_(CV-4)"A"_Devicelink
77369189948047Antarctica_Service_Medal"A"_Devicelink
2301720128299643Kinsey_Millhone"A"_Is_for_Alibilink
null128299645other-empty"A"_Is_for_Alibiother
null128299610other-yahoo"A"_Is_for_Alibiother
4700061282996207Sue_Grafton"A"_Is_for_Alibilink
null128299618other-other"A"_Is_for_Alibiother
null128299631other-wikipedia"A"_Is_for_Alibiother
null1282996272other-google"A"_Is_for_Alibiother
39606873128299610"W"_Is_for_Wasted"A"_Is_for_Alibilink
26181056900366617And"And"_theory_of_conservatismlink
null9003666109other-wikipedia"And"_theory_of_conservatismother
null900366618other-google"And"_theory_of_conservatismother
null3907252949other-google"Bassy"_Bob_Brockmannother
null3907252910other-other"Bassy"_Bob_Brockmannother
11273993null15Colt_1851_Navy_Revolver"Bigfoot"_Wallaceredlink
125711332503397912"V"_Is_for_Vagina"C"_is_for_(Please_Insert_Sophomoric_Genitalia_Reference_HERE)link
1134682503397924The_Mission"C"_is_for_(Please_Insert_Sophomoric_Genitalia_Reference_HERE)link
140960782503397915Trent_Reznor_discography"C"_is_for_(Please_Insert_Sophomoric_Genitalia_Reference_HERE)other
null2503397942other-empty"C"_is_for_(Please_Insert_Sophomoric_Genitalia_Reference_HERE)other
13756142503397915Tapeworm_(band)"C"_is_for_(Please_Insert_Sophomoric_Genitalia_Reference_HERE)other
1595472503397925Milla_Jovovich"C"_is_for_(Please_Insert_Sophomoric_Genitalia_Reference_HERE)other
286393972503397973Sound_into_Blood_into_Wine"C"_is_for_(Please_Insert_Sophomoric_Genitalia_Reference_HERE)link
18934652503397930Carina_Round"C"_is_for_(Please_Insert_Sophomoric_Genitalia_Reference_HERE)other
336228872503397910Conditions_of_My_Parole"C"_is_for_(Please_Insert_Sophomoric_Genitalia_Reference_HERE)link
1476922503397925Tim_Alexander"C"_is_for_(Please_Insert_Sophomoric_Genitalia_Reference_HERE)other
461979025033979593Puscifer"C"_is_for_(Please_Insert_Sophomoric_Genitalia_Reference_HERE)link
null2503397936other-wikipedia"C"_is_for_(Please_Insert_Sophomoric_Genitalia_Reference_HERE)other
null2503397993other-google"C"_is_for_(Please_Insert_Sophomoric_Genitalia_Reference_HERE)other
69161null51Tết"Chúc_Mừng_Năm_Mới"_or_best_wishes_for_the_new_year.redlink
1438509null14List_of_Old_West_gunfighters"Cool_Hand_Conor"_O'Neillredlink
null3315866820other-google"Crocodile"_Dundeeother
null33158620other-twitter"Crocodile"_Dundeeother
null331586781other-wikipedia"Crocodile"_Dundeeother
48903333158659List_of_Academy_Awards_ceremonies"Crocodile"_Dundeelink
1004060633158638List_of_Australian_films"Crocodile"_Dundeeother
2564144331586154Crocodile_Dundee_in_Los_Angeles"Crocodile"_Dundeelink
612792833158614Bobby_Alto"Crocodile"_Dundeeother
15217133158613Baz_Luhrmann"Crocodile"_Dundeelink
8078282331586348Australia_(2008_film)"Crocodile"_Dundeelink
37386608331586662015_in_film"Crocodile"_Dundeelink
34557331586121980s"Crocodile"_Dundeeother
1118809331586297"Crocodile"_Dundee_II"Crocodile"_Dundeelink
703333158652Caitlin_Clarke"Crocodile"_Dundeeother
7276633158631Dundee_(disambiguation)"Crocodile"_Dundeeother
1716123315862211986_in_film"Crocodile"_Dundeelink
237645233158634Australian_New_Wave"Crocodile"_Dundeeother
124807433158660David_Gulpilil"Crocodile"_Dundeelink
86524133158610Crocodile_Hunter"Crocodile"_Dundeeother
19602033158612Crocodilia"Crocodile"_Dundeelink
64364933158685List_of_most_watched_television_broadcasts"Crocodile"_Dundeelink
830652133158613Anne_Carlisle"Crocodile"_Dundeeother
144896933158618Bart_vs._Australia"Crocodile"_Dundeeother
70209331586153Cinema_of_Australia"Crocodile"_Dundeelink
40081733315861859th_Academy_Awards"Crocodile"_Dundeelink
33146033158617Bowie_knife"Crocodile"_Dundeelink
3788233158621Crocodile"Crocodile"_Dundeeother
447899343315861283Deaths_in_2015"Crocodile"_Dundeelink
2234457933158630Academy_Award_for_Best_Original_Screenplay"Crocodile"_Dundeelink
187250233158610Boy-Scoutz_'n_the_Hood"Crocodile"_Dundeeother
564433158613Comedy_film"Crocodile"_Dundeelink
45834033158610List_of_films_set_in_New_York_City"Crocodile"_Dundeeother

Showing the first 1000 rows.

Display is a utility provided by Databricks. If you are programming directly in Spark, use the show(numRows: Int) function of DataFrame

clickstream.show(5)
+-------+-------+---+------------------+----------+-----+ |prev_id|curr_id| n| prev_title|curr_title| type| +-------+-------+---+------------------+----------+-----+ | null|3632887|121| other-google| !!|other| | null|3632887| 93| other-wikipedia| !!|other| | null|3632887| 46| other-empty| !!|other| | null|3632887| 10| other-other| !!|other| | 64486|3632887| 11|!_(disambiguation)| !!|other| +-------+-------+---+------------------+----------+-----+ only showing top 5 rows

Reading from disk vs memory

The 1.2 GB Clickstream file is currently on S3, which means each time you scan through it, your Spark cluster has to read the 1.2 GB of data remotely over the network.

Call the count() action to check how many rows are in the DataFrame and to see how long it takes to read the DataFrame from S3.

clickstream.cache().count()
res9: Long = 22509897
  • It took about several minutes to read the 1.2 GB file into your Spark cluster. The file has 22.5 million rows/lines.
  • Although we have called cache, remember that it is evaluated (cached) only when an action(count) is called

Now call count again to see how much faster it is to read from memory

clickstream.count()
res10: Long = 22509897
  • Orders of magnitude faster!
  • If you are going to be using the same data source multiple times, it is better to cache it in memory

What are the top 10 articles requested?

To do this we also need to order by the sum of column n, in descending order.

//Type in your answer here...
display(clickstream
  .select(clickstream("curr_title"), clickstream("n"))
  .groupBy("curr_title")
  .sum()
  .orderBy($"sum(n)".desc)
  .limit(10))
Main_Page127500620
87th_Academy_Awards2559794
Fifty_Shades_of_Grey2326175
Alive2244781
Chris_Kyle1709341
Fifty_Shades_of_Grey_(film)1683892
Deaths_in_20151614577
Birdman_(film)1545842
Islamic_State_of_Iraq_and_the_Levant1406530
Stephen_Hawking1384193

Who sent the most traffic to Wikipedia in Feb 2015?

In other words, who were the top referers to Wikipedia?

display(clickstream
  .select(clickstream("prev_title"), clickstream("n"))
  .groupBy("prev_title")
  .sum()
  .orderBy($"sum(n)".desc)
  .limit(10))
other-google1496209976
other-empty347693595
other-wikipedia129772279
other-other77569671
other-bing65962792
other-yahoo48501171
Main_Page29923502
other-twitter19241298
other-facebook2314026
87th_Academy_Awards1680675

As expected, the top referer by a large margin is Google. Next comes refererless traffic (usually clients using HTTPS). The third largest sender of traffic to English Wikipedia are Wikipedia pages that are not in the main namespace (ns = 0) of English Wikipedia. Learn about the Wikipedia namespaces here: https://en.wikipedia.org/wiki/Wikipedia:Project_namespace

Also, note that Twitter sends 10x more requests to Wikipedia than Facebook.

What were the top 5 trending articles people from Twitter were looking up in Wikipedia?

//Type in your answer here...
display(clickstream
  .select(clickstream("curr_title"), clickstream("prev_title"), clickstream("n"))
  .filter("prev_title = 'other-twitter'")
  .groupBy("curr_title")
  .sum()
  .orderBy($"sum(n)".desc)
  .limit(5))
Johnny_Knoxville198908
Peter_Woodcock126259
2002_Tampa_plane_crash119906
Sơn_Đoòng_Cave116012
The_boy_Jones114401

What percentage of page visits in Wikipedia are from other pages in Wikipedia itself?

val allClicks = clickstream.selectExpr("sum(n)").first.getLong(0)
val referals = clickstream.
                filter(clickstream("prev_id").isNotNull).
                selectExpr("sum(n)").first.getLong(0)
(referals * 100.0) / allClicks
allClicks: Long = 3283067885 referals: Long = 1095462001 res18: Double = 33.36702253416853

Register the DataFrame to perform more complex queries

clickstream.createOrReplaceTempView("clicks")

Which Wikipedia pages have the most referrals to the Donald Trump page?

%sql   
SELECT *
FROM clicks
WHERE 
  curr_title = 'Donald_Trump' AND
  prev_id IS NOT NULL AND prev_title != 'Main_Page'
ORDER BY n DESC
LIMIT 20
186144148482724658Ivanka_TrumpDonald_Trumplink
484827248482722212Donald_TrumpDonald_Trumplink
120907548482721855Melania_TrumpDonald_Trumplink
105788748482721760Ivana_TrumpDonald_Trumplink
567911948482721074Donald_Trump_Jr.Donald_Trumplink
213772514848272918United_States_presidential_election,_2016Donald_Trumplink
80955894848272728Eric_TrumpDonald_Trumplink
4738064848272652Marla_MaplesDonald_Trumplink
25651364848272651The_Trump_OrganizationDonald_Trumplink
99176934848272599The_Celebrity_ApprenticeDonald_Trumplink
92894804848272597The_Apprentice_(U.S._TV_series)Donald_Trumplink
2903274848272596German_AmericanDonald_Trumplink
126434974848272585Comedy_Central_RoastDonald_Trumplink
376439994848272549Republican_Party_presidential_candidates,_2016Donald_Trumplink
4175594848272543Alan_SugarDonald_Trumplink
12033164848272489Fred_TrumpDonald_Trumplink
3039514848272426Vince_McMahonDonald_Trumplink
61910534848272413Jared_KushnerDonald_Trumplink
12952164848272412Trump_Tower_(New_York_City)Donald_Trumplink
65092784848272402TrumpDonald_Trumplink

YouTry: Top referrers to other 2016 US presidential candidate pages

'Donald_Trump', 'Bernie_Sanders', 'Hillary_Rodham_Clinton', 'Ted_Cruz'

%sql   
-- YouTry 
---
-- fill in the right sql query here

Load a visualization library

This code is copied after doing a live google search (by Michael Armbrust at Spark Summit East February 2016 shared from https://twitter.com/michaelarmbrust/status/699969850475737088). The d3ivan package is an updated version of the original package used by Michael Armbrust as it needed some TLC for Spark 2.2 on newer databricks notebook. These changes were kindly made by Ivan Sadikov from Middle Earth.

Show code
Warning: classes defined within packages cannot be redefined without a cluster restart. Compilation successful.
d3ivan.graphs.help()
d3ivan.graphs.force(
  height = 800,
  width = 800,
  clicks = sql("""
    SELECT 
      prev_title AS src,
      curr_title AS dest,
      n AS count FROM clicks
    WHERE 
      curr_title IN ('Donald_Trump', 'Bernie_Sanders', 'Hillary_Rodham_Clinton', 'Ted_Cruz') AND
      prev_id IS NOT NULL AND prev_title != 'Main_Page'
    ORDER BY n DESC
    LIMIT 20""").as[d3ivan.Edge])

Convert raw data to parquet

Apache Parquet is a columnar storage format available to any project in the Hadoop ecosystem, regardless of the choice of data processing framework, data model or programming language. It is a more efficient way to store data frames.

Show code
// Convert the DatFrame to a more efficent format to speed up our analysis
clickstream.
  write.
  mode(SaveMode.Overwrite).
  parquet("/datasets/wiki-clickstream") 

Load parquet file efficiently and quickly into a DataFrame

Now we can simply load from this parquet file next time instead of creating the RDD from the text file (much slower).

Also using parquet files to store DataFrames allows us to go between languages quickly in a a scalable manner.

val clicks = sqlContext.read.parquet("/datasets/wiki-clickstream")
clicks: org.apache.spark.sql.DataFrame = [prev_id: int, curr_id: int ... 4 more fields]
clicks.printSchema
root |-- prev_id: integer (nullable = true) |-- curr_id: integer (nullable = true) |-- n: integer (nullable = true) |-- prev_title: string (nullable = true) |-- curr_title: string (nullable = true) |-- type: string (nullable = true)
display(clicks)  // let's display this DataFrame
1548497851104Strategic_Arms_Limitation_TalksComprehensive_Nuclear-Test-Ban_Treatylink
null7851100other-bingComprehensive_Nuclear-Test-Ban_Treatyother
null785182other-otherComprehensive_Nuclear-Test-Ban_Treatyother
15580374785131Main_PageComprehensive_Nuclear-Test-Ban_Treatyother
1455516785121Mercury,_NevadaComprehensive_Nuclear-Test-Ban_Treatylink
305927851158Partial_Nuclear_Test_Ban_TreatyComprehensive_Nuclear-Test-Ban_Treatylink
21785785152Nuclear_weaponComprehensive_Nuclear-Test-Ban_Treatylink
1786856785125Onyx_RiverComprehensive_Nuclear-Test-Ban_Treatyother
337775785192Nuclear_weapons_testingComprehensive_Nuclear-Test-Ban_Treatylink
22158785112Nuclear_proliferationComprehensive_Nuclear-Test-Ban_Treatylink
594209785113MoruroaComprehensive_Nuclear-Test-Ban_Treatyother
14604785112Foreign_relations_of_IndiaComprehensive_Nuclear-Test-Ban_Treatyother
499076785135Force_de_dissuasionComprehensive_Nuclear-Test-Ban_Treatylink
1838300785110Fissile_Material_Cut-off_TreatyComprehensive_Nuclear-Test-Ban_Treatyother
6174686785123India–United_States_Civil_Nuclear_AgreementComprehensive_Nuclear-Test-Ban_Treatyother
3973438785147High-altitude_nuclear_explosionComprehensive_Nuclear-Test-Ban_Treatylink
2962287785148Comprehensive_Nuclear-Test-Ban_Treaty_OrganizationComprehensive_Nuclear-Test-Ban_Treatylink
9486785158List_of_international_environmental_agreementsComprehensive_Nuclear-Test-Ban_Treatyother
17671170785114List_of_weapons_of_mass_destruction_treatiesComprehensive_Nuclear-Test-Ban_Treatylink
41535782785120List_of_nuclear_weapons_tests_of_the_Soviet_UnionComprehensive_Nuclear-Test-Ban_Treatyother
null78514143other-googleComprehensive_Nuclear-Test-Ban_Treatyother
null7851218other-wikipediaComprehensive_Nuclear-Test-Ban_Treatyother
49750785122Arms_controlComprehensive_Nuclear-Test-Ban_Treatyother
346367851121996Comprehensive_Nuclear-Test-Ban_Treatylink
589108785143China_and_weapons_of_mass_destructionComprehensive_Nuclear-Test-Ban_Treatyother
585476785113Atomic_Weapons_EstablishmentComprehensive_Nuclear-Test-Ban_Treatylink
4008217851524List_of_states_with_nuclear_weaponsComprehensive_Nuclear-Test-Ban_Treatylink
704801785113List_of_United_States_treatiesComprehensive_Nuclear-Test-Ban_Treatyother
2189647785140List_of_nuclear_weapons_testsComprehensive_Nuclear-Test-Ban_Treatyother
14533785147IndiaComprehensive_Nuclear-Test-Ban_Treatylink
10737785111French_PolynesiaComprehensive_Nuclear-Test-Ban_Treatyother
38404161785113Historical_nuclear_weapons_stockpiles_and_nuclear_tests_by_countryComprehensive_Nuclear-Test-Ban_Treatyother
215176785110InfrasoundComprehensive_Nuclear-Test-Ban_Treatyother
589091785118France_and_weapons_of_mass_destructionComprehensive_Nuclear-Test-Ban_Treatyother
740008785146India_and_weapons_of_mass_destructionComprehensive_Nuclear-Test-Ban_Treatyother
31446831785117Preparatory_Commission_for_the_Comprehensive_Nuclear-Test-Ban_Treaty_OrganizationComprehensive_Nuclear-Test-Ban_Treatylink
162759785111Nevada_Test_SiteComprehensive_Nuclear-Test-Ban_Treatyother
53366785111Reconnaissance_satelliteComprehensive_Nuclear-Test-Ban_Treatyother
22165785120Nuclear_disarmamentComprehensive_Nuclear-Test-Ban_Treatylink
2824536785140Peaceful_nuclear_explosionComprehensive_Nuclear-Test-Ban_Treatylink
null7851504other-emptyComprehensive_Nuclear-Test-Ban_Treatyother
3003272785110Threshold_Test_Ban_TreatyComprehensive_Nuclear-Test-Ban_Treatylink
53136785120Weapon_of_mass_destructionComprehensive_Nuclear-Test-Ban_Treatyother
589015785115United_States_and_weapons_of_mass_destructionComprehensive_Nuclear-Test-Ban_Treatyother
null296228769other-emptyComprehensive_Nuclear-Test-Ban_Treaty_Organizationother
1454371296228713Robinson_Crusoe_IslandComprehensive_Nuclear-Test-Ban_Treaty_Organizationlink
31446831296228720Preparatory_Commission_for_the_Comprehensive_Nuclear-Test-Ban_Treaty_OrganizationComprehensive_Nuclear-Test-Ban_Treaty_Organizationlink
30669296228710Tunguska_eventComprehensive_Nuclear-Test-Ban_Treaty_Organizationlink
null296228719other-otherComprehensive_Nuclear-Test-Ban_Treaty_Organizationother
18507109296228725List_of_intergovernmental_organizationsComprehensive_Nuclear-Test-Ban_Treaty_Organizationlink
24644042296228715List_of_United_Nations_OrganizationsComprehensive_Nuclear-Test-Ban_Treaty_Organizationlink
415292352962287122014_AAComprehensive_Nuclear-Test-Ban_Treaty_Organizationlink
7851296228744Comprehensive_Nuclear-Test-Ban_TreatyComprehensive_Nuclear-Test-Ban_Treaty_Organizationlink
null296228745other-wikipediaComprehensive_Nuclear-Test-Ban_Treaty_Organizationother
null2962287364other-googleComprehensive_Nuclear-Test-Ban_Treaty_Organizationother
null24042157134other-googleComprehensive_Peace_Accordother
36982624042157145Nepalese_Civil_WarComprehensive_Peace_Accordlink
null2404215728other-emptyComprehensive_Peace_Accordother
null14712754135other-emptyComprehensive_Peace_Agreementother
321044741471275412Sudanese_conflict_in_South_Kordofan_and_Blue_NileComprehensive_Peace_Agreementlink
132654091471275447South_Sudanese_independence_referendum,_2011Comprehensive_Peace_Agreementlink
11315371471275490Second_Sudanese_Civil_WarComprehensive_Peace_Agreementlink
37585541471275417Lost_Boys_of_SudanComprehensive_Peace_Agreementlink
null14712754565other-googleComprehensive_Peace_Agreementother
null1471275427other-wikipediaComprehensive_Peace_Agreementother
null1471275412other-yahooComprehensive_Peace_Agreementother
138851961471275415AbyeiComprehensive_Peace_Agreementlink
null1471275449other-otherComprehensive_Peace_Agreementother
null1471275414other-bingComprehensive_Peace_Agreementother
323506761471275456South_SudanComprehensive_Peace_Agreementlink
18054681471275412United_Nations_Mission_in_SudanComprehensive_Peace_Agreementlink
274211471275455SudanComprehensive_Peace_Agreementlink
null651699839other-googleComprehensive_Performance_Assessmentother
null4000445817other-googleComprehensive_Physiologyother
null42674122other-googleComprehensive_Program_for_Socialist_Economic_Integrationother
38430742674131ComeconComprehensive_Program_for_Socialist_Economic_Integrationlink
2733145null20Kano_modelComprehensive_QFDredlink
8053588893563613Food_safetyComprehensive_Reviews_in_Food_Science_and_Food_Safetylink
264062893563623KombuchaComprehensive_Reviews_in_Food_Science_and_Food_Safetylink
null893563613other-emptyComprehensive_Reviews_in_Food_Science_and_Food_Safetyother
null893563623other-googleComprehensive_Reviews_in_Food_Science_and_Food_Safetyother
null29639511168other-googleComprehensive_Rural_Health_Projectother
null2963951113other-emptyComprehensive_Rural_Health_Projectother
null4138973084other-googleComprehensive_Social_Security_Assistanceother
null4138973033other-wikipediaComprehensive_Social_Security_Assistanceother
null4138973016other-emptyComprehensive_Social_Security_Assistanceother
4590784null16Educational_Records_BureauComprehensive_Testing_Programredlink
411765354201511369Geneva_interim_agreement_on_the_Iranian_nuclear_programComprehensive_agreement_on_Iranian_nuclear_programlink
3104774201511313Iran–United_States_relationsComprehensive_agreement_on_Iranian_nuclear_programlink
2419220242015113144P5+1Comprehensive_agreement_on_Iranian_nuclear_programlink
null4201511337other-otherComprehensive_agreement_on_Iranian_nuclear_programother
null42015113325other-emptyComprehensive_agreement_on_Iranian_nuclear_programother
103409604201511310Timeline_of_the_nuclear_program_of_IranComprehensive_agreement_on_Iranian_nuclear_programlink
7218074201511394Nuclear_program_of_IranComprehensive_agreement_on_Iranian_nuclear_programlink
146534201511396IranComprehensive_agreement_on_Iranian_nuclear_programlink
null42015113108other-wikipediaComprehensive_agreement_on_Iranian_nuclear_programother
null420151132299other-googleComprehensive_agreement_on_Iranian_nuclear_programother
null4832436121other-googleComprehensive_emergency_managementother
null483243611other-bingComprehensive_emergency_managementother
4831178483243624United_States_civil_defenseComprehensive_emergency_managementother

Showing the first 1000 rows.

DataFrame in python
%py
clicksPy = sqlContext.read.parquet("/datasets/wiki-clickstream")
%py
# in Python you need to put the object int its own line like this to get the type information
clicksPy 
Out[3]: DataFrame[prev_id: int, curr_id: int, n: int, prev_title: string, curr_title: string, type: string]
%py
clicksPy.show()
+--------+-------+---+--------------------+--------------------+-----+ | prev_id|curr_id| n| prev_title| curr_title| type| +--------+-------+---+--------------------+--------------------+-----+ | 154849| 7851|104|Strategic_Arms_Li...|Comprehensive_Nuc...| link| | null| 7851|100| other-bing|Comprehensive_Nuc...|other| | null| 7851| 82| other-other|Comprehensive_Nuc...|other| |15580374| 7851| 31| Main_Page|Comprehensive_Nuc...|other| | 1455516| 7851| 21| Mercury,_Nevada|Comprehensive_Nuc...| link| | 30592| 7851|158|Partial_Nuclear_T...|Comprehensive_Nuc...| link| | 21785| 7851| 52| Nuclear_weapon|Comprehensive_Nuc...| link| | 1786856| 7851| 25| Onyx_River|Comprehensive_Nuc...|other| | 337775| 7851| 92|Nuclear_weapons_t...|Comprehensive_Nuc...| link| | 22158| 7851| 12|Nuclear_prolifera...|Comprehensive_Nuc...| link| | 594209| 7851| 13| Moruroa|Comprehensive_Nuc...|other| | 14604| 7851| 12|Foreign_relations...|Comprehensive_Nuc...|other| | 499076| 7851| 35| Force_de_dissuasion|Comprehensive_Nuc...| link| | 1838300| 7851| 10|Fissile_Material_...|Comprehensive_Nuc...|other| | 6174686| 7851| 23|India–United_Stat...|Comprehensive_Nuc...|other| | 3973438| 7851| 47|High-altitude_nuc...|Comprehensive_Nuc...| link| | 2962287| 7851| 48|Comprehensive_Nuc...|Comprehensive_Nuc...| link| | 9486| 7851| 58|List_of_internati...|Comprehensive_Nuc...|other| |17671170| 7851| 14|List_of_weapons_o...|Comprehensive_Nuc...| link| |41535782| 7851| 20|List_of_nuclear_w...|Comprehensive_Nuc...|other| +--------+-------+---+--------------------+--------------------+-----+ only showing top 20 rows

Now you can continue from the original python notebook tweeted by Michael.

Recall from the beginning of this notebook that this python databricks notebook was used in the talk by Michael Armbrust at Spark Summit East February 2016 shared from https://twitter.com/michaelarmbrust/status/699969850475737088

(watch now, if you haven't already!)

Michael Armbrust Spark Summit East

You Try!

Try to laoad a DataFrame in R from the parquet file just as we did for python. Read the docs in databricks guide first:

And see the R example in the Programming Guide:

%r
library(SparkR)

# just a quick test
df <- createDataFrame(faithful)
head(df)

Attaching package: ‘SparkR’

The following object is masked _by_ ‘.GlobalEnv’:

    setLocalProperty

The following objects are masked from ‘package:stats’:

    cov, filter, lag, na.omit, predict, sd, var, window

The following objects are masked from ‘package:base’:

    as.data.frame, colnames, colnames<-, drop, endsWith, intersect,
    rank, rbind, sample, startsWith, subset, summary, transform, union

  eruptions waiting
1     3.600      79
2     1.800      54
3     3.333      74
4     2.283      62
5     4.533      85
6     2.883      55
%r
# Read in the Parquet file created above. Parquet files are self-describing so the schema is preserved.
# The result of loading a parquet file is also a DataFrame.
clicksR <- read.df("/datasets/wiki-clickstream", source = "parquet")
clicksR # in R you need to put the object int its own line like this to get the type information
SparkDataFrame[prev_id:int, curr_id:int, n:int, prev_title:string, curr_title:string, type:string]
%r
head(clicksR)
   prev_id curr_id  n               prev_title curr_title  type
1   334751   19271 24                 Cambodia   Mongolia  link
2      737   19271 24              Afghanistan   Mongolia other
3 18603746   19271 13                  Beijing   Mongolia  link
4  7770444   19271 10  Agriculture_in_Mongolia   Mongolia  link
5  7712057   19271 12 Christianity_in_Mongolia   Mongolia  link
6 16489766   19271 11      Cities_of_East_Asia   Mongolia  link
%r
display(clicksR)
1548497851104Strategic_Arms_Limitation_TalksComprehensive_Nuclear-Test-Ban_Treatylink
null7851100other-bingComprehensive_Nuclear-Test-Ban_Treatyother
null785182other-otherComprehensive_Nuclear-Test-Ban_Treatyother
15580374785131Main_PageComprehensive_Nuclear-Test-Ban_Treatyother
1455516785121Mercury,_NevadaComprehensive_Nuclear-Test-Ban_Treatylink
305927851158Partial_Nuclear_Test_Ban_TreatyComprehensive_Nuclear-Test-Ban_Treatylink
21785785152Nuclear_weaponComprehensive_Nuclear-Test-Ban_Treatylink
1786856785125Onyx_RiverComprehensive_Nuclear-Test-Ban_Treatyother
337775785192Nuclear_weapons_testingComprehensive_Nuclear-Test-Ban_Treatylink
22158785112Nuclear_proliferationComprehensive_Nuclear-Test-Ban_Treatylink
594209785113MoruroaComprehensive_Nuclear-Test-Ban_Treatyother
14604785112Foreign_relations_of_IndiaComprehensive_Nuclear-Test-Ban_Treatyother
499076785135Force_de_dissuasionComprehensive_Nuclear-Test-Ban_Treatylink
1838300785110Fissile_Material_Cut-off_TreatyComprehensive_Nuclear-Test-Ban_Treatyother
6174686785123India–United_States_Civil_Nuclear_AgreementComprehensive_Nuclear-Test-Ban_Treatyother
3973438785147High-altitude_nuclear_explosionComprehensive_Nuclear-Test-Ban_Treatylink
2962287785148Comprehensive_Nuclear-Test-Ban_Treaty_OrganizationComprehensive_Nuclear-Test-Ban_Treatylink
9486785158List_of_international_environmental_agreementsComprehensive_Nuclear-Test-Ban_Treatyother
17671170785114List_of_weapons_of_mass_destruction_treatiesComprehensive_Nuclear-Test-Ban_Treatylink
41535782785120List_of_nuclear_weapons_tests_of_the_Soviet_UnionComprehensive_Nuclear-Test-Ban_Treatyother
null78514143other-googleComprehensive_Nuclear-Test-Ban_Treatyother
null7851218other-wikipediaComprehensive_Nuclear-Test-Ban_Treatyother
49750785122Arms_controlComprehensive_Nuclear-Test-Ban_Treatyother
346367851121996Comprehensive_Nuclear-Test-Ban_Treatylink
589108785143China_and_weapons_of_mass_destructionComprehensive_Nuclear-Test-Ban_Treatyother
585476785113Atomic_Weapons_EstablishmentComprehensive_Nuclear-Test-Ban_Treatylink
4008217851524List_of_states_with_nuclear_weaponsComprehensive_Nuclear-Test-Ban_Treatylink
704801785113List_of_United_States_treatiesComprehensive_Nuclear-Test-Ban_Treatyother
2189647785140List_of_nuclear_weapons_testsComprehensive_Nuclear-Test-Ban_Treatyother
14533785147IndiaComprehensive_Nuclear-Test-Ban_Treatylink
10737785111French_PolynesiaComprehensive_Nuclear-Test-Ban_Treatyother
38404161785113Historical_nuclear_weapons_stockpiles_and_nuclear_tests_by_countryComprehensive_Nuclear-Test-Ban_Treatyother
215176785110InfrasoundComprehensive_Nuclear-Test-Ban_Treatyother
589091785118France_and_weapons_of_mass_destructionComprehensive_Nuclear-Test-Ban_Treatyother
740008785146India_and_weapons_of_mass_destructionComprehensive_Nuclear-Test-Ban_Treatyother
31446831785117Preparatory_Commission_for_the_Comprehensive_Nuclear-Test-Ban_Treaty_OrganizationComprehensive_Nuclear-Test-Ban_Treatylink
162759785111Nevada_Test_SiteComprehensive_Nuclear-Test-Ban_Treatyother
53366785111Reconnaissance_satelliteComprehensive_Nuclear-Test-Ban_Treatyother
22165785120Nuclear_disarmamentComprehensive_Nuclear-Test-Ban_Treatylink
2824536785140Peaceful_nuclear_explosionComprehensive_Nuclear-Test-Ban_Treatylink
null7851504other-emptyComprehensive_Nuclear-Test-Ban_Treatyother
3003272785110Threshold_Test_Ban_TreatyComprehensive_Nuclear-Test-Ban_Treatylink
53136785120Weapon_of_mass_destructionComprehensive_Nuclear-Test-Ban_Treatyother
589015785115United_States_and_weapons_of_mass_destructionComprehensive_Nuclear-Test-Ban_Treatyother
null296228769other-emptyComprehensive_Nuclear-Test-Ban_Treaty_Organizationother
1454371296228713Robinson_Crusoe_IslandComprehensive_Nuclear-Test-Ban_Treaty_Organizationlink
31446831296228720Preparatory_Commission_for_the_Comprehensive_Nuclear-Test-Ban_Treaty_OrganizationComprehensive_Nuclear-Test-Ban_Treaty_Organizationlink
30669296228710Tunguska_eventComprehensive_Nuclear-Test-Ban_Treaty_Organizationlink
null296228719other-otherComprehensive_Nuclear-Test-Ban_Treaty_Organizationother
18507109296228725List_of_intergovernmental_organizationsComprehensive_Nuclear-Test-Ban_Treaty_Organizationlink
24644042296228715List_of_United_Nations_OrganizationsComprehensive_Nuclear-Test-Ban_Treaty_Organizationlink
415292352962287122014_AAComprehensive_Nuclear-Test-Ban_Treaty_Organizationlink
7851296228744Comprehensive_Nuclear-Test-Ban_TreatyComprehensive_Nuclear-Test-Ban_Treaty_Organizationlink
null296228745other-wikipediaComprehensive_Nuclear-Test-Ban_Treaty_Organizationother
null2962287364other-googleComprehensive_Nuclear-Test-Ban_Treaty_Organizationother
null24042157134other-googleComprehensive_Peace_Accordother
36982624042157145Nepalese_Civil_WarComprehensive_Peace_Accordlink
null2404215728other-emptyComprehensive_Peace_Accordother
null14712754135other-emptyComprehensive_Peace_Agreementother
321044741471275412Sudanese_conflict_in_South_Kordofan_and_Blue_NileComprehensive_Peace_Agreementlink
132654091471275447South_Sudanese_independence_referendum,_2011Comprehensive_Peace_Agreementlink
11315371471275490Second_Sudanese_Civil_WarComprehensive_Peace_Agreementlink
37585541471275417Lost_Boys_of_SudanComprehensive_Peace_Agreementlink
null14712754565other-googleComprehensive_Peace_Agreementother
null1471275427other-wikipediaComprehensive_Peace_Agreementother
null1471275412other-yahooComprehensive_Peace_Agreementother
138851961471275415AbyeiComprehensive_Peace_Agreementlink
null1471275449other-otherComprehensive_Peace_Agreementother
null1471275414other-bingComprehensive_Peace_Agreementother
323506761471275456South_SudanComprehensive_Peace_Agreementlink
18054681471275412United_Nations_Mission_in_SudanComprehensive_Peace_Agreementlink
274211471275455SudanComprehensive_Peace_Agreementlink
null651699839other-googleComprehensive_Performance_Assessmentother
null4000445817other-googleComprehensive_Physiologyother
null42674122other-googleComprehensive_Program_for_Socialist_Economic_Integrationother
38430742674131ComeconComprehensive_Program_for_Socialist_Economic_Integrationlink
2733145null20Kano_modelComprehensive_QFDredlink
8053588893563613Food_safetyComprehensive_Reviews_in_Food_Science_and_Food_Safetylink
264062893563623KombuchaComprehensive_Reviews_in_Food_Science_and_Food_Safetylink
null893563613other-emptyComprehensive_Reviews_in_Food_Science_and_Food_Safetyother
null893563623other-googleComprehensive_Reviews_in_Food_Science_and_Food_Safetyother
null29639511168other-googleComprehensive_Rural_Health_Projectother
null2963951113other-emptyComprehensive_Rural_Health_Projectother
null4138973084other-googleComprehensive_Social_Security_Assistanceother
null4138973033other-wikipediaComprehensive_Social_Security_Assistanceother
null4138973016other-emptyComprehensive_Social_Security_Assistanceother
4590784null16Educational_Records_BureauComprehensive_Testing_Programredlink
411765354201511369Geneva_interim_agreement_on_the_Iranian_nuclear_programComprehensive_agreement_on_Iranian_nuclear_programlink
3104774201511313Iran–United_States_relationsComprehensive_agreement_on_Iranian_nuclear_programlink
2419220242015113144P5+1Comprehensive_agreement_on_Iranian_nuclear_programlink
null4201511337other-otherComprehensive_agreement_on_Iranian_nuclear_programother
null42015113325other-emptyComprehensive_agreement_on_Iranian_nuclear_programother
103409604201511310Timeline_of_the_nuclear_program_of_IranComprehensive_agreement_on_Iranian_nuclear_programlink
7218074201511394Nuclear_program_of_IranComprehensive_agreement_on_Iranian_nuclear_programlink
146534201511396IranComprehensive_agreement_on_Iranian_nuclear_programlink
null42015113108other-wikipediaComprehensive_agreement_on_Iranian_nuclear_programother
null420151132299other-googleComprehensive_agreement_on_Iranian_nuclear_programother
null4832436121other-googleComprehensive_emergency_managementother
null483243611other-bingComprehensive_emergency_managementother
4831178483243624United_States_civil_defenseComprehensive_emergency_managementother

Showing the first 1000 rows.