Big Data Analysis for Humanities and Social Sciences

August 26, 2016, King’s Digital Lab, King’s College London

prepared by Raazesh Sainudiin


This event is hosted by King’s Digital Lab.


The workshop will be led by Raaz Sainudiin. Raaz completed a PhD in Statistics at Cornell University in 2005 and was a Research Fellow of the Royal Commission for the Exhibition of 1851 at the Statistics Department of Oxford University until 2007. He is currently a Senior Lecturer in the School of Mathematics and Statistics at the University of Canterbury, Christchurch, NZ. His recent excursions into scalable data science are funded by the Databricks Academic Partners Program.


This workshop will introduce elements of Scalable Data Science for humanities and social science researchers using Apache Spark over a Databricks shard. It will guide attendees through hands-on analysis of US State of the Union addresses, Wikipedia click-streams, live Tweets, and the Old Bailey Online dataset.

What we’ll do

The workshop will introduce the basics of Apache Spark, an in-memory distributed computing framework, in four parts:

* Basic map-reduce operations via Spark’s resilient distributed datasets (RDDs), applied to a word count of US State of the Union addresses (first 40 minutes).
* Data exploration via no-SQL queries using Spark’s DataFrames, applied to Wikipedia click-streams (30-40 minutes).
* Spark Streaming for filtering live tweets and finding the top hashtags (30-40 minutes).
* Loading, XML parsing, and the beginnings of exploration of the Old Bailey Online dataset (40 minutes, including discussion).

There will be a 20-minute break during the workshop.
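For readers new to the idea, the map-reduce word count mentioned above can be sketched in plain Python, without Spark. This is only an illustration of the pattern, not the workshop's own code and not Spark's API: the function names and the two sample lines are hypothetical placeholders, not text from an actual State of the Union address.

```python
import re
from collections import defaultdict


def tokenize(line):
    """Split a line into lower-case words (letters and apostrophes only)."""
    return re.findall(r"[a-z']+", line.lower())


def map_phase(lines):
    """Map step: emit a (word, 1) pair for every word in every line."""
    return [(word, 1) for line in lines for word in tokenize(line)]


def reduce_phase(pairs):
    """Reduce step: sum the counts of all pairs that share the same word."""
    counts = defaultdict(int)
    for word, n in pairs:
        counts[word] += n
    return dict(counts)


def word_count(lines):
    """Chain the map and reduce steps into a single word count."""
    return reduce_phase(map_phase(lines))


# Hypothetical placeholder input, standing in for lines of a speech.
sample_lines = ["The state of the union", "The union is strong"]

counts = word_count(sample_lines)

# Most frequent words first; ties are broken alphabetically.
top = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))[:3]
print(top)
```

In a distributed framework the same two steps run in parallel over partitions of the data, which is what lets the pattern scale beyond a single machine.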

Who should attend?

This workshop is for researchers in the humanities and social sciences who would like an introduction to big data analysis using industry-standard tools. It will be technical and is best suited to people with a good grasp of programming. More advanced users will be able to extend themselves. Non-programmers interested in seeing ‘under the hood’ of data analysis, perhaps in order to collaborate more effectively with technical colleagues, are also welcome.

What you need to bring and do

Bring a laptop if you have one. Access to eduroam and The Cloud will be available. Ideally you will have signed up for a Databricks Community Edition account before the day so you can follow along.

Please get on the waiting list for Databricks Community Edition as soon as possible:


* Friday, 26 August 2016 from 09:00 to 12:00 (BST) 


* Virginia Woolf Building, Room 1.34, 22 Kingsway, London, WC2B 6LE, United Kingdom

If you want to self-learn then,