“This is an early-stage preparing course for those going on to do some form of data science. It allows students to go on to courses in visualisation and big data analysis with a solid foundation, that will enable them to do much more than they would otherwise.”

Big Data tends to be dirty data. It is rare that a researcher has privileged access to a data set that is already in the correct form for analysis. Years of experience guiding students have shown that cleaning and shaping data remains one of the most significant hurdles in moving from idea to execution. This course will familiarize the student with a variety of techniques for cleaning and shaping data. We will move through parsing text files, cleaning JSON, aggregating data in a database, and reading from that database.
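As a rough illustration of the last two steps, aggregating data in a database and reading it back, here is a minimal sketch using Python's built-in sqlite3 module. The table name, column names and records are hypothetical and chosen only for illustration; the course itself may use a different language or database.

```python
import sqlite3

# Hypothetical cleaned records: (user, word_count). Not real course data.
records = [("ana", 12), ("ben", 7), ("ana", 5)]

# An in-memory database keeps the sketch self-contained.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE posts (user TEXT, word_count INTEGER)")
conn.executemany("INSERT INTO posts VALUES (?, ?)", records)
conn.commit()

# Aggregate inside the database, then read the summary back into Python.
totals = conn.execute(
    "SELECT user, COUNT(*), SUM(word_count) "
    "FROM posts GROUP BY user ORDER BY user"
).fetchall()
conn.close()

for user, n_posts, n_words in totals:
    print(user, n_posts, n_words)
```

Letting the database do the grouping means only the summary rows, not every raw record, need to be read back for analysis.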

This course will not be directly exploring substantive questions with the data and thus should be paired with a substantive course of interest to the student. Instead we will focus almost exclusively on the skills required to manage data across contexts. Similarly, this course will not provide the students with the tools to analyse or visualize data. While this might appear to be a significant oversight, it is often the case that visualization and analysis are rather straightforward once data has been cleaned. As such, this course focuses on the difficult chasm between data access and data analysis that is often overlooked.

The purpose of this course is to familiarize the student with a variety of approaches for processing pre-collected data, a practice colloquially referred to as ‘data wrangling’. This involves filtering, shaping, and preparing data for analysis. Each week the student is given a short lecture, a data set, and a series of transformations to apply to that data. The remainder of the class will focus on helping the student progress through the task.

Outcomes: By the end of the course, students will: be able to parse text files in a manner suitable for Natural Language Processing; be able to reshape JSON data such as tweets, Facebook statuses and other API-based data into rectangular structures amenable to analysis; understand how to apply regular expressions to strings; and appreciate, understand and tame Unicode data such as © and ™.
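To give a flavour of these outcomes, the following is a minimal sketch in Python that reshapes nested JSON records into flat rows, pulls hashtags out with a regular expression, and normalizes Unicode text. The records, the field names (`id`, `user`, `text`) and the `flatten` helper are hypothetical, not a real API response or course material.

```python
import json
import re
import unicodedata

# Hypothetical API-style records; field names are illustrative only.
# The raw string keeps the \u escapes intact so the JSON parser decodes them.
raw = r'''[
  {"id": 1, "user": {"name": "Ana"}, "text": "Caf\u00e9 \u00a9 2017 #data #wrangling"},
  {"id": 2, "user": {"name": "Ben"}, "text": "Brand\u2122 launch, no tags"}
]'''

# A hashtag is '#' followed by one or more word characters.
HASHTAG = re.compile(r"#(\w+)")


def flatten(record):
    """Turn one nested record into a flat row (a dict of scalar values)."""
    text = record["text"]
    return {
        "id": record["id"],
        "user_name": record["user"]["name"],
        # NFKC folds compatibility characters into plain equivalents.
        "text_nfkc": unicodedata.normalize("NFKC", text),
        "hashtags": ",".join(HASHTAG.findall(text)),
    }


rows = [flatten(r) for r in json.loads(raw)]

for row in rows:
    print(row)
```

Each row now has the same four columns, which is the rectangular shape most analysis tools expect. Whether NFKC is the right normalization form depends on the analysis, since it rewrites some symbols rather than preserving them.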

This page was last modified on 15 March 2017