The course will teach computational complexity and Big-O notation, cover parallel and distributed computing approaches, including issues such as race conditions (with lab sessions in which students will gain hands-on experience with MapReduce, Hadoop, and Spark), and discuss data storage and retrieval techniques (including SQL and NoSQL). The goal is for students to handle large-scale, heterogeneous data on a server competently and to reduce, transform, and otherwise manipulate that data in order to answer a social science question.
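
As an illustration of the race-condition issue mentioned above, the following is a minimal sketch in Python, not course material. Several threads increment a shared counter; the read-modify-write step is not atomic, so without the lock updates could be lost. The function name `locked_total` and its parameters are hypothetical and chosen only for this example.

```python
import threading


def locked_total(n_threads=4, n_increments=10_000):
    """Increment a shared counter from several threads, guarded by a lock."""
    total = {"value": 0}          # shared state visible to every worker
    lock = threading.Lock()

    def worker():
        for _ in range(n_increments):
            # Without the lock, two threads could read the same old value
            # and one of the two increments would be lost.
            with lock:
                total["value"] += 1

    threads = [threading.Thread(target=worker) for _ in range(n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return total["value"]


result = locked_total(4, 1_000)
```

With the lock in place, the result always equals the number of threads times the number of increments per thread. Removing the `with lock:` line makes the outcome depend on thread scheduling.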

Key Themes

  • Long-term data capture (SSH, error handling, cron, logging)
  • Computational limits, computational complexity, Big-O notation, and profiling code
  • Parallelization, race conditions, and distributed computing
  • MapReduce
  • Spark
  • NoSQL and SQL data stores (MongoDB, SQL/sharding, Hive: UDFs in Java)
  • Cython and bindings for C libraries
  • GPU computing
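
To make the MapReduce theme above more concrete, here is a minimal single-machine sketch in plain Python. It is not Hadoop or Spark code; it only imitates the three conceptual phases (map, shuffle, reduce) on a toy word count. The function names and the sample input are hypothetical.

```python
from collections import defaultdict


def map_phase(record):
    """Map: emit one (word, 1) pair for every word in a line of text."""
    return [(word.lower(), 1) for word in record.split()]


def shuffle(pairs):
    """Shuffle: group all emitted values by their key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: combine the values for one key into a single count."""
    return key, sum(values)


def word_count(records):
    pairs = [pair for record in records for pair in map_phase(record)]
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())


# Hypothetical toy input standing in for a large text corpus.
lines = ["big data big questions", "data at scale"]
counts = word_count(lines)
```

In a real cluster the map and reduce calls would run on different machines and the shuffle would move data across the network; the sketch keeps everything in one process so that the data flow is easy to follow.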

Learning Objectives

At the end of this course students will…

  • Design and execute a long-running data-capture process that implements appropriate logging, error handling, and scheduling
  • Understand common limits on wrangling data (memory, disk, CPU), be able to devise an analysis plan that takes these limits into account, and be able to profile code to identify inefficiencies
  • Understand the MapReduce approach and be able to develop an analysis plan that uses multiple MapReduce cycles to answer a basic research question
  • Be able to implement and execute the filtering, aggregation, and extraction of data from a large-scale, heterogeneous dataset using Hadoop, Spark, or another appropriate tool
  • Understand the differences and relative merits of SQL and NoSQL data stores and be able to choose an appropriate data storage approach for a given research question and data source
  • Be able to write and execute basic SQL queries for MySQL in Python and be able to capture the results into Python variables for further manipulation
  • Be able to write and execute basic queries for MongoDB (a NoSQL data store) in Python and be able to capture the results into Python variables for further manipulation
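
The SQL objective above (running a query from Python and capturing the results in Python variables) might look roughly like the sketch below. It is an assumption-laden illustration: the standard-library `sqlite3` module stands in for a MySQL server so that the example runs without any setup, and the table `posts` and its columns are hypothetical. MySQL drivers for Python follow the same connect, execute, fetch pattern, but typically use `%s` rather than `?` as the parameter placeholder.

```python
import sqlite3

# Assumption: an in-memory SQLite database stands in for a MySQL server.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE posts (author TEXT, likes INTEGER)")
conn.executemany(
    "INSERT INTO posts (author, likes) VALUES (?, ?)",
    [("alice", 3), ("bob", 5), ("alice", 7)],
)

# Run a parameterised query and capture the result set.
cursor = conn.execute(
    "SELECT author, SUM(likes) FROM posts "
    "WHERE likes >= ? GROUP BY author ORDER BY author",
    (1,),
)
rows = cursor.fetchall()          # list of (author, total_likes) tuples
likes_by_author = dict(rows)      # plain Python dict for further manipulation
conn.close()
```

Passing values as parameters, rather than formatting them into the query string, avoids SQL injection and quoting mistakes.
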
This page was last modified on 16 January 2019