The course will teach computational complexity, how to profile Python code, and how to increase the computational efficiency of Python code. It will also cover parallel and distributed computing approaches, including issues such as race conditions, and discuss data storage and retrieval techniques (including SQL and NoSQL). The course includes lab sessions in which students will gain hands-on experience handling large-scale, heterogeneous data on a server and reducing, transforming, and otherwise manipulating the data in order to answer a social science question.

Key Themes

  • Unix terminal basics (SSH, error handling, cron, logging)
  • Computational limits, computational complexity, Big-O notation, and profiling code
  • Parallelization, race conditions, and distributed computing
  • MapReduce and PySpark
  • Data storage techniques
  • Cython
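As a minimal sketch of the profiling theme above, the snippet below uses Python's standard-library cProfile and pstats modules to compare two ways of computing the same sum; the function names (slow_sum, fast_sum) are illustrative, not from the course materials.

```python
import cProfile
import io
import pstats

def slow_sum(n):
    # O(n): loop-based summation
    total = 0
    for i in range(n):
        total += i
    return total

def fast_sum(n):
    # O(1): closed-form equivalent of the loop above
    return n * (n - 1) // 2

profiler = cProfile.Profile()
profiler.enable()
slow_sum(1_000_000)
fast_sum(1_000_000)
profiler.disable()

# Report the most expensive calls, sorted by cumulative time
stream = io.StringIO()
pstats.Stats(profiler, stream=stream).sort_stats("cumulative").print_stats(5)
print(stream.getvalue())
```

The profiler's per-call timings make the asymptotic difference between the two implementations visible in practice, which is the kind of inefficiency the course's profiling sessions aim to surface.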

Learning Objectives

At the end of this course students will:

  • Design and execute a long-running process to capture data, implementing appropriate logging, error handling, and scheduling
  • Understand common limits on wrangling data (memory, disk, CPU), devise an analysis plan that takes these limits into account, and profile code to identify inefficiencies
  • Understand the MapReduce approach and be able to develop an analysis plan using multiple MapReduce cycles to answer a basic research question
  • Be able to implement and execute the filtering, aggregation, and extraction of data from a large-scale, heterogeneous dataset using Hadoop, Spark, or another appropriate tool
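A minimal sketch of the logging and error handling the first objective describes for a long-running data-capture process; fetch_record is a hypothetical stand-in for whatever capture step the process performs, and scheduling (e.g. via cron) would wrap a script like this.

```python
import logging

logging.basicConfig(
    level=logging.INFO,
    format="%(asctime)s %(levelname)s %(message)s",
)
log = logging.getLogger("capture")

def fetch_record(i):
    # Hypothetical capture step; fails on one input to show error handling
    if i == 2:
        raise ValueError(f"bad record {i}")
    return {"id": i}

captured = []
for i in range(4):
    try:
        captured.append(fetch_record(i))
        log.info("captured record %d", i)
    except ValueError:
        # Log the failure with its traceback and continue,
        # rather than letting one bad record crash the whole run
        log.exception("failed to capture record %d", i)

print(f"captured {len(captured)} of 4 records")
```

Logging failures and moving on, rather than exiting, is what keeps an unattended scheduled job productive overnight; the log then serves as the audit trail for which records need re-fetching.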
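To illustrate the MapReduce approach named in the objectives, the sketch below expresses a word count as explicit map, shuffle, and reduce phases in plain Python; in the course the same logic would run distributed under Hadoop or Spark, and the example documents here are invented for illustration.

```python
from collections import defaultdict

documents = [
    "big data needs big tools",
    "data wrangling at scale",
]

# Map phase: emit a (word, 1) pair for every word in every document
mapped = [(word, 1) for doc in documents for word in doc.split()]

# Shuffle phase: group emitted values by key
grouped = defaultdict(list)
for key, value in mapped:
    grouped[key].append(value)

# Reduce phase: sum the counts for each word
counts = {word: sum(values) for word, values in grouped.items()}
print(counts)
```

Chaining several such cycles, where the reduced output of one becomes the mapped input of the next, is the "multiple MapReduce cycles" analysis plan the objective refers to.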
This page was last modified on 3 September 2019