About
This course provides a comprehensive, hands-on introduction to modern data collection, cleaning, analysis, and basic modeling techniques using Python. Over the span of eight weeks, participants will learn how to retrieve data from diverse online sources such as RSS feeds, Reddit, and Wikipedia through APIs, as well as how to extract structured information from webpages. They will gain experience working with survey data and external files, processing and cleaning text data, and identifying meaningful keywords and sentiments. Students will also learn how to develop reliable metrics, visualize data to reveal patterns, and perform introductory statistical tests and regression analyses to gain deeper insights.
Learning objectives
- Data Acquisition: Learn to access and retrieve data from various online sources, including RSS feeds, websites, and APIs, to build robust datasets for analysis.
- Data Preparation: Develop skills to clean, organise, and preprocess both numerical and textual data, ensuring datasets are analysis-ready.
- Textual Analysis: Understand how to process text to extract key concepts, identify sentiment, and prepare textual information for downstream analysis.
- Metric Development: Learn to create meaningful and reliable metrics to quantitatively describe and evaluate datasets.
- Visualisation and Pattern Recognition: Gain proficiency in using Python libraries for data visualisation and exploratory data analysis to uncover patterns and trends.
- Basic Statistical Analysis: Perform fundamental statistical tests and implement simple regression models to derive actionable insights and support data-driven decisions.
This course consists of weekly lectures during Hilary term. Each class is divided into two parts: a shorter, lecture-style substantive session in which the week's key themes and concepts are explored, and a longer ‘lab’ session in which students are assigned a number of practical programming challenges. Data sets, code samples, and problems for use in these practical sessions will be distributed before class.
Class style: Part lecture, part interactive programming lab to build technical skills
Mode of course delivery: one-hour technical lesson, one-hour discussion and programming lab. Flexible format to accommodate different required timings.
Additional TA sessions: Weekly drop-in Q&A or surgery
Weekly Topics
- Week 1 – Extracting information (RSS feeds)
- Week 2 – Retrieving data from online forums (Reddit/API)
- Week 3 – Reading survey data and external files
- Week 4 – Scraping data
- Week 5 – Cleaning data and identifying keywords and sentiment
- Week 6 – Creating good and reliable metrics
- Week 7 – Showing patterns in data
- Week 8 – Performing simple tests and regression analysis
Prerequisites: Students are expected to have taken Introduction to Python or to be able to demonstrate equivalent experience.