Language is a rich source of information about people’s knowledge and opinions, and also about their goals, social identities, and social alliances. Large-scale analysis of language data is therefore invaluable for research in the social sciences. In this option (elective), students will learn to use statistical natural language processing tools to analyse the content of documents and so address a wide variety of research questions.

This course will develop conceptual and technical tools for large-scale analysis of linguistic data such as document collections, transcripts, and blogs. The first part of the course introduces the statistical structure of the lexicon and models for text creation, including the baseline Naïve Bag of Words model as well as more realistic models that include effects of social and pragmatic context. We then turn to algorithms for clustering, classifying, and discriminating different types of documents on the basis of the words and word sequences that they contain. These algorithms are applied to characterise the topics of different documents as well as the socio-indexical traits of speakers/authors. Lastly, we bring these ideas together in tools for analysing the spread of memes and opinions through repeated interactions in linguistic communities.
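
As a rough illustration of the bag-of-words idea mentioned above, the sketch below reduces a text to word counts and discards word order. It is not taken from the course materials: the function name `bag_of_words`, the simple tokenisation rule, and the two toy sentences are all assumptions made for the example.

```python
from collections import Counter
import re


def bag_of_words(text):
    """Return word counts for a text, ignoring word order (hypothetical helper)."""
    # Assumed tokenisation: lowercase, then keep runs of letters and apostrophes.
    tokens = re.findall(r"[a-z']+", text.lower())
    return Counter(tokens)


# Toy documents (invented for illustration): same words, different order.
docs = {
    "a": "the cat sat on the mat",
    "b": "the mat sat on the cat",
}

bags = {name: bag_of_words(text) for name, text in docs.items()}

# Under a bag-of-words model the two documents are indistinguishable,
# because only the counts survive.
same_bag = bags["a"] == bags["b"]
print(bags["a"])
print("Identical bags:", same_bag)
```

Because the two toy sentences receive identical representations, the sketch also shows why such a model is only a baseline: anything carried by word order or context is lost.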

Learning Outcomes

At the end of the course, students will:

  • Understand the basic statistical properties of words and word sequences in natural language, and the typical challenges they present.
  • Learn how topical and social context affect word frequencies at different time scales.
  • Learn methods for document clustering and classification.
  • Acquire tools for analysing lexical dynamics to answer questions about conceptual and social trends.

This page was last modified on 3 October 2018.