We recently announced the start of an exciting new research project that will involve the use of topic modelling in understanding the patterns in submitted stories to the Everyday Sexism website. Here, we briefly explain our text analysis approach, “topic modelling”.
At its very core, topic modelling is a technique that seeks to automatically discover the topics contained within a group of documents. ‘Documents’ in this context could refer to text items as lengthy as individual books, or as short as sentences within a paragraph. Let’s take the idea of sentences-as-documents as an example:
- Document 1: I like to eat kippers for breakfast.
- Document 2: I love all animals, but kittens are the cutest.
- Document 3: My kitten eats kippers too.
Assuming that each sentence contains a mixture of different topics (and that a ‘topic’ can be understood as a collection of words (of any part of speech) that have different probabilities of appearance in passages discussing the topic), how does the topic modelling algorithm ‘discover’ the topics within these sentences?
The algorithm is initiated by setting the number of topics that it needs to extract. Of course, it is hard to guess this number without having an insight on the topics, but one can think of this as a resolution tuning parameter. The smaller the number of topics is set, the more general the bag of words in each topic would be, and the looser the connections between them.
The algorithm loops through all of the words in each document, assigning every word to one of our topics in a temporary and semi-random manner. This initial assignment is arbitrary and it is easy to show that different initializations lead to the same results in long run. Once each word has been assigned a temporary topic, the algorithm then re-iterates through each word in each document to update the topic assignment using two criteria: 1) How prevalent is the word in question across topics? And 2) How prevalent are the topics in the document?
To quantify these two, the algorithm calculates the likelihood of the words appearing in each document assuming the assignment of words to topics and topics to documents.
Of course words can appear in different topics and more than one topic can appear in a document. But the iterative algorithm seeks to maximize the self-consistency of the assignment by maximizing the likelihood of the observed word-document statistics.
We can illustrate this process and its outcome by going back to our example. A topic modelling approach might use the process above to discover the following topics across our documents:
- Document 1: I like to eat kippers for breakfast. [100% Topic A]
- Document 2: I love all animals, but kittens are the cutest. [100% Topic B]
- Document 3: My kitten eats kippers too. [67% Topic A, 33% Topic B]
Topic modelling defines each topic as a so-called ‘bag of words’, but it is the researcher’s responsibility to decide upon an appropriate label for each topic based on their understanding of language and context. Going back to our example, the algorithm might classify the underlined words under Topic A, which we could then label as ‘food’ based on our understanding of what the words mean. Similarly the italicised words might be classified under a separate topic, Topic B, which we could label ‘animals’. In this simple example the word “eat” has appeared in a sentence dominated by Topic A, but also in a sentence with some association to Topic B. Therefore it can also be seen as a connector of the two topics. Of course animals eat too and they like food!
We are going to use a similar approach to first extract the main topics reflected on the reports to the Everyday Sexism Project website and extract the relation between the sexism-related topics and concepts based on the overlap between the bags of words of each topic. Finally we can also look into the co-appearance of topics in the same document. This way we try to draw a linguistic picture of the more than 100,000 submitted reports.
As ever, be sure to check back for further updates on our progress!
Note: This post was originally published on the Policy & Internet blog on . It might have been updated since then in its original location. The post gives the views of the author(s), and not necessarily the position of the Oxford Internet Institute.