I have a new draft paper out with my colleague Tom Nicholls, entitled “Understanding news story chains using information retrieval and network clustering techniques”. In it we address what we see as an important technical challenge in news media research: how to group together articles that all address the same individual news event. This challenge is unmet by most current applications of unsupervised machine learning to the news, which tend to focus on the broader (also important!) problem of grouping articles into topic categories. It is in general a difficult problem, as we are looking for what are typically small “chains” of content on the same event (e.g. four or five different articles) within a corpus of tens of thousands of articles, most of which are unrelated to each other.
Our approach draws on algorithms and insights from the fields of both information retrieval [IR] and network clustering to develop a novel unsupervised method of news story chain detection. IR techniques (which are used to build things like search engines) in particular haven’t been used much in the social sciences, where the focus has been more on machine learning. But these algorithms turned out to be much closer to our problem, as connecting a small number of news stories is quite similar to the task of searching a huge corpus of documents in response to a specific user query.
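As a rough illustration of how retrieval-style similarity scoring and network clustering can fit together, here is a minimal sketch. It is not the algorithm from the paper: the TF-IDF weighting, the cosine similarity threshold of 0.15, the connected-components grouping, the toy headlines and all function names are assumptions made purely for illustration.

```python
import math
import re
from collections import Counter
from itertools import combinations

# Toy headlines, invented for illustration only.
EXAMPLE_DOCS = [
    "Minister resigns over expenses scandal",
    "Expenses scandal forces minister to resign",
    "Storm floods coastal towns in Cornwall",
    "Cornwall towns flooded as storm hits coast",
    "Football club signs new striker",
]


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def tfidf_vectors(docs):
    """Return one sparse TF-IDF vector (a dict of term -> weight) per document."""
    tokenized = [tokenize(d) for d in docs]
    n = len(docs)
    # Document frequency: in how many documents does each term appear?
    df = Counter(term for toks in tokenized for term in set(toks))
    vectors = []
    for toks in tokenized:
        tf = Counter(toks)
        vectors.append({t: c * math.log(n / df[t]) for t, c in tf.items()})
    return vectors


def cosine(a, b):
    """Cosine similarity between two sparse vectors."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def story_chains(docs, threshold=0.15):
    """Link documents whose similarity meets the threshold, then return
    the connected components of that graph as lists of document indices."""
    vectors = tfidf_vectors(docs)
    parent = list(range(len(docs)))  # union-find forest

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for i, j in combinations(range(len(docs)), 2):
        if cosine(vectors[i], vectors[j]) >= threshold:
            parent[find(i)] = find(j)

    groups = {}
    for i in range(len(docs)):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())


if __name__ == "__main__":
    for chain in story_chains(EXAMPLE_DOCS):
        print([EXAMPLE_DOCS[i] for i in chain])
```

Comparing every pair of documents is fine for a handful of headlines but would not scale to tens of thousands of articles, which is one reason search-engine style indexing is attractive for this kind of task.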
The resulting algorithm works pretty well, though it is very difficult to validate properly because of the nature of the data! We use it to pull out a couple of interesting first-order descriptive statistics about news stories in the UK. For example, the graphic above shows the typical evolution of news stories after the publication of an initial article.
Just a draft at the moment so all feedback welcome!