OII | Search results for historical material

Published on
22 Oct 2014

This is a guest post by Jaspreet Singh, a researcher at the L3S Research Center in Hanover. Jaspreet writes:

When people use a commercial search engine to search for information, they represent their intent using a set of keywords. In most cases this is to quickly look up a piece of information and move on to the next task. For scholars, however, the information intent is usually very different from that of the casual user and is often hard to express as keywords. The popularity of the advanced query feature of the BL’s web archive search engine is strong evidence of this.

By working closely with scholars, though, we can gain better insights into their search intents and design the search engine accordingly. In my master’s thesis I focus specifically on search result ranking when the user’s search intent is historical.

Let us consider the user intent, ‘I want to know the history of Rudolph Giuliani, the ex-mayor of New York City’. We can safely assume that history refers to the important time periods and aspects of Rudolph Giuliani’s life. The user would most likely input the keywords ‘rudolph giuliani’ and expect to see a list of documents that give him a general overview of the major historically relevant facts about Giuliani. From here the user can modify his query or filter the results using facets to dig deeper into certain aspects. A standard search engine, however, is unaware of this intent. It only receives keywords as input and tries to serve the documents most relevant to those keywords.

At the L3S Research Center we have developed a prototype search engine specifically for historical search intents. We use temporal and aspect-based search result diversification techniques to serve users with documents which cover a topic’s most important historical facts within the top n results. For example, when searching for ‘rudolph giuliani’ we try to retrieve documents that cover his election campaigns, his mayoralty, his run for the Senate and his personal life, so that the user gets a quick gist of the important facts. Using our system, the user can explore the results by time using an interactive timeline or modify the query. The prototype showcases various state-of-the-art algorithms used for search diversification as well as our own algorithm, ASPTD. We use the New York Times 1987-2007 news archive as our corpus of study. In the interface we present only the top 30 results at a time.
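
To give a feel for what temporal and aspect-based diversification means in practice, here is a minimal Python sketch of a greedy re-ranker. It is not ASPTD, nor any of the algorithms in the prototype. The `Doc` fields, the `diversify` function, the `trade_off` weight and the five-year time buckets are all assumptions made up for illustration. The sketch repeatedly picks the document that best balances keyword relevance against covering (aspect, time period) pairs that the results chosen so far have not covered yet.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Doc:
    """Hypothetical document record; these fields are illustrative assumptions."""
    doc_id: str
    relevance: float      # score from an ordinary keyword ranker, assumed in [0, 1]
    aspects: frozenset    # topical aspects the document covers, e.g. {"mayoralty"}
    year: int             # publication year


def diversify(docs, n, trade_off=0.5, bucket_years=5):
    """Greedily pick n documents, trading relevance against novelty.

    Novelty is the fraction of a document's (aspect, time bucket) pairs
    that no previously selected document has covered. With trade_off=0
    the result is a plain relevance ranking.
    """
    def pairs(doc):
        bucket = doc.year // bucket_years
        return {(aspect, bucket) for aspect in doc.aspects}

    selected, covered = [], set()
    remaining = list(docs)
    while remaining and len(selected) < n:
        def gain(doc):
            doc_pairs = pairs(doc)
            novelty = len(doc_pairs - covered) / len(doc_pairs) if doc_pairs else 0.0
            return (1 - trade_off) * doc.relevance + trade_off * novelty

        best = max(remaining, key=gain)
        selected.append(best)
        covered |= pairs(best)
        remaining.remove(best)
    return selected


# Invented sample data, loosely following the 'rudolph giuliani' example.
sample = [
    Doc("d1", 0.90, frozenset({"mayoralty"}), 1995),
    Doc("d2", 0.85, frozenset({"mayoralty"}), 1996),
    Doc("d3", 0.60, frozenset({"senate run"}), 2000),
    Doc("d4", 0.50, frozenset({"personal life"}), 2002),
]

print([d.doc_id for d in diversify(sample, n=3, trade_off=0.0)])  # ['d1', 'd2', 'd3']
print([d.doc_id for d in diversify(sample, n=3, trade_off=0.5)])  # ['d1', 'd3', 'd4']
```

With `trade_off=0.0` the two mayoralty articles from the same period take the top two places. With `trade_off=0.5` the second one is pushed out in favour of an article on a different aspect, which is the kind of overview a user with a historical intent is looking for.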

In the future, we plan to test our approach on a much larger news archive, such as the 100-year London Times corpus. We also intend to strengthen the algorithm to work with web archives, and to work with the BL to integrate such methods into the current BL web archive search system so that users can explore the archive better.

Link to the system: http://pharos.l3s.uni-hannover.de:7080/ArchiveSearch/starterkit/.

Related Project

Big UK Domain Data for the Arts and Humanities

The Big UK Domain Data for the Arts and Humanities project works with data derived from the UK domain crawl from 1996 to 2013, in order to develop a framework for the study of web archive data and produce a major history of the UK web space.