
Web archives as big data

Published on
27 Jan 2015
Written by
Eric T. Meyer

Peter Webster, a member of the project team, here reflects on the conference we held at the IHR on 3 December. Peter writes:

In early December the project held an excellent day conference on the theme of web archives as big data. A good part of the day was taken up with short presentations from eight of our bursary holders, reflecting both on the substantive research findings they have achieved, and also on the experience of using the SHINE interface and on web archives as source material in general.

In early 2015 these results will appear on this blog as a series of reports, one from each bursary holder. So, rather than attempt to summarise each presentation in turn, this post reflects on some common methodological themes that emerged during the course of the day.

Perhaps the single most prominent note of the whole day was the sheer size of the archive. “Too much data!” was a common cry heard during the project, and with good reason, since there are few other archives in common use with data of this magnitude, at least amongst those used by humanists. In an archive with more than 2 billion resources recorded in the index, the researchers found that queries needed to be a great deal more specific than most users are accustomed to, and that even the slightest ambiguity in the choice of search terms led very quickly to result sets containing many thousands of items. Gareth Millward also drew attention to the difficulty of interpreting patterns over time, across the whole dataset, in the incidence of any but the most specific search terms, since almost any search term a user can imagine may have more than one meaning in an archive of the whole UK web.

One common strategy to come to terms with the size of the archive was to “think small”: to explore some very big data by means of a series of small case studies, which could then be articulated together. Harry Raffal, for example, focused on a succession of captures of a small set of key pages in the Ministry of Defence’s web estate; Helen Taylor on a close reading of the evolution of the content and structure of certain key poetry sites as they changed over time. This approach had much in common with that of Saskia Huc-Hepher on the habitus of the London French community as reflected in a number of key blogs. Rowan Aust also read important things from the presence and absence of content in the BBC’s web estate in the wake of the Jimmy Savile scandal.

An encouraging aspect of the presentations was the methodological holism on display, with this particular dataset being used in conjunction with other web archives, notably the Internet Archive. In the case of Marta Musso’s work on the evolution of the corporate web space, this data was but one part of a broader enquiry employing questionnaire and other evidence in order to create a rounded picture.

One key difference between the SHINE interface and other familiar services is that search results in SHINE are not prioritised by any algorithmic intervention, but are presented in the archival order. This brought into focus one of the recurrent questions in the project: in the context of superabundant data, how attached is the typical user to a search service that (as it were) second-guesses what it was that the user *really* wanted to ask, and presents results in that order? If such a service is what is required, then how transparent must the operation of the algorithm be in order to be trusted? Richard Deswarte powerfully drew attention to how fundamental the effect of Google has been on user expectations of the interfaces they use. Somewhat surprisingly (at least for me), more than one of the speakers was prepared to accept results without such machine prioritisation: indeed, in some senses it was preferable to be able to utilise what Saskia Huc-Hepher described as the “objective power of arbitrariness”. If a query produced more results than could be inspected individually, then both Saskia and Rona Cran were more comfortable with making their own decisions about taking smaller samples from those results than relying on a closed algorithm to make that selection. In a manner strikingly akin to the functionality of the physical library, such arbitrariness also led on occasion to a creative serendipitous juxtaposition of resources: a kind of collage in the web archive.
