IHR workshop on web archiving

Published on
12 Nov 2015
Written by
Eric T. Meyer

On 11 November the IHR held a workshop, ‘An Introduction to Web Archiving for Historians’, for which we welcomed back two old friends from the BUDDAH project as speakers.

The day opened with Jane Winters talking about why historians should be using web archives. You can see the slides of Jane’s talk here, including a couple courtesy of a fascinating presentation from Andy Jackson about the persistence of live web pages. This was followed by Helen Hockx-Yu, formerly of the British Library’s web archiving team but now at the Internet Archive. Helen described the Internet Archive’s mind-boggling scale and its ambitious plans for the future; Helen’s slides are here. Jane then returned to talk about the UK Government Web Archive and Parliament Web Archive (more slides here).

Having heard about various web archives, attendees were able to try the Shine interface for themselves. This is an interface to a huge archive – the UK’s web archive covering 1996 to 2013 – all now searchable as full text. Shine was one of the major outputs of the BUDDAH project and we were delighted to see how fast and responsive it now is, thanks to the continuing work of the web archiving team at the British Library.

Before lunch there was time for Marty Steer to lead a quick canter through the command line tool wget. Marty explained how flexible this tool is for downloading web pages or whole sites (and the importance of using the settings provided to avoid hammering sites with a blizzard of requests). You can even use wget to create complete WARC files. Marty’s presentation, with all of the commands used, can be read here.
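As a rough illustration of the kind of polite, WARC-producing crawl described above, here is a minimal sketch. It is not taken from Marty’s presentation: the function name polite_mirror, the example URL, the WARC base name and the specific wait and rate values are all assumptions chosen for illustration. The wget options themselves (--wait, --random-wait, --limit-rate, --warc-file and so on) are standard, though --warc-file needs wget 1.14 or later.

```shell
#!/usr/bin/env bash
# Sketch only: polite_mirror is a hypothetical helper, not part of wget.
# Set WGET="echo wget" for a dry run that prints the command instead of crawling.
polite_mirror() {
  local url="$1" warc_name="$2"
  # --recursive/--level=1 : follow links, but only one level deep
  # --no-parent           : never climb above the starting directory
  # --page-requisites     : also fetch images, CSS, etc. needed to display pages
  # --wait/--random-wait  : pause (with jitter) between requests
  # --limit-rate          : cap download bandwidth
  # --warc-file           : additionally record the crawl as a WARC file
  ${WGET:-wget} \
    --recursive --level=1 --no-parent \
    --page-requisites \
    --wait=2 --random-wait \
    --limit-rate=200k \
    --warc-file="$warc_name" \
    "$url"
}

# Dry run with a placeholder URL: prints the wget command rather than running it.
WGET="echo wget" polite_mirror "https://example.org/" "example-crawl"
```

Calling polite_mirror without the WGET override would run the crawl for real. The waiting and rate-limiting options are what keep the crawl from hammering the target site, so they are worth keeping even for small jobs.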

After lunch Rowan Aust of Royal Holloway described her research on the BBC’s reaction to the Jimmy Savile scandal and how it has removed Savile from instances of its web and programme archives. Rowan’s earlier account of the research, written for the BUDDAH project, is on our institutional repository.

Then it was back to the command line, as Jonathan Blaney explained how easy it is to interrogate very large text files by typing a few lines of text. On Mac and Linux machines a fully-featured ‘shell’, bash, is provided by default; because this session used Windows machines, Jonathan had installed ‘Git bash’, a free, lightweight version of bash (there are useful installation instructions here). The group looked at links to external websites in the Shine dataset, using a random 0.01% sample of the full link file; this still amounted to about 1.5 million lines (the full file, at 19GB, can be downloaded from the UKWA site). The main command used for this hands-on session was grep, a powerful yet simple search utility which is ideal for searching very large files or large numbers of files.
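To give a flavour of this kind of searching, here is a small sketch using grep on a made-up stand-in for the link file. The layout assumed here (a source URL, a tab, then a target URL on each line) and all of the example URLs are invented for illustration; the real UKWA file may well be structured differently, and these are not the commands used on the day.

```shell
#!/usr/bin/env bash
# Build a tiny stand-in for the link file. The two-column, tab-separated
# layout is an assumption for illustration only.
sample="$(mktemp "${TMPDIR:-/tmp}/links.XXXXXX")"
printf '%s\t%s\n' \
  'http://www.example.co.uk/news'  'http://www.bbc.co.uk/history' \
  'http://www.example.org.uk/'     'http://www.history.ac.uk/' \
  'http://shop.example.co.uk/'     'http://www.BBC.co.uk/news' \
  'http://blog.example.co.uk/post' 'http://archive.org/web' \
  > "$sample"

# Count (-c) the lines that mention bbc.co.uk, ignoring case (-i).
bbc_count="$(grep -ic 'bbc\.co\.uk' "$sample")"
echo "Lines mentioning the BBC: $bbc_count"

# Print matching lines together with their line numbers (-n).
grep -n 'history\.ac\.uk' "$sample"

# Invert the match (-v): count the lines that do not mention the BBC.
other_count="$(grep -vic 'bbc\.co\.uk' "$sample")"
echo "Lines not mentioning the BBC: $other_count"

# Tidy up the temporary file.
rm -f "$sample"
```

Because grep reads its input line by line, the same commands work unchanged on a file of millions of lines; only the file name would need to change.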

The day ended with the group using webrecorder.io, a free online tool which allows the archiving of web pages through a simple and intuitive interface.

We’d like to thank everyone who came to the workshop: this was the first time we had run such an event on web archiving and their enthusiastic participation and constructive feedback have given us the confidence to run this course again in the future.
