Big Data: Demonstrating the Value of the UK Web Domain Dataset for Social Science Research

This project aims to enhance JISC's UK Web Domain archive, a 30 TB archive of the .uk country-code top level domain collected from 1996 to 2010. It will extract link graphs from the data and disseminate social science research using the collection.

Contact:

Professor Helen Margetts

Tel: +44 (0)1865 287207

Email: helen.margetts@oii.ox.ac.uk

Overview

The potential of web archives for link analysis research has been well documented, but this potential has yet to be realised and demonstrated in good research. There is therefore a need to start the hands-on work of processing and analysing domain-scale web collections. This project aims to increase the visibility, accessibility, and ease of use of the JISC UK Web Domain Dataset, a 30 terabyte web archive of the .uk country-code top level domain (ccTLD) collected from 1996 to 2010. The project will extract link graphs from the data, assess the feasibility and impact of using the .uk ccTLD as a boundary for UK web presence, and conduct and disseminate high-quality social science research examples using the collection. It will also trial tools and procedures to make the data more easily accessible, including tools for remote access. In addition, it will assess the feasibility of developing code to import link data from the collection into NodeXL or other network analysis software packages, so that subsets of the corpus can be easily accessed, visualised, and analysed.

Background

The current and transient nature of the Web means that new information replaces older information constantly without any record of the previous state (or versions) of the same information. While new information is being added, existing information also disappears from the Web, leaving a significant gap in our knowledge of the historical web, and potentially in social history and our understanding of change over time.

Serious web archiving efforts started in the UK in 2004 through the UK Web Archiving Consortium, a collaborative project among a number of organisations including the National Libraries, the National Archives, JISC, and the Wellcome Library. Comprehensive collection of websites has not been possible in the UK due to the lack of a national legislative framework, which in a number of other countries has allowed the national libraries to archive publicly available web publications through periodic crawls of the national domain. The UK Web Archive and the UK Government Web Archive are currently the two main archives in the UK, containing archival copies of only a highly selective fraction of UK websites.

JISC recently funded the procurement of the UK Web Domain Dataset, a research dataset of UK websites from the Internet Archive spanning the period 1996-2010. This 'national copy' of the UK web collection is currently managed by the British Library on behalf of JISC. This project involves an exploration of the social science potential of this dataset. It will develop the tools to conduct social science analyses of key subsets of the data, using webometric analyses to look at online structures and the spread of policy issues across websites. The dataset was only procured recently (late 2011), so this project is one of the first to explore and understand it.

Link Analysis and Political Science Questions

Underneath all interaction and content on the web are links. Hyperlinks connect web pages and web sites, links connect users of Twitter and other social media sites, and links connect users with corporations, organisations, and government entities. Link analysis can therefore reveal which entities are central in content discussions and which users play important bridging roles connecting otherwise separate groups. Analysis can reveal fractures between groups online and areas of cohesion, the size of entities, and the relationships between entities. Internet archives allow all of these dimensions to be analysed over time.

For given domains, link analysis can therefore be used to understand the structures of online institutions, their relationships with each other and their interactions with the outside world, as well as their navigability for users. For the UK government second-level domain (encompassing all sites ending in .gov.uk), link analysis can be used to analyse the changing structure of, and relationships between, government departments and agencies; their place in social and informational networks, including the extent to which they are 'nodal'; levels and locations of citizen-government interaction; and government's position as a watchtower with a privileged view of UK society. In addition, "government on the web" as experienced by citizens online can be documented. Such a full picture of the UK government domain has never been drawn before, although a snapshot of central government at one point in time was created by members of the same research team (Dunleavy, Margetts et al., 2007). Using an archival dataset provides a unique opportunity to assess how the structure of e-government (the only part of government that most people interact with) has changed from its conception to the current day.
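As a rough illustration of the kind of longitudinal question described above, the Python sketch below counts, for each crawl year, how many links point into .gov.uk hosts from outside government. The edge records, host names, and weights are invented for the example, and the (year, source host, target host, link count) layout is an assumption rather than the dataset's actual format.

```python
from collections import defaultdict


def is_gov_uk(host):
    """True if the host sits under the .gov.uk second-level domain."""
    return host == "gov.uk" or host.endswith(".gov.uk")


def inbound_gov_links_by_year(edges):
    """Sum link counts from non-government hosts to .gov.uk hosts, per year.

    `edges` is an iterable of (year, source_host, target_host, link_count).
    """
    totals = defaultdict(int)
    for year, source, target, count in edges:
        if is_gov_uk(target) and not is_gov_uk(source):
            totals[year] += count
    return dict(totals)


# Invented sample data, for illustration only.
sample_edges = [
    (2000, "www.example.co.uk", "dept-a.gov.uk", 3),
    (2000, "dept-a.gov.uk", "dept-b.gov.uk", 5),   # government to government: ignored
    (2005, "www.example.org.uk", "dept-b.gov.uk", 4),
    (2005, "www.example.co.uk", "dept-a.gov.uk", 2),
    (2005, "www.example.co.uk", "www.example.org.uk", 7),  # no government target: ignored
]
by_year = inbound_gov_links_by_year(sample_edges)
```

The same grouping could be keyed on the source's second-level domain (.co.uk, .org.uk, and so on) to see which parts of the UK web link to government most heavily in each period.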

Datasets

This project will primarily use and enhance the JISC UK Web Domain Dataset (1996-2010). This dataset is superior to alternatives due to its comprehensive content and extended temporal dimension. Alternative datasets such as the UK Government Web Archive at the National Archives include only selected parts of government presence (i.e. only central government) and, because they capture only government pages, cannot shed light on how sites in other domains (.co.uk, .org.uk, etc.) link to government websites. The JISC UK Web Domain Dataset allows for studies of the entire government presence, including local governments, and for analysis of how other websites link to (and from) government pages. This gives a picture of government's place in social and informational networks across the UK, and of how these patterns change over an extended period of time.

Additional datasets offer great opportunities for comparison, over time and across countries. The principal investigator (Helen Margetts) currently directs an ESRC-funded research programme which involves crawling the web presence of government in seven countries (the US, Canada, Germany, Japan, Australia, the Netherlands and the UK) but no analysis over time. Data from an independent crawl of sites in the .gov.uk domain, carried out by the Oxford Internet Institute in late 2011 for the ESRC programme, offers the opportunity to document the change in government presence from 2010 (when the JISC UK Web Domain Dataset ends) to the present time, as well as introducing a cross-national comparative element.

Methodology

This project will primarily use and enhance the JISC .uk domain web dataset, a 30 terabyte crawl of the .uk country-code top level domain (ccTLD) maintained by the British Library. The data was collected from 1996 to 2010 and holds the raw HTML of all webpages harvested. This project will enhance the sustainability of the collection by:

  • Extracting all hypertext links from the currently unstructured text of the pages and making it easier to directly query this link data.

  • Investigating and recommending access methods and tools for querying, analysing, and visualising the dataset to enable wider use. This will include assessing the feasibility of making the data accessible in NodeXL or another popular open-source software package for analysing graph/link data.

  • Assessing the impact of using the .uk ccTLD as a boundary for the UK web presence.

  • Developing recommendations for future web crawls that will be used for link analysis.

  • Demonstrating the value of this collection and a 'big data' approach through novel and innovative political science research using the collection.
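The link-extraction step in the first bullet can be sketched as follows. This is a minimal illustration in Python using only the standard library, not the project's actual pipeline. The names LinkParser, extract_links, is_uk_host, and host_edges, the sample page, and the choice of a host-level weighted edge list are all assumptions made for this example.

```python
from collections import Counter
from html.parser import HTMLParser
from urllib.parse import urljoin, urlsplit


class LinkParser(HTMLParser):
    """Collects the href value of every <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def extract_links(page_url, html):
    """Return absolute http(s) URLs linked from one archived page."""
    parser = LinkParser()
    parser.feed(html)
    links = []
    for href in parser.hrefs:
        absolute = urljoin(page_url, href)  # resolve relative links
        if urlsplit(absolute).scheme in ("http", "https"):
            links.append(absolute)
    return links


def is_uk_host(host):
    """Hypothetical boundary test: the host falls under the .uk ccTLD."""
    return host == "uk" or host.endswith(".uk")


def host_edges(pages):
    """Aggregate page-level links into weighted host-to-host edges.

    `pages` is an iterable of (page_url, html) pairs.
    Links within the same host are skipped.
    """
    edges = Counter()
    for page_url, html in pages:
        source = urlsplit(page_url).hostname
        for link in extract_links(page_url, html):
            target = urlsplit(link).hostname
            if source and target and source != target:
                edges[(source, target)] += 1
    return edges


# Invented sample page, for illustration only.
sample_pages = [
    (
        "http://www.example.co.uk/index.html",
        '<a href="http://www.example.gov.uk/en/">Gov</a>'
        '<a href="/about.html">About</a>'
        '<a href="mailto:info@example.co.uk">Mail</a>'
        '<a href="http://www.example.com/">Abroad</a>',
    ),
]
edges = host_edges(sample_pages)
uk_only = {pair: n for pair, n in edges.items() if is_uk_host(pair[1])}
```

Comparing edges with uk_only gives one crude way to see how many outbound links a .uk boundary would exclude. The resulting (source, target, weight) triples could be written out as a plain edge list for loading into network analysis software, although the exact import format would depend on the package chosen.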

Support

This project is supported by JISC.
