14 Aug 2014
Continuing with our series of blog posts exposing the workings behind a multidisciplinary big data project, we talk this week about the process of moving between small data and big data analyses. Last week, we did a group deep dive into our data. Extending the metaphor: Shilad caught the fish and dumped them on the boat for us to sort through. We wanted to know whether our method of collecting and determining the origins of the fish was working by looking at a bunch of randomly selected fish up close. Working out how we would do the sorting was the biggest challenge. Some of us liked really strict rules about how we were identifying the fish. ‘Small’ wasn’t a good enough description; better would be that small = 10-15cm diameter after a maximum of 30 minutes out of the water. Through this process we learned a few lessons about how to do this close-looking as a team.
Step 1: Randomly selecting items from the corpus
We wanted to know two things about the data that we were selecting through this ‘small data’ analysis: Q1) Were we getting every citation in the article or were we missing/duplicating any? Q2) What was the best way to determine the location of the source?
Shilad used the WikiBrain software library he developed with Brent to identify all roughly one million geo-tagged Wikipedia articles. He then collected all external URLs (about 2.9 million unique URLs) appearing within those articles and used this data to create two samples for coding tasks. He sampled about 50 geotagged articles (to answer Q1) and selected a few hundred random URLs cited within particular articles (to answer Q2).
- Batch 1 for Q1: 50 documents each containing an article title, url, list of citations, empty list of ‘missing citations’
- Batch 2 for Q2: Spreadsheet of 500 random citations occurring in 500 random geotagged articles.
Example from batch 1:
|Coding for Montesquiu
Initials of person who coded this:
Example from batch 2:
|url||domain||effective domain||article||article url|
|books.google.ca||google.ca||Teatro Calderón (Valladolid)||http://en.wikipedia.org/
For batch 1, we looked up each article and made sure that the algorithm we were using was catching all the citations. We found that there were a few anomalies where there was a duplication of citations (for example, when a single citation contained two urls: one to the ISBN address and another to a Google books url) or when we were missing citations (when the API was only listing a URL once when it had been used multiple times or when a book was cited without a url, for example) or when we were getting incorrect citations (when the citation url pointed to the Italian National Institute of Statistics (Istat) article on Wikipedia rather than the Istat domain).
The town of El Bayad in Libya contained two citations that weren’t included in the analysis because they didn’t contain a url, for example. One appears to be a newspaper and the other a book, but I couldn’t find the citations online. These would not be included in the analysis but it was the only example like this:
- Amraja M. el Khajkhaj, “Noumou al Mudon as Sagheera fi Libia”, Dar as Saqia, Benghazi-2008, p.120.
- Al Ain newspaper, Sep. 26; 2011, no. 20, Dar al Faris al Arabi, p.7.
We listed each of these anomalies in order to work out a) whether we can accommodate them in the algorithm or whether b) there are so few of them that they probably won’t affect the analysis too heavily.
Step 2: Developing a codebook and initial coding
I took the list of 500 random citations in batch 2 and went through each one to develop a new list of 100 working URLs and a codebook to help the others code the same list. I discarded 24 dead links and developed a working definition for each code in the codebook.
The biggest challenge when trying to locate citations in Wikipedia is whether to define the location according to the domain that is being pointed to, or whether one should find the original source. Google books urls are the most common form of this challenge. If a book is cited and the url points to its Google books location, do we cite the source as coming from Google or from the original publisher of the work?
My initial thought was to define URL location instead of original location — mostly because it seemed like the easiest way to scale up the analysis after this initial hand coding. But after discussing it, I really appreciated when Brent said, ‘Let’s just start this phase by avoiding thinking like computer scientists and code how we need to code without thinking about the algorithm.’ Instead, we tried to use this process as a way to develop a number of different ways of accurately locating sources and to see whether there were any major differences afterwards. Instead of using just one field for location, we developed three coding categories.
Country where the article’s subject is located | Country of the original publisher | Country of the URL publisher
We’ll compare these three to the:
Country of the administrative contact for the URL’s domain
that Shilad and Dave are working on extracting automatically.
When I first started doing the coding, I was really interested in looking at other aspects of the data such as what kinds of articles are being captured by the geotagged list, as well as what types of sources are being employed. So I created two new codes: ‘source type’ and ‘article subject’. I defined the article subject as: ‘The subject/category of the article referred to in the title or opening sentence e.g. ‘Humpety is a village in West Sussex, England’ (subject: village)’. I defined source type as ‘the type of site/document etc that *best* describes the source e.g. if the url points to a list of statistics but it’s contained within a newspaper site, it should be classified as ‘statistics’ rather than ’newspaper’.
Coding categories based on example item above from batch 2:
|subject||subject country||original publisher location||URL publisher location||language||source type|
In our previous project we divided up the ‘source type’ into many facets. These included the medium (e.g. website, book etc) and the format (statistics, news etc). But this can get very complicated very fast because there are a host of websites that do not fall easily into these categories. A url pointing to a news report by a blogger on a newspaper’s website, for example, or a link to a list of hyperlinks that download as spreadsheets on a government website. This is why I chose to use the ‘best guess’ for the type of source because choosing one category ends up being much easier than the faceted coding that we did in the previous project.
The problem was that this wasn’t a very conclusive definition and would not result in consistent coding. It is particularly problematic because we are doing this project iteratively and we want to try to get as much data as possible so that we have it if we need it later on. After much to-ing and fro-ing, we decided to go back to our research questions and focus on those. The most important thing that we needed to work out was how we were locating sources, and whether the data changed significantly depending on what definition we used. So we decided not to focus on the article type and source type for now, choosing instead to look at the three ways of coding location of sources so that we could compare them to the automated list that we develop.
This has been the hardest part of the project so far, I think. We went backwards and forwards a lot about how we might want to code this second set of randomly sampled citations. What definition of ‘source’ and ‘source location’ should we use? How do we balance the need to find the most accurate way to catch all outliers and a way that we could abstract into an algorithm that would enable us to scale up the study to look at all citations? It was a really useful exercise, though, and we have a few learnings from it.
– When you first look at the data, make sure you all do a small data analysis using a random sample;
– When you do the small data analysis, make sure you suspend your computer scientist view of the world and try to think about what is the most accurate way of coding this data from multiple facets and perspectives;
– After you’ve done this multiple analysis, you can then work out how you might develop abstract rules to accommodate the nuances in the data and/or to do a further round of coding to get a ‘ground truth’ dataset.
In this series of blog posts, a team of computer and social scientists including Heather Ford, Mark Graham, Brent Hecht, Dave Musicant and Shilad Sen are documenting the process by which a group of computer and social scientists are working together on a project to understand the geography of Wikipedia citations. Our aim is not only to better understand how far Wikipedia has come to representing ‘the sum of all human knowledge’ but to do so in a way that lays bare the processes by which ‘big data’ is selected and visualized.