
Automating the Analysis of Unstructured Text on the Internet

Date & Time:
13:00:00 - 14:00:00,
Tuesday 29 March, 2011


The Internet provides a massive and ever-changing ‘firehose’ of unstructured text data, including millions of moderately stable web pages and data repositories along with rapidly changing data such as news stories, blogs, tweets, and emails. These sources provide information on wide-ranging topics, including behavior on the Internet, how the Internet influences aspects of everyday life, the structure of the Internet itself, and issues of importance to social science disciplines, business, government, and the military. The volume of data and rapid flow of new information render traditional qualitative analysis ineffective, while the diversity, complexity, and lack of structure raise serious validity concerns for newer methods such as text mining and sentiment analysis.

This talk illustrates one approach to this problem based on Veyor, a new software service. The approach uses statistical quality control processes to train the program to code data in a limited domain until it meets targeted standards of reliability and validity. Once trained, the program can automatically code massive ongoing data streams, with only minimal intervention by human coders to maintain those standards. The result is high-quality, human-directed qualitative coding that can handle massive volumes and ongoing data streams while meeting standards of reliability and validity that exceed those of standard practice.
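As a rough illustration of what a reliability target can mean in practice, the sketch below compares a program's codes against a human coder's codes on the same sample and checks the agreement against a threshold. It is not Veyor's actual procedure, which this abstract does not describe: the choice of Cohen's kappa as the agreement statistic, the function names, and the 0.8 target are all assumptions made for the example.

```python
from collections import Counter


def cohens_kappa(human, machine):
    """Cohen's kappa for two equal-length lists of category labels."""
    if len(human) != len(machine) or not human:
        raise ValueError("label lists must be non-empty and of equal length")
    n = len(human)
    # Proportion of items on which the two coders agree.
    observed = sum(h == m for h, m in zip(human, machine)) / n
    # Agreement expected by chance, from each coder's label frequencies.
    human_counts = Counter(human)
    machine_counts = Counter(machine)
    expected = sum(human_counts[c] * machine_counts[c] for c in human_counts) / (n * n)
    if expected == 1.0:
        # Both coders used a single identical label throughout.
        return 1.0
    return (observed - expected) / (1 - expected)


def meets_standard(human, machine, target=0.8):
    """True if agreement reaches the (assumed) reliability target."""
    return cohens_kappa(human, machine) >= target


# Hypothetical sample: four text items coded by a human and by the program.
human_codes = ["a", "a", "b", "b"]
machine_codes = ["a", "a", "b", "a"]
kappa = cohens_kappa(human_codes, machine_codes)
needs_more_training = not meets_standard(human_codes, machine_codes)
```

In a quality-control loop of this kind, a sample that falls below the target would be sent back for further human coding and retraining before the program is allowed to code the stream unattended.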

This approach will be illustrated with two example research applications. The first study examined political discourse on the Internet to predict the results of the 2010 US Congressional elections, and illustrates the continuous automated monitoring of an ongoing stream of data on the Internet. The second study analyzed student responses to an open-ended question about why they might cheat, and illustrates how Veyor can be trained and validated for one study and then applied to subsequent studies at reduced cost and increased speed, while assuring a higher standard of consistency across studies than traditional methods permit.



University of Missouri