Automating the Analysis of Unstructured Text on the Internet

Date & Time:
13:00 - 14:00, Tuesday 29 March 2011

About

The Internet provides a massive and ever-changing ‘firehose’ of unstructured text data, including millions of moderately stable web pages and data repositories along with rapidly changing data such as news stories, blogs, tweets, and emails. These sources provide information on wide-ranging topics, including behavior on the Internet, how the Internet influences aspects of everyday life, the structure of the Internet itself, and issues of importance to social science disciplines, business, government, and the military. The volume of data and the rapid flow of new information render traditional qualitative analysis ineffective, while the diversity, complexity, and lack of structure raise serious validity concerns for newer methods such as text mining and sentiment analysis.

This talk illustrates one approach to this problem based on Veyor, a new software service. The approach uses statistical quality control processes to train the program to code data in a limited domain to meet targeted standards of reliability and validity. Once trained, the program can automatically code massive ongoing data streams with only minimal intervention by human coders to maintain standards. This permits high-quality, human-directed qualitative coding that can handle massive volumes and ongoing data streams while meeting standards of reliability and validity that exceed those of standard practice.
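The abstract does not say which reliability statistic Veyor uses, so the following is only a minimal sketch of the general idea: comparing automated codes against human codes on a sample and flagging when agreement falls below a target. It assumes Cohen's kappa as the agreement measure and a target of 0.8; the function names, the threshold, and the sample labels are all hypothetical and are not taken from Veyor.

```python
from collections import Counter


def cohens_kappa(human, machine):
    """Chance-corrected agreement between two equal-length lists of codes."""
    if len(human) != len(machine) or not human:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(human)
    # Share of items on which the two coders assigned the same code.
    observed = sum(h == m for h, m in zip(human, machine)) / n
    # Agreement expected by chance, from each coder's category frequencies.
    human_counts = Counter(human)
    machine_counts = Counter(machine)
    expected = sum(human_counts[c] * machine_counts[c] for c in human_counts) / (n * n)
    if expected == 1.0:
        # Both coders used one identical category throughout.
        return 1.0
    return (observed - expected) / (1 - expected)


def needs_human_review(human, machine, target=0.8):
    """True when agreement on the audit sample falls below the assumed target."""
    return cohens_kappa(human, machine) < target


# Hypothetical audit sample: human codes versus automated codes.
human_codes = ["pos", "pos", "neg", "neg", "pos", "neg", "pos", "neg", "pos", "neg"]
machine_codes = ["pos", "pos", "neg", "neg", "pos", "neg", "pos", "neg", "pos", "pos"]
print(round(cohens_kappa(human_codes, machine_codes), 3))
print(needs_human_review(human_codes, machine_codes, target=0.9))
```

In a quality-control setting, a check of this kind would be repeated on periodic samples of the data stream, with human coders stepping in only when the statistic drops below the chosen standard.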

This approach will be illustrated with two example research applications. The first study examined political discourse on the Internet to predict results in the 2010 US Congressional elections, and illustrates the continuous automated monitoring of an ongoing stream of data on the Internet. The second study analyzed student responses to an open-ended question about why they might cheat, and illustrates how Veyor can be trained and validated for one study and then applied to subsequent studies at reduced cost and increased speed, while assuring a higher standard of consistency across studies than traditional methods permit.

Papers

University of Missouri

Oxford Internet Institute
