Mapping collective public opinion in the Russian blogosphere
Blogs are becoming increasingly important for agenda setting and formation of collective public opinion on a wide range of issues — particularly in countries like Russia where the Internet is not technically filtered, but where the traditional media is tightly controlled by the state. Olessia Koltsova, author (with Sergei Koltcov) of the Policy and Internet paper Mapping the public agenda with topic modeling: The case of the Russian livejournal discusses how topic modeling and sentiment analysis techniques can be used to monitor self-generated public opinion.
Blogs are becoming increasingly important for agenda setting and formation of collective public opinion on a wide range of issues. In countries like Russia where the Internet is not technically filtered, but where the traditional media is tightly controlled by the state, they may be particularly important. The Russian language blogosphere counts about 85 million blogs – an amount far beyond the capacities of any government to control – and the Russian search engine Yandex, with its blog rating service, serves as an important reference point for Russia’s educated public in its search of authoritative and independent sources of information. The blogosphere is thereby able to function as a mass medium of “public opinion” and also to exercise influence.
One topic that was particularly salient over the period we studied concerned the Russian Parliamentary elections of December 2011. Widely reported as fraudulent, they provoked immediate and mass street protest action by tens of thousands of people in Moscow and cities and towns across Russia, as well as corresponding activity in the blogosphere. Protesters made effective use of the Internet to organize a movement that demanded cancellation of the parliamentary election results, and the holding of new and fair elections. These protests continued until the following summer, gaining widespread national and international attention.
Most of the political and social discussion blogged in Russia is hosted on the blog platform LiveJournal. Some of these bloggers can claim a certain amount of influence; the top thirty bloggers have over 20,000 “friends” each, representing a good circulation for the average Russian newspaper. Part of the blogosphere may thereby resemble the traditional media; the deeper into the long tail of average bloggers, however, the more it functions as more as pure public opinion. This “top list” effect may be particularly important in societies (like Russia’s) where popularity lists exert a visible influence on bloggers’ competitive behavior and on public perceptions of their significance. Given the influence of these top bloggers, it may be claimed that, like the traditional media, they act as filters of issues to be thought about, and as definers of their relative importance and salience.
Gauging public opinion is of obvious interest to governments and politicians, and opinion polls are widely used to do this, but they have been consistently criticized for the imposition of agendas on respondents by pollsters, producing artefacts. Indeed, the public opinion literature has tended to regard opinion as something to be “extracted” by pollsters, which inevitably pre-structures the output. This literature doesn’t consider that public opinion might also exist in the form of natural language texts, such as blog posts, that have not been pre-structured by external observers.
There are two basic ways to detect topics in natural language texts: the first is manual coding of texts (ie by traditional content analysis), and the other involves rapidly developing techniques of automatic topic modeling or text clustering. The media studies literature has relied heavily on traditional content analysis; however, these studies are inevitably limited by the volume of data a person can physically process, given there may be hundreds of issues and opinions to track — LiveJournal’s 2.8 million blog accounts, for example, generate 90,000 posts daily.
For large text collections, therefore, only the second approach is feasible. In our article we explored how methods for topic modeling developed in computer science may be applied to social science questions – such as how to efficiently track public opinion on particular (and evolving) issues across entire populations. Specifically, we demonstrate how automated topic modeling can identify public agendas, their composition, structure, the relative salience of different topics, and their evolution over time without prior knowledge of the issues being discussed and written about. This automated “discovery” of issues in texts involves division of texts into topically — or more precisely, lexically — similar groups that can later be interpreted and labeled by researchers. Although this approach has limitations in tackling subtle meanings and links, experiments where automated results have been checked against human coding show over 90 percent accuracy.
The computer science literature is flooded with methodological papers on automatic analysis of big textual data. While these methods can’t entirely replace manual work with texts, they can help reduce it to the most meaningful and representative areas of the textual space they help to map, and are the only means to monitor agendas and attitudes across multiple sources, over long periods and at scale. They can also help solve problems of insufficient and biased sampling, when entire populations become available for analysis. Due to their recentness, as well as their mathematical and computational complexity, these approaches are rarely applied by social scientists, and to our knowledge, topic modeling has not previously been applied for the extraction of agendas from blogs in any social science research.
The natural extension of automated topic or issue extraction involves sentiment mining and analysis; as Gonzalez-Bailon, Kaltenbrunner, and Banches (2012) have pointed out, public opinion doesn’t just involve specific issues, but also encompasses the state of public emotion about these issues, including attitudes and preferences. This involves extracting opinions on the issues/agendas that are thought to be present in the texts, usually by dividing sentences into positive and negative. These techniques are based on human-coded dictionaries of emotive words, on algorithmic construction of sentiment dictionaries, or on machine learning techniques.
Both topic modeling and sentiment analysis techniques are required to effectively monitor self-generated public opinion. When methods for tracking attitudes complement methods to build topic structures, a rich and powerful map of self-generated public opinion can be drawn. Of course this mapping can’t completely replace opinion polls; rather, it’s a new way of learning what people are thinking and talking about; a method that makes the vast amounts of user-generated content about society – such as the 65 million blogs that make up the Russian blogosphere — available for social and policy analysis.
Naturally, this approach to public opinion and attitudes is not free of limitations. First, the dataset is only representative of the self-selected population of those who have authored the texts, not of the whole population. Second, like regular polled public opinion, online public opinion only covers those attitudes that bloggers are willing to share in public. Furthermore, there is still a long way to go before the relevant instruments become mature, and this will demand the efforts of the whole research community: computer scientists and social scientists alike.
Read the full paper: Olessia Koltsova and Sergei Koltcov (2013) Mapping the public agenda with topic modeling: The case of the Russian livejournal. Policy and Internet 5 (2) 207–227.
Also read on this blog: Can text mining help handle the data deluge in public policy analysis? by Aude Bicquelet.
González-Bailón, S., A. Kaltenbrunner, and R.E. Banches. 2012. “Emotions, Public Opinion and U.S. Presidential Approval Rates: A 5 Year Analysis of Online Political Discussions,” Human Communication Research 38 (2): 121–43.