On 13 September 2013, I attended a great workshop at the Harvard Faculty Club organized by my colleagues Vicki Nash and Helen Margetts of the Oxford Internet Institute and the journal Policy and Internet on the topic of “Responsible Research Agendas for Public Policy in the Era of Big Data”. The assembled body included not just the usual complement of academics, but also a significant number of people working in, and closely with, U.S. governmental bodies. The participants from agencies such as the FCC, Federal Reserve Board, Bureau of Labor Statistics, and Census Bureau added immeasurably to our day of discussions on how to understand the relationship between big data and public policy.

I won’t recount the whole day, which ranged from discussions of how to move toward systems-level thinking using big data once we have a ‘map the size of the world’ at our disposal to debates about how big data can contribute both to the public good in the best cases and, conversely, to the public ‘bad’ in the most troubling cases. Instead, let me draw out just a couple of themes that I found most interesting.

A key point that arose on several occasions has to do with the disconnect between the data that governments at all levels (i.e. national, state/regional, local) potentially have access to and the data they actually have access to in practice. While recent revelations about the NSA have many thinking of western governments as panopticons, at the agency level it is clear that the situation is far more complex. For example, government agencies that gather sales data and labor statistics to aid decision-making on economic policy traditionally have access to rather broad-brush data about things like when raw materials leave the U.S. and when goods return. The global firms which are actually manufacturing and marketing the final goods, however, have extremely detailed records about the entire chain of production and sales, but do not have any regulatory requirement to turn these data over to the government, even though they could potentially contribute to policy decisions based on much more complete information. It was suggested during the workshop that one option would be to approach these firms and ask them to turn the data over to appropriate agencies out of patriotic duty, but of course the flip side is that many firms have a self-interest in avoiding regulation, and may feel that staying a step ahead of governments gives them an advantage. Additionally, of course, these data are understood to be key assets for companies, and sharing them could easily be seen as undermining their competitiveness. Several participants argued that the growing gap between leading industries and lagging governmental bodies results in regulations that are doomed to failure, as the regulators aren’t keeping up with the industries they are meant to be regulating.

It was pointed out, however, that even though the private sector clearly has more big data available to it at the moment than either academics or (some sectors of) the government, we shouldn’t fall prey to thinking that any sector is surging ahead in the careful and innovative use of big data. Clearly, all have ample evidence of naïve or downright poor uses of big data, even while the examples of powerful uses of big data are the most hyped.

One point I noted throughout the day as the group discussed big data for public policy is that there are clearly different levels of policy that were blurring together in our discussion. While some issues overlap, the issues involved in how financial regulators can draw on data-driven approaches to monitoring and regulating the banking sector differ from those involved in how local governments can use data and algorithms to better inform public-facing case workers who are facing benefits claimants across the desk. Both settings have considerable potential for improving policy by using data and algorithms to reach more effective decisions regarding their constituents. However, bankers are more likely both to have considerable expertise at their disposal to understand the algorithms upon which these decisions are based and to be able to push back against any decisions they disagree with. For the benefits claimant, the potential for being faced with inflexible decisions when neither side of the desk understands the complex underlying algorithms driving these decisions is much higher.

The risks, it was argued during the workshop, can be thought of in several key categories, including the risk of big data leading to more unequal treatment of individuals, the risk of misuse of data, and the risk that big data insights will become so compelling that we can’t, at some level, avoid acting on them, but then risk facing a backlash from portions of the public unwilling to accept the power of the algorithms as justification for public policy.

The final portion of the day grappled with how best to train data scientists to work with these data. It was generally agreed that the data scientist of the future needs skills that go far beyond technical programming skills or narrow analytical skills. The ability to work on multidisciplinary teams is key, and the data scientists who want to lead these teams will need to be able to speak multiple ‘languages’ (in the sense of the language of engineering, of science, of management, etc.). They will need diplomatic skills, and enough understanding of the domains they are working in to know what makes sense, rather than just what the data may appear to show, in order to avoid acting on false correlations. Data scientists need to become really smart at posing questions, at knowing which data points are most useful to have access to, and at understanding which uses of data are most beneficial for society and which are most harmful to democracy and society.

All these discussions will feed my thinking as we continue to work on our project, which is aimed at getting beyond the hype around big data and discovering what the actual practices, research possibilities, and strategies for working with big data are in the social sciences.