We live in a world ruled by data in all realms, not just the scientific or mathematical but the political and the personal. This comes with both benefits and costs. The benefits are well known. The unprecedented access to evidence allows for more detailed analysis and more informed research, for instance. The costs, on the other hand, are typically tied to the ethical problems raised by data collection: invasion of privacy, digital dossiers, and database misuse. The influx of data and our increasing willingness to turn to it, however, generate a more pernicious problem, one closely associated with what makes our surplus of data so great. We erroneously assume data is equivalent to knowledge, and in doing so grossly overstate our ability to know.

In research, there is the notion of validity, or whether a particular metric accurately captures the phenomenon one is trying to measure. One can descend into the infinite regress of precision, complaining that any measurement is insufficiently exact or is merely a proxy. In the name of every person’s sanity, we state at a certain point that enough is enough and carry on.

The same is generally true for data collection. We can’t subject the world to constant census, which is why we tend to settle for random and representative samples. Now, however, we have datasets that auto-generate and are, theoretically, comprehensive. We can draw better conclusions because we have the whole “population”. Auto-generated datasets exist about very significant things. Metadata alone can paint quite the picture. Gain access to social media profiles, credit card statements, or browser histories, though, and what can’t we prove?

This may be a less relevant question than one is led to believe, or its inverse might be more apropos: what can we prove? The issue of validity remains: are our data points truly tethered to the phenomena we’re investigating, or are they ineffective proxies? Further, what is the consequence of assuming all this data amounts to proof?

A data-centric culture asserts that things exist absolutely and can be known and proven. It therefore assumes a number of binary distinctions are at play, “objectively true” or “objectively not true” being the most important one. This has two consequences. First, our ability to “prove” leads to a desire to prove more, and our need for data continually increases. Helen Nissenbaum, in “Privacy in Context,” puts this best: our faith in information generates an “unquenchable thirst that can only be slaked by more information, fueling information-seeking behaviors of great ingenuity backed by determined and tenacious hoarding of its lodes.” Second, the focus on data distracts from those drawing conclusions from it, who remain not only intimately involved but ultimately at the wheel. Having access to evidence is good, but we can’t forget that data doesn’t tell stories. The people who analyze it do.

At a macro level, when we talk about data and information and knowledge, we tend to use a hierarchy. Data constitute the raw material. Information, perhaps, is what can be constructed from these elements, and knowledge is a type of “end” product that results from interpretation. From this vantage point, it’s clear that each transition is a level of abstraction: necessary, perhaps, to derive meaning from white noise, but abstraction nonetheless. The very act of negotiating an abstraction, however, involves intention, agency, and angle. There is a pattern one is seeking out or a finding one wants to reinforce. Thus, embedded within the generation of knowledge from data are junctures at which perspective must be injected for analysis to continue. That fundamental reliance on subjectivity flips the premise of data-as-unquestionable-proof on its head.

Many recognize, as I do, the obnoxious intellectual high ground occupied by critique and its status as a dead end for generativity (unless we critique the critique… of the critique). So let me be clear: I don’t think data is bad. I am not opposed to data, or big data, or the use of big data to draw conclusions. I merely want to stress that simply having access to data doesn’t solve everything. We should not assume that data is knowledge, or more importantly that our analysis of data is necessarily proof. There are implications, of course, beyond pedantic discussions of the nature of “proof” and the subjectivity of analysis. Nissenbaum, again in “Privacy in Context,” points out that our interest in data is not new, but that the efficient, automated, and impersonal nature of its collection is. A real issue with data analysis, one entirely separate from the dilemma addressed here, is information and identity security, which are inherently threatened by the collection of data itself. Before aggregating data, then, we ought to consider just how much we’ll really be able to learn, and whether it’s worth it.

This piece appears in full on the author’s personal blog.