Ever since I launched this blog, I have wanted to write something about the dangers of big data: the things that can easily go wrong when you study large-scale transactional data. Obviously, I haven’t done this!

But recently we (Bertie, my PhD student, and I) finished a paper titled “P-values: misunderstood and misused”.

Of course, statistical “misunderstanding” is one of the dangers of big data. Calculating p-values has become the most widely used way to demonstrate the “significance” of an analysis. However, as we say in the abstract:

P-values are widely used in both the social and natural sciences to quantify the statistical significance of observed results. The recent surge of big data research has made the p-value an even more popular tool to test the significance of a study. However, substantial literature has been produced critiquing how p-values are used and understood. In this paper we review this recent critical literature, much of which is rooted in the life sciences, and consider its implications for social scientific research. We provide a coherent picture of what the main criticisms are, and draw together and disambiguate common themes. In particular, we explain how the False Discovery Rate is calculated, and how this differs from a p-value. We also make explicit the Bayesian nature of many recent criticisms, a dimension that is often underplayed or ignored. We also identify practical steps to help remediate some of the concerns identified, and argue that p-values need to be contextualised within (i) the specific study, and (ii) the broader field of inquiry.
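
To give a feel for how a False Discovery Rate can differ from a p-value threshold, here is a minimal sketch in Python. It is not taken from the paper. The function name and the three inputs (the fraction of tested hypotheses that are really true, the statistical power, and the significance level) are assumptions chosen purely for illustration, and the framing is one common textbook way of setting up the calculation.

```python
def false_discovery_rate(prior, power, alpha):
    """Expected share of 'significant' results that are false positives.

    prior: assumed fraction of tested hypotheses where the effect is real
    power: assumed probability of detecting a real effect
    alpha: significance threshold applied to the p-value
    All three values are illustrative assumptions, not estimates from data.
    """
    true_positives = prior * power          # real effects that pass the test
    false_positives = (1 - prior) * alpha   # null effects that pass by chance
    return false_positives / (true_positives + false_positives)


# Hypothetical scenario: 10% of hypotheses are true, power 0.8, alpha 0.05.
fdr = false_discovery_rate(prior=0.1, power=0.8, alpha=0.05)
print(f"Share of significant results that are false: {fdr:.2f}")
```

Under these made-up numbers, roughly a third of the “significant” findings would be false, even though every one of them passed the p < 0.05 threshold. The result is very sensitive to the assumed prior, which is where the Bayesian flavour of the argument comes in.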

“Significant”, taken from http://xkcd.com/882/

Note: This post was originally published on Taha Yasseri's blog on . It might have been updated since then in its original location. The post gives the views of the author(s), and not necessarily the position of the Oxford Internet Institute.