Data Anonymisation, Partial Ownership, and Highly Correlated Data Sets

With Dr Joss Wright
Date & Time:
17:00 - 18:30, Wednesday 12 February 2014

About

The wealth of information being gathered and made available for analysis in both public and private settings has had a significant impact on our society. The rate at which data is gathered, stored, and analysed is accelerating, with accordingly significant positive and negative effects.

Data concerning individuals carries with it risks of identification of those individuals in the dataset and the revelation of sensitive attributes held by them. More importantly, even seemingly harmless datasets contribute to an ever-increasing landscape of auxiliary data sources that persist, and that may contribute to the compromise of individuals in the future.

This seminar will explore technical means by which data is traditionally anonymised and highlight how these anonymisation methods fail, particularly when compounded by various key features of genomic data. These failures have potentially serious consequences for data sharing and reproducibility in medical research.

We will conclude with a number of open questions regarding future research practices on highly sensitive and highly correlated data sets, and whether these practices may be made consistent with best principles of scientific reproducibility, open access to data, and data protection.

About the Seminar Series

The rapidly declining cost of genomic sequencing promises many breakthroughs in our understanding of genetic predisposition to disease and in the development of medical treatments more precisely tailored to the individual patient. Much of this genomic data will end up in databases maintained by research and healthcare organisations (and increasingly by commercial “personal genomics” companies), which will have the ethical and legal responsibility for preserving the privacy of such sensitive information. Unfortunately, recent research suggests that it is much more difficult than was first imagined to preserve the privacy of such information. Many existing methods for “de-identifying” or “anonymising” such data have been shown to be fragile: correlation of information from genomic databases, electronic health records and public sources such as genealogy and residence databases can often lead to surprisingly accurate inferences about the identities of individuals. If such information were to become widely available, it might compromise the ability of individuals to obtain health and life insurance, and might influence employment and even personal relationship decisions. Such information leakage might well also have a significant chilling effect on the public’s willingness to participate in research and clinical studies.

We are organising a series of seminars, funded by the Balliol Interdisciplinary Institute, to examine the current state of information privacy in this domain, and to look in particular at several questions:

  • To what extent can technology keep up with the arms race between “hackers” and data curators? Will recent advances in cryptography, database security architectures and “privacy preserving” data mining methods mitigate the risks, now and in the future?
  • What is the current state of legislation and regulation in this domain, and how is it likely to evolve in the face of developing attacks on privacy? Who actually owns and has control over genomic (and related health) data and its uses? Are there significant national and cultural differences which need to be taken into account (especially when data storage may transcend jurisdictional boundaries, e.g. when data are stored in commercial “clouds”)?
  • To what extent does the appearance of patient-centric disease management portals such as PatientsLikeMe mitigate the concerns about privacy? Will patients’ altruistic urge to share information about themselves, their disease and their interactions with the healthcare system outweigh their concerns about their personal privacy? What is the appropriate balance between the public good which results from data sharing and the potential private loss?
  • What changes need to be made to informed consent protocols to ensure that both researchers and donors fully understand and accept the risks associated with data collection and use?
  • If, as Scott McNealy (former CEO of Sun Microsystems) once said, “Privacy is dead – get used to it,” and privacy is doomed to lose the arms race, what is the impact likely to be on public attitudes towards, and expectations of, personal genomic privacy? In a world where people are willing to commit intimate personal information to Facebook, should we even worry about the consequences of loss of genomic privacy? Or should we rather be addressing the issues inherent in completely open sharing of such information?

Answers to some or all of the above questions would have a profound impact on the practice of scientific research and medicine. A clear analysis of the risks, methods for mitigating those risks, and, alternatively, of the consequences of a deliberate policy of transparency, will help policy makers to develop realistic approaches to public education about, and the setting of guidelines for future research on, and exploitation of, personal genomic information.