Yves-Alexandre de Montjoye thinks there’s a problem with how we anonymise data
Bryce Vickmark
By Donna Lu

To protect privacy, data collected about us is sometimes anonymised before being used, such as for scientific research or by advertising companies wanting to hone their algorithms. The process involves removing personally identifiable information – including direct identifiers like names or photographs, and combinations of indirect identifiers such as workplace, occupation, salary and age.
Data anonymisation is supposed to be irreversible, but it is relatively easy to reverse engineer the process, as Yves-Alexandre de Montjoye at Imperial College London and his colleagues have found. This is because the more pieces of data you have about someone, the more likely it becomes that they are the only person who fits the bill. However, all is not lost. New techniques will help the fight for privacy, as de Montjoye explains.
What have you found?
We developed a machine learning model to assess the likelihood of reidentifying the right person. We took data sets and showed that in the US, 15 characteristics, including age, gender, marital status and others, are sufficient to reidentify 99.98 per cent of Americans in virtually any anonymised data set.
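To get a feel for why a handful of attributes can pinpoint someone, here is a minimal sketch – not the researchers’ actual model – that counts how many records in a data set share each combination of indirect identifiers. The data and column names are hypothetical; the point is simply that the more attributes you combine, the larger the share of records that becomes unique.

```python
# Minimal illustrative sketch (not the paper's model): how the fraction of
# records that are unique grows as more quasi-identifiers are combined.
# All data and column names below are made up.
import random
from collections import Counter

random.seed(0)

# Hypothetical "anonymised" records: no names, only indirect identifiers.
people = [
    {
        "age": random.randint(18, 80),
        "gender": random.choice(["F", "M"]),
        "zip3": random.choice(["100", "101", "112", "104"]),
        "marital": random.choice(["single", "married", "divorced"]),
        "kids": random.randint(0, 4),
        "car_colour": random.choice(["red", "blue", "black", "white", "grey"]),
    }
    for _ in range(10_000)
]

def unique_fraction(records, attributes):
    """Fraction of records whose combination of `attributes` appears only once."""
    combos = Counter(tuple(r[a] for a in attributes) for r in records)
    return sum(1 for r in records
               if combos[tuple(r[a] for a in attributes)] == 1) / len(records)

cols = ["age", "gender", "zip3", "marital", "kids", "car_colour"]
for k in range(1, len(cols) + 1):
    print(f"{k} attributes -> {unique_fraction(people, cols[:k]):.1%} of records unique")
```

On this toy data the fraction of unique records climbs from essentially zero with one attribute to most of the data set with six; with 15 real-world characteristics the effect is far stronger, which is the intuition behind the 99.98 per cent figure.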
How do you protect people’s personal data?
One approach is sampling. Let’s say I have data about a million people and I only release data about 10,000 of them – I give you just 1 per cent of my customer base. The argument is that if you try to find a person, there are 99 other people who could be the one you’re searching for, but whose data you don’t have.
What our model shows is that the incompleteness of the data set is by no means sufficient to preserve people’s privacy. If I give you 1 per cent of the data set, that protects the 99 per cent whose data I didn’t give you, but it offers no protection to the 1 per cent whose data I did give you.
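As a rough illustration of that point – again with made-up data, not the study’s model – the sketch below releases 1 per cent of a toy population and then asks, for each released record, how many people in the full population share its attribute combination. Anyone who is unique in the population gets no protection from the sampling argument.

```python
# Illustrative sketch of the sampling argument (toy data, not the study's model):
# releasing 1 per cent protects the unreleased 99 per cent, but a released
# person who is unique in the full population is still unique.
import random
from collections import Counter

random.seed(1)

population = [
    {
        "age": random.randint(18, 80),
        "gender": random.choice(["F", "M"]),
        "borough": random.choice(["Bronx", "Brooklyn", "Manhattan", "Queens", "Staten Island"]),
        "kids": random.randint(0, 4),
        "dog": random.choice([True, False]),
    }
    for _ in range(10_000)
]

released = random.sample(population, k=len(population) // 100)  # the 1 per cent

attrs = ("age", "gender", "borough", "kids", "dog")
population_counts = Counter(tuple(p[a] for a in attrs) for p in population)

# Among released records, count people who are unique in the *whole* population
# on these attributes: for them, "99 others could match" is simply false.
exposed = sum(1 for r in released
              if population_counts[tuple(r[a] for a in attrs)] == 1)
print(f"{exposed} of {len(released)} released people are unique in the full population")
```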
Why is this a problem?
For close to 30 years, anonymisation has been how we balance using data with preserving people’s privacy. The idea is that if your data is in there, but I don’t know which record is yours, your privacy is preserved.
Looking at a data set, there are a lot of people who are in their 30s, male and living in New York City, so any one of them could be the person I think I have reidentified. However, if I also know the person I’m searching for was born on January 5, drives a red Mazda, has two kids, both of them girls, has one dog and lives in a specific borough of New York City, then I have a pretty good chance of having identified the right person.
The main issue is that anonymisation is supposed to prevent reidentification, and technically it no longer achieves that. Allegedly anonymous data sets are being sold to data brokers. The risk is that data sets that are being shared could be reidentified and reconciled to build increasingly comprehensive profiles of individuals.
It’s really time to rethink the way we approach data protection, and what constitutes truly anonymous data.
Is data reidentification legal?
It is not clear. From a regulatory standpoint, as soon as the data is anonymised, it’s not your data anymore, it is not subject to data protection laws, and you lose all the rights you have over this data. It’s not personal data anymore, so [people] can do whatever they want with it, including sharing it and selling it.
What’s the worst that can happen?
There is quite a large range of examples of datasets that were supposed to be anonymous and have been reidentified. In Australia, people at the University of Melbourne managed to reidentify medical data that were anonymised and then published by the government.
There was one big case in Germany in which lists of websites [people were visiting] were being sold. An add-on was collecting this data and then selling it, very poorly anonymised. A journalist posed as an interested buyer to get a sample of the data and reidentified individuals from it.
How do we fix it?
It’s time to recognise that the tools are not working, and move on to a different range of techniques that will allow us to find a balance between using the data and preserving people’s privacy.
It means increasingly thinking of privacy as information security – the way a company would approach it from a cybersecurity perspective, where you have a range of tools to protect servers, infrastructure and networks.
There are cryptographic techniques being proposed, including, for example, secure multiparty computation and homomorphic encryption.
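To give a flavour of what these techniques look like, here is a toy sketch of additive secret sharing, one of the building blocks of secure multiparty computation. It is illustrative only, not a production protocol, and the party names and values are invented: three parties learn the sum of their private numbers without any of them revealing its own value.

```python
# Toy additive secret sharing (illustrative only, not a production MPC protocol):
# three parties learn the sum of their private values without revealing them.
import random

PRIME = 2**61 - 1  # all arithmetic is done modulo a large prime

def share(value, n_parties):
    """Split `value` into n random shares that sum to it modulo PRIME."""
    shares = [random.randrange(PRIME) for _ in range(n_parties - 1)]
    shares.append((value - sum(shares)) % PRIME)
    return shares

# Hypothetical private inputs held by three different organisations.
private_values = {"party_A": 42, "party_B": 1_000, "party_C": 7}
n = len(private_values)

# Each party splits its value and sends one share to every party (itself included).
all_shares = {name: share(v, n) for name, v in private_values.items()}

# Each party locally sums the shares it received; each partial sum on its own
# looks like a random number and reveals nothing about any individual input.
partial_sums = [sum(shares[i] for shares in all_shares.values()) % PRIME
                for i in range(n)]

# Combining the partial sums reconstructs only the total, never the inputs.
total = sum(partial_sums) % PRIME
print(total)  # 1049
```

Homomorphic encryption aims at a similar outcome by a different route: computations are carried out directly on encrypted data, so the party doing the processing never sees the plaintext.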
I think it’s important to move towards the equivalent of what is called penetration testing in security – reconsidering the risk and consistently testing that the tools that have been developed are still effectively protecting privacy.
Most of this will have to go through regulation. We’ve started to see some enforcement by the UK Information Commissioner’s Office and a couple of fairly large fines regarding data breaches.
Why don’t companies use these tools?
There are some companies that are starting to use these new solutions. But unless you tighten the guidelines on what constitutes truly anonymous data, it is simpler to deidentify the data than to deploy some of those tools and do things properly.
How do we protect our own data?
There are a few things that would be the equivalent of locking your house. There is basic data hygiene, in the sense of being aware of the information you’re giving out, ranging from the questions you choose to answer to the permission settings on your apps.
But honestly – and I know this is an extremely frustrating answer – fundamentally, most of this will have to go through regulation and enforcement.
Journal reference: Nature Communications, DOI: 10.1038/s41467-019-10933-3