Is anonymisation a myth?

We look at new research which claims that people whose information is contained in supposedly anonymised databases can in fact be commonly identified. 19 Nov 2009


A text transcription follows.

This transcript is for anyone with a hearing impairment or who for any other reason cannot listen to the MP3 audio file.

The following is the text spoken by OUT-LAW journalist Matthew Magee.


Hello and welcome to OUT-LAW Radio, where we hope to keep you up to date with the latest news and the most fascinating features from the world of technology law.

My name is Matthew Magee, and this week we talk to a law professor who says that techniques used to supposedly mask our identities in massive databases don't in fact work.

But first, here are some of the top stories from OUT-LAW.COM, where you can read breaking technology law news throughout the week.

Founding investor says YouTube is profitable

And

Government confirms net disconnection law.

YouTube is profitable and has been for 18 months according to the venture capitalist who made the first major investments in both YouTube and its parent company, Google. Google has suggested that the video sharing site has yet to make money.

Michael Moritz is a partner in Sequoia Capital, one of Silicon Valley's most respected venture capital companies. He was a lead investor in its investment in Google in 1999, less than a year after it was founded, and the first major investor in YouTube.

He told the BBC that YouTube has been "very profitable" for the last 18 months. When it announced its most recent financial results in October, Google's chief financial officer Patrick Pichette said that "YouTube is on its path to profitability in the not-too-distant future", a comment which suggests that the company is not yet making a profit.

The Government has confirmed that it will pass legislation allowing for the termination of internet connections used by suspected illegal file sharers but has not yet said whether the action will be subject to independent or Court oversight before it takes place.

The Digital Economy Bill was announced by the Government yesterday and is expected to be published tomorrow. It commits the Government to passing a law introducing disconnections but makes no mention of Court oversight.

A statement from the Prime Minister's office said that the Digital Economy Bill would "tackle widespread copyright infringement via a two-stage process. First by making legal action more effective and educating consumers about copyright on-line. Second through reserve powers, if needed, to introduce technical measures, such as disconnection."

Consumer protection advice organisation Which? said that the plans could lead to the disconnection of innocent internet users.

Those were some of the top stories from this week's OUT-LAW News.


Companies, governments and all sorts of large organisations these days have massive amounts of information about us. Collectively they probably know more about us than we know ourselves.

We have conflicting demands in relation to that information. If someone has gone to the trouble of gathering, sorting and storing it, we want the best use to be made of it. It should be analysed, all the parts of it collected together to spot trends or needs or opportunities for efficiency.

But at the same time we want an organisation to protect our privacy, to make sure that nobody else ever has access to our most secret information.

If an organisation wants to collate lots of people's private information to analyse it they currently use a very simple method: anonymisation. They take out your name or any identifying characteristics from your record in the database and use the rest of the information. That leaves you with safely private but very useful data.

Well, new research suggests that maybe it doesn't, that perhaps anonymisation doesn't actually do what it claims.

University of Colorado Law School Professor Paul Ohm has produced a paper arguing that anonymisation is a failure. Before telling us why, Ohm explained what anonymisation is supposed to do.

Professor Paul Ohm: By anonymisation I mean the kind of persistent belief that by removing pieces of information from a database, like your name, your identification number, your home address, you can preserve or protect the privacy of the people described in the data, and that with the power of anonymisation, the supposed power of anonymisation, you can share the data with anyone you want, you can store the data for as long as you would like. Traditionally it has been kind of a conversation stopper: once you assert anonymisation everyone nods their heads and says ‘that’s fine, privacy is protected here, let’s focus on something else.’

But all may not be as it seems. Increases in the brute computing power available to people and the creation of lots of databases which can be linked or compared may have forever undermined the ability to anonymise large collections of data, said Ohm.

Professor Paul Ohm: The idea here is that even though you are deleting many of the identifying fields of information, everything you leave behind retains identifying power. One of the classic studies was of a record of movies that people had rated on the Netflix movie service here in the United States, and the idea is, if I know two or three or four movies that you have rated and I know approximately when you rated them, I am probably going to be able to identify you uniquely in a pool of hundreds of thousands of other people. I need a rich pool of information that I could then link to these unique records. I need to know something about a person’s movie habits that are attached to their identity and, you know, 20 years ago people would say ‘well that’s fine, we don’t need to worry about that because there isn’t some magical well of information about people’s movie viewing habits’, whereas today we have that magical well, it's called the Internet. I am going to be able to figure out my worst enemy’s past 20 movie viewings and with that I could then link them back to this other database that was supposedly anonymised and in the process figure out every movie they have seen, not just the three or four I already know about.
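
The linking step Ohm describes can be illustrated with a minimal sketch. It is not the method used in the study he mentions. The record IDs, film titles, dates and the 14-day tolerance are all invented for illustration, and `matching_records` is a hypothetical helper. The sketch shows how a few known ratings with approximate dates can narrow a released data set to one record, which then exposes the rest of that person's history.

```python
from datetime import date

# Hypothetical "anonymised" release: names replaced by opaque record IDs.
# All titles and dates are invented for illustration.
RELEASED = {
    "r1": {"Heat": date(2005, 3, 1), "Alien": date(2005, 3, 9),
           "Brazil": date(2005, 6, 2), "Gattaca": date(2005, 7, 4)},
    "r2": {"Heat": date(2005, 3, 2), "Alien": date(2005, 9, 20),
           "Fargo": date(2005, 4, 11)},
    "r3": {"Alien": date(2005, 3, 10), "Brazil": date(2005, 1, 15)},
}


def matching_records(released, known, tolerance_days=14):
    """Return IDs of records consistent with every known (title, approximate date)."""
    matches = []
    for record_id, ratings in released.items():
        if all(
            title in ratings
            and abs((ratings[title] - when).days) <= tolerance_days
            for title, when in known.items()
        ):
            matches.append(record_id)
    return matches


# What an attacker might learn elsewhere (assumed): two films, rough dates.
known_about_target = {"Heat": date(2005, 3, 5), "Alien": date(2005, 3, 12)}

candidates = matching_records(RELEASED, known_about_target)
if len(candidates) == 1:
    # A single match reveals every film in that record, not just the known two.
    full_history = sorted(RELEASED[candidates[0]])
    print(candidates[0], full_history)
```

With only one known film the sketch returns more than one candidate; adding the second film and its rough date is what isolates a single record.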

Ohm said that you can identify people with just a handful of pieces of supposedly anonymised personal information.

Professor Paul Ohm: Well, one researcher in Massachusetts about 15 years ago discovered that 87% of Americans are uniquely identified by three pieces of information: their date of birth, their sex and their zip code. These are just the regional postal codes. It turns out there are something like 20,000 to 40,000 people living in a zip code, and if you had those three pieces of information you could identify virtually everyone in the United States. Now the problem was, until she announced her finding, zip code, birth date and sex were three pieces of information that we presumed were privacy protecting, were anonymised. And so they appeared in all sorts of databases: they appeared in voter rolls that you could buy from your Government for $20, they appeared in health diagnoses databases full of every visit you ever had to the doctor. So these three pieces of information, which it turns out are like skeleton keys that unlock identity, pervaded databases in the mid nineties, and so just by combining those databases you could learn lots about individual people.
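
As a rough illustration of the "skeleton key" idea, the sketch below joins two toy data sets on date of birth, sex and zip code. Every row is invented, and the function names (`quasi_key`, `unique_fraction`, `reidentify`) are hypothetical. It is a sketch of the general technique under those assumptions, not the researcher's actual analysis.

```python
from collections import Counter

# Hypothetical health records with names removed (all rows invented).
HEALTH = [
    {"dob": "1961-07-31", "sex": "F", "zip": "02138", "diagnosis": "asthma"},
    {"dob": "1975-02-14", "sex": "M", "zip": "02138", "diagnosis": "diabetes"},
    {"dob": "1975-02-14", "sex": "M", "zip": "02138", "diagnosis": "migraine"},
    {"dob": "1988-11-03", "sex": "F", "zip": "02139", "diagnosis": "fracture"},
]

# Hypothetical voter roll that carries names alongside the same three fields.
VOTERS = [
    {"name": "Alice Example", "dob": "1961-07-31", "sex": "F", "zip": "02138"},
    {"name": "Bob Example", "dob": "1975-02-14", "sex": "M", "zip": "02138"},
    {"name": "Carol Example", "dob": "1988-11-03", "sex": "F", "zip": "02139"},
]


def quasi_key(row):
    """The three fields that together act as an identifier."""
    return (row["dob"], row["sex"], row["zip"])


def unique_fraction(rows):
    """Share of rows whose (dob, sex, zip) combination appears exactly once."""
    counts = Counter(quasi_key(r) for r in rows)
    return sum(1 for r in rows if counts[quasi_key(r)] == 1) / len(rows)


def reidentify(health, voters):
    """Attach a name to each health record whose key is unique in both sets."""
    health_counts = Counter(quasi_key(r) for r in health)
    voter_counts = Counter(quasi_key(r) for r in voters)
    names = {quasi_key(v): v["name"] for v in voters}
    return {
        names[quasi_key(h)]: h["diagnosis"]
        for h in health
        if health_counts[quasi_key(h)] == 1 and voter_counts[quasi_key(h)] == 1
    }


print(unique_fraction(HEALTH))
print(reidentify(HEALTH, VOTERS))
```

In this toy data the two records that share a key cannot be told apart, which is why the sketch leaves them unnamed; records with a unique combination are linked to a name directly.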

How has this come about? What makes it possible to identify people in anonymised databases now that wasn’t possible before? Well, Ohm said that the increase in the number and complexity of databases is itself a partial cause of the failure of anonymisation.

Professor Paul Ohm: Computers have become more powerful and so you can do computations that you could not do 15 years ago. There is much, much, much more data available on the internet than there used to be. And so those are the two technological changes. And in particular when I talk about more data there is also an increasing tendency to release data publicly and that is where this research has really taken off. If you can get, you know, millions of records of human behavior you can run queries and analysis you did not use to do.

The failure of the anonymisation process is crucially important because we have placed so much faith in it. People have released vast amounts of data in a way that they thought protected individuals' privacy.

Professor Paul Ohm: There is an enormous amount of faith in anonymisation. And this is faith at every level, faith among the technology community, faith among the policy makers, but here is the very interesting thing I noticed. The faith in anonymisation has been embedded in law, and so virtually every law I examined that has to do with protecting privacy contains what I call a “get out of jail free” card for anonymisation. Virtually every privacy law allows you to escape the strictures and the requirements of the privacy law completely once you have anonymised your data, and I think if nothing else what I am arguing in this paper is that every policy maker who has ever encountered a privacy law, and that is in every country on earth, will need to re-examine one of the core assumptions they made when they wrote that law. They need to re-examine whether or not they are giving away too much in the face of anonymisation.

So what can we do? How can we actually make use of data – process it, analyse it, streamline systems and processes based on what we learn from it - without compromising privacy?

Ohm said that it is a very difficult problem to solve. Perhaps, he suggests, we should shift the burden of privacy away from the data itself and on to the people that we allow to see it.

Professor Paul Ohm: It is a very hard problem. It is an extremely hard problem because what makes data re-identifiable is the exact same thing that makes data useful. And so any prescription I can describe will have to require reducing the utility of data. Health privacy is something that I have begun to think a lot about. We can’t trust technology anymore, but at the same time we don’t want to keep this information from researchers. And so my solution is we shift our trust from the technology, from the computers, to the people. We write down the rules of trust among health researchers. Right, we say that researchers trust individual researchers, they don’t necessarily trust every undergraduate in the lab, and then once we have codified kind of these rules of trust we write a bunch of new rules that say we’re going to be like the American National Security Agency. You can get my data but only on a need to know basis. You have to sign a binding document that says you are going to obey my privacy rules. You have to install a piece of software that will watch you as you use this data to make sure that you are not using it more than you need to. And so on the one hand this will stifle health research because it will make it more difficult to work with individual databases. On the other hand it will improve health research because currently health researchers are often deprived of pieces of information like the patient’s birthday or the patient’s home address, because we think that is providing privacy. I want to give them more data but under tighter control.


That's all we have time for this week, thanks for listening. Why not get in touch with OUT-LAW Radio? Do you know of a technology law story? We'd love to hear from you on radio@out-law.com. Make sure you tune in next week; for now, goodbye.