One of the most misunderstood topics in privacy is what it means to provide “anonymous” access to data. One often hears references to “hashing” as a way of rendering data anonymous. As it turns out, hashing is vastly overrated as an “anonymization” technique. In this post, I’ll talk about what hashing is, and why it often fails to provide effective anonymity.
What is hashing anyway? What we’re talking about is technically called a “cryptographic hash function” (or, to super hardcore theory nerds, a randomly chosen member of a pseudorandom function family–but I digress). I’ll just call it a “hash” for short. A hash is a mathematical function: you give it an input value and the function thinks for a while and then emits an output value; and the same input always yields the same output. What makes a hash special is that it is as unpredictable as a mathematical function can be–it is designed so that there is no rhyme or reason to its behavior, except for the iron rule that the same input always yields the same output. (In this post I’ll use a hash called SHA-1.)
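To make the “same input, same output” rule concrete, here is a minimal sketch using Python’s standard hashlib module. The helper name sha1_hex and the sample strings are made up for illustration; they are not real SSNs.

```python
import hashlib


def sha1_hex(text: str) -> str:
    """Return the SHA-1 digest of a text string as 40 hex characters."""
    return hashlib.sha1(text.encode("utf-8")).hexdigest()


# Made-up sample values, not real SSNs.
first = sha1_hex("123-45-6789")
second = sha1_hex("123-45-6789")
other = sha1_hex("123-45-6780")  # differs from the first only in the last digit

print(first == second)  # the same input always yields the same output
print(first == other)   # a tiny change in the input yields a different digest
```

The two digests for nearly identical inputs share no visible pattern, which is the “no rhyme or reason” property described above.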
With that out of the way, let’s consider whether hashing a Social Security Number renders it “anonymous”. If you hash my SSN, the result is b0254c86634ff9d0800561732049ce09a2d003e1. (Let’s call this the “b02 value” for short.) That looks nothing like my SSN–but that in itself does not make the value “anonymous”. The key question is whether a person who gets the b02 value can figure out what my SSN is.
How might an analyst who has the b02 value try to determine my SSN? One approach that doesn’t work is to try to run the hash function backward–or as a mathematician would say, to find its inverse. Many functions can be run backward. Consider the function that adds 17 to its input. To run that function backward, you just subtract 17. The hash has an inverse (of a sort) but nobody knows what it is, and as far as anyone knows it is not feasible to find the inverse. So a smart analyst will give up on the invert-the-hash approach.
But there is another trick available to the analyst–and this trick will work. The analyst simply guesses my SSN–he enumerates all of the possible nine-digit SSNs and hashes each one. When he hashes my correct SSN, the result will be equal to the b02 value, so he will know that he guessed right. You might think it would take a long time to run through all of the possible SSNs, but computers are very fast–there are “only” one billion possible SSNs, so your laptop can hash all of them in less time than it takes you to get a cup of coffee.
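The guessing attack can be sketched in a few lines of Python. This is an illustration under stated assumptions: the function name guess_by_enumeration and its digits parameter are hypothetical, the identifier is assumed to be hashed as a plain string of digits with no separators, and the demo uses a four-digit space so it finishes instantly. The same loop with digits=9 covers the full one billion candidates; an interpreted loop like this one is slower than a native implementation, but the work involved is the same.

```python
import hashlib
from typing import Optional


def sha1_hex(text: str) -> str:
    """Return the SHA-1 digest of a text string as 40 hex characters."""
    return hashlib.sha1(text.encode("utf-8")).hexdigest()


def guess_by_enumeration(target_hash: str, digits: int = 9) -> Optional[str]:
    """Hash every zero-padded number with `digits` digits until one matches."""
    for n in range(10 ** digits):
        candidate = f"{n:0{digits}d}"
        if sha1_hex(candidate) == target_hash:
            return candidate
    return None  # the target was not the hash of any candidate


# Demo on a deliberately tiny space (10,000 candidates instead of one billion).
secret = "0427"            # made-up value standing in for an SSN
leaked = sha1_hex(secret)  # the only thing the analyst gets to see
recovered = guess_by_enumeration(leaked, digits=4)
print(recovered == secret)
```

Nothing here inverts the hash; the analyst only runs it forward on every possible input and compares.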
A clever analyst would do it even faster. He would hash all of the possible SSNs in advance, and build an index that allowed him to recover the SSN from its corresponding hash value in the blink of an eye. Hashing the SSN would offer no protection at all against an analyst who had built such an index.
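A precomputed index of this kind is just a lookup table from hash value back to input. The sketch below builds one with a Python dictionary; build_index is a hypothetical name, and the demo again uses four-digit values so the table stays small. A table for all nine-digit values would need far more memory or an on-disk store, but the idea is the same.

```python
import hashlib
from typing import Dict


def sha1_hex(text: str) -> str:
    """Return the SHA-1 digest of a text string as 40 hex characters."""
    return hashlib.sha1(text.encode("utf-8")).hexdigest()


def build_index(digits: int) -> Dict[str, str]:
    """Map the hash of every zero-padded `digits`-digit number back to the number."""
    index = {}
    for n in range(10 ** digits):
        candidate = f"{n:0{digits}d}"
        index[sha1_hex(candidate)] = candidate
    return index


# Pay the enumeration cost once, up front...
index = build_index(4)

# ...then every later recovery is a single dictionary lookup.
leaked = sha1_hex("9031")  # made-up value standing in for a hashed SSN
print(index.get(leaked))
```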
It should be clear by this point that hashing an SSN does not render it anonymous. The same is true for any data field, unless it is much, much, much harder to guess than an SSN–and bear in mind that in practice the analyst who is doing the guessing might have access to other information about the person in question, to help guide his guessing.
Does this mean that hashing always fails, and is never a good way to scrub data? Almost, but not quite. There are more advanced uses of hashing that can offer some protection in some settings. But the casual assumption that hashing is sufficient to anonymize data is risky at best, and usually wrong.
[In case you’re wondering, the b02 value is not really the hash of my SSN. It is the hash of the text string “my SSN”. There is no way I would publish the hash of my actual SSN.]
Note: This blog post and its comments were reposted from the former Tech @ FTC blog hosted on a third party site.
April 23, 2012 at 12:57 pm
This is why we use salts. In theory the analyst would need to enumerate all possible SSNs for all possible salts.
April 23, 2012 at 2:57 pm
A salt does provide a strong guarantee, if you (a) choose the salt in a strong random fashion, and (b) keep the salt secret so the analyst can’t get it.
April 23, 2012 at 11:28 pm
And, of course, if the salt is long enough. We really need a better word than “salt” for this usage, since in other contexts (salted password verification, for example), the salt value is not necessarily secret and is typically stored with the hashed result.
April 23, 2012 at 11:44 pm
Hashing can also interfere with the transparency of identifiers used to track users. If, rather than a unique identifier like an iPhone UDID or an Ethernet MAC address, a hashed value is used instead, this can be much harder for researchers to detect. The hashed value can also be salted (as above) or combined with a reversible function to allow multiple colluding parties to track the user in a way that is difficult to detect in traffic analysis. I’m quite concerned that many proposals for replacing the deprecated iPhone UDID may make tracking much less transparent.
April 24, 2012 at 7:15 pm
Ugh, no. Salting only means you can’t precalculate the one billion hashes associated with the one billion possible SSNs. It doesn’t mean you can’t brute force a particular SSN, given the salt, with a 2^30 (small) work effort.
It’s not meaningful to discuss salts being kept secret. The whole concept of a salt is that it’s in the clear next to the hash; an attacker able to do an offline attack against hashes is effectively by definition able to do offline attacks against salted hashes too. I wrote a decent amount about this here: http://dankaminsky.com/2012/01/05/salt-the-fries-some-notes-on-password-complexity/
April 24, 2012 at 9:07 pm
Dan: good point. I shouldn’t have used the term “salt”. What you need is to hash the SSN along with a suitably generated secret, and then protect the secret. Then the hashed values actually will be inscrutable to anyone who doesn’t know the secret.
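The difference between a public salt and a protected secret can be sketched as follows. This is an illustration only: the thread does not name a construction, so HMAC-SHA256 from Python’s standard hmac module is used here as one common way to hash a value together with a secret, and every function name is hypothetical. The demo uses four-digit values to keep it fast.

```python
import hashlib
import hmac
import secrets
from typing import Optional


def salted_sha1(salt: bytes, value: str) -> str:
    """Hash a value with a salt that is stored in the clear next to the result."""
    return hashlib.sha1(salt + value.encode("utf-8")).hexdigest()


def keyed_hash(key: bytes, value: str) -> str:
    """Hash a value together with a secret key (HMAC-SHA256, chosen for this sketch)."""
    return hmac.new(key, value.encode("utf-8"), hashlib.sha256).hexdigest()


def enumerate_salted(target: str, salt: bytes, digits: int) -> Optional[str]:
    """Brute-force a salted hash when the salt is known to the analyst."""
    for n in range(10 ** digits):
        candidate = f"{n:0{digits}d}"
        if salted_sha1(salt, candidate) == target:
            return candidate
    return None


# Case 1: public salt. A prebuilt index is useless, but enumeration still works.
salt = secrets.token_bytes(16)
leaked = salted_sha1(salt, "5810")
recovered = enumerate_salted(leaked, salt, digits=4)
print(recovered)

# Case 2: secret key. An analyst guessing with the wrong key finds no match.
key = secrets.token_bytes(32)        # must be generated randomly and protected
token = keyed_hash(key, "5810")
wrong_key = secrets.token_bytes(32)  # stands in for an analyst without the key
found_without_key = any(
    hmac.compare_digest(keyed_hash(wrong_key, f"{n:04d}"), token)
    for n in range(10 ** 4)
)
print(found_without_key)
```

The protection in the second case rests entirely on the key staying secret; anyone who obtains it can run the same enumeration as in the first case.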
May 15, 2012 at 5:08 pm
I’m wondering if the discussion here takes enough account of some use cases that may be relevant. Seems to me that Ed is correct that the point is to have “a suitably generated secret, and then protect the secret.” In practice in industry, there may be lots of employees who see a hashed SSN. There is a salt that is unknown to those employees. For attacks from these insiders, a hashed SSN and a secret hash would seem to provide important protection against re-identification.
May 16, 2012 at 10:11 am
While hashing PII of fixed length like an SSN or a phone # may not render it anonymous, do you agree that hashing an email address of variable length could come close to anonymity due to the number of permutations?
May 22, 2012 at 8:19 am
I agree that the scenario you describe is important to get right. But I don’t think hashing is the solution. If an employee shouldn’t be seeing the SSN, then don’t show them the SSN. There’s not much point to showing them a hashed SSN–all it does is distract the employee and occupy screen space, without conveying any useful information to the employee.
September 29, 2012 at 10:22 am
Howdy, just wanted to give you a quick heads up. The text in your post seems to be running off the screen in Safari. I’m not sure if this is a formatting issue or something to do with web browser compatibility but I figured I’d post to let you know. The style and design look great though! Hope you get the problem fixed soon.
October 2, 2012 at 5:32 pm
Norman: Hmm — I don’t see any such problem when I view the page on a Mac. Are you using mobile Safari? A window width problem, perhaps?
November 28, 2012 at 6:28 am
Hello there, just became aware of your blog through Google, and found that it is really informative. I am gonna watch out for brussels. I will be grateful if you continue this in future. A lot of people will be benefited from your writing. Cheers!
November 29, 2012 at 7:55 am
I was curious if you ever considered changing the structure of your blog? It’s very well written; I love what you’ve got to say. But maybe you could add a little more in the way of content so people could connect with it better. You’ve got an awful lot of text for only having one or two images.
Maybe you could space it out better?
December 12, 2012 at 1:13 pm
Thanks for the observation. The answer has two parts. First, in September I succeeded Ed Felten as Chief Technologist, and took over the blog; I can only comment on the structure of posts since then. Of course, my posts don’t have much in the way of pictures and diagrams, either, but that’s because I don’t tend to think in those terms. I should, and I’ll try to do better, but you’re not the first to make that observation about my writing… –Steve Bellovin