A strange tale of runaway statutory interpretation
I get the intellectual privacy concern, but people need to realize we lost that war without a fight over 20 years ago. All your data is easily acquired by anyone who wants it. There is nothing you can do about it, unless you are willing to live in a cave. And even then, wait till the satellite cameras get better resolution....
No doubt, this is bad for people who really do need protections - such as people trying to flee abusive ex-spouses, people in witness protection, etc. But being mad about that is like being mad at an asteroid coming at you from space... be mad all you want, but the asteroid doesn't care about your feelings.
"privacy concerns" are a mind virus that infects all of the most neurotic, low-trust people in this country. they have been a real blight to the discourse on all kinds of issues. never once have I heard an American do a sky-is-falling routine about how some thing is going to destroy their privacy and then have that thing actually turn out to be bad. it's ok for people to know things about you. they won't use it to hurt you. see a therapist. etc.
While they're at it, could the Census Bureau also give everyone a free ID, and solve the voter ID controversy once and for all, as well as alleviate a lot of the other hardships people encounter when they don't have a valid ID? Issuing national IDs would be a reasonable method of implementing the Constitutional mandate to maintain an "actual enumeration" of every person in the country.
The Baileys don't give a shit about the Census. Changing it just risks drawing their attention to it and scaring them.
I did a lot of research on this for my MPP a few months ago (and was saved a lot of time by the links in Matt's last piece about this, so thank you Matt!). I came away with the suspicion that this is in large part a hobby horse of John Abowd, economist and Chief Scientist for the Census Bureau. That's very speculative on my part - I'm an onlooker, not someone intimately involved in the issue, so everyone reading this should very much take what I say below with a grain of salt - but here's what made me get that impression.
If you listen to how the Bureau talks about differential privacy, it doesn’t present itself as aiming for some legal threshold. Instead, it talks about trying to optimize a bundle of data usability and data privacy based on Americans’ preferences.
John Abowd has published several papers arguing differential privacy allows us to quantify the trade-off between privacy and usability. He thinks this is great, because if we connect it to survey data on how much people value privacy and usability, it lets us find the bundle that maximizes utility.
Most people, including me, think this is BS. Differential privacy only lets you quantify the risk of exact attribute disclosure (the risk of correctly deducing the value of a specific data point – say, the age of a specific person in the microdata). It has nothing to say about identity disclosure risk (the risk of correct attributes being connected to a person's PII using a third-party database). And identity disclosure is (1) the only type of privacy people actually care about, and (2) the only type of privacy Title 13 (the law that says the Bureau can't release PII) cares about. After all, the Bureau has been releasing exact attributes in microdata for a while now!
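To make concrete what DP does quantify, here's a toy sketch (my own illustration with made-up numbers, not the Bureau's actual mechanism): for any released noisy count, an attacker's likelihood ratio between "the target person is in this count" and "they aren't" is bounded by e^ε. That bound is exactly the attribute-disclosure guarantee – and note there's nothing in it about linking attributes to a name.

```python
import math

def laplace_pdf(y, loc, scale):
    """Density of the Laplace distribution used by the classic DP mechanism."""
    return math.exp(-abs(y - loc) / scale) / (2 * scale)

eps = 0.5            # the privacy-loss "knob"
sensitivity = 1.0    # adding/removing one person changes a count by at most 1
scale = sensitivity / eps

count_with_target = 42     # hypothetical table cell with the target person included
count_without_target = 41  # the same cell with them removed

# For ANY released value y, the attacker's likelihood ratio between the two
# worlds is bounded by e^eps -- that is the quantified attribute-disclosure risk.
worst_ratio = max(
    laplace_pdf(y, count_with_target, scale) / laplace_pdf(y, count_without_target, scale)
    for y in [i / 4 for i in range(0, 400)]  # grid of candidate outputs 0..99.75
)
print(worst_ratio <= math.exp(eps) + 1e-9)  # True
```

The bound holds no matter what the attacker sees, which is why DP can put a number on attribute disclosure – but the number says nothing about a linkage attack against a third-party database.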
But the bureau still uses the optimization framework in its public communications. It gives me the impression that John Abowd is just really caught up in his research angle. Again, that’s very much speculation on my part, but I don’t think it’s unreasonable. Abowd has a weird interpretation of Title 13 - he believes that "re-identification risk is only one part of the Census Bureau’s statutory obligation to protect confidentiality. The statute also requires protection against exact attribute disclosure.” I haven't been able to find *anybody* else who thinks this, so I can't help but wonder if it's motivated reasoning.
This is all very reasonable, but John Abowd at Cornell Economics led the Census Bureau's charge on this, and while he may be wrong on this issue, he's neither evil nor stupid. For all the controversy and criticism about this change, I'm shocked that there's no "John Abowd faces his critics" type interview anywhere. Presumably he has SOME responses to these critiques, but I feel like the conversation never directly engages both sides of the argument and people kind of just talk past each other.
This is a fascinating topic for me to discuss because while I was in applied math grad school at MIT, my research was tangentially related to the academic discipline of differential privacy, which is generally seen as one of the hot up-and-coming fields in computer science. It is genuinely exciting for many of those researchers to be able to see real-world applications on something as high profile as the Census, and while I'm not personally motivated by privacy concerns, I sympathize with the attempt to try to quantitatively balance privacy and accuracy.
One of the things that I would emphasize in this discussion is that all of the various knobs in these algorithms are adjustable. If they're indeed overvaluing privacy as you claim, they can always turn the amount of noise down to produce generally more accurate counts the next time around. If the adjustment algorithms prioritize accuracy for the wrong counts, they can be tuned to get the counts that matter more right. Just as you're questioning the value of the privacy offered, one could also question the value of exact accuracy, and perhaps attempt to offer an actual cost-benefit analysis. I fully expect academics to dissect this a thousand ways and be able to come to some sort of general consensus on the different options and their tradeoffs over time.
I'd also just emphasize how new all of this is. DP as an academic discipline really only got kicked off with a 2006 paper establishing how noise injection leads to privacy. There's been a lot of academic work since then, but handling the complexities of applying this to real-world data is just getting started.
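The core mechanism from that 2006 line of work is simple enough to sketch in a few lines – this is my own toy version, not anything resembling the Bureau's production code. Noise drawn from a Laplace distribution with scale 1/ε gets added to each count, and ε is literally the adjustable knob I mentioned: turn it up and accuracy improves at the cost of privacy.

```python
import math
import random
import statistics

def laplace_noise(scale, rng):
    # inverse-CDF sample from Laplace(0, scale)
    u = rng.random() - 0.5
    return -scale * math.copysign(math.log(1 - 2 * abs(u)), u)

def noisy_count(true_count, eps, rng, sensitivity=1.0):
    """Release a count with Laplace noise calibrated to eps (the privacy knob)."""
    return true_count + laplace_noise(sensitivity / eps, rng)

rng = random.Random(0)
true_count = 1000  # hypothetical published count

# Mean absolute error shrinks as eps grows (less privacy, more accuracy):
errors = {
    eps: statistics.mean(abs(noisy_count(true_count, eps, rng) - true_count)
                         for _ in range(20000))
    for eps in (0.1, 1.0, 10.0)
}
print(errors[0.1] > errors[1.0] > errors[10.0])  # True
```

The expected absolute error works out to exactly 1/ε here, which is why "just turn the noise down next time" is a perfectly coherent policy response.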
I think it's time to retire the tired trope that private companies know everything about us when the evidence given is some case of a teenage pregnancy back in 2014. If that's all the evidence we can muster, well, then you know . . . My experience is that company tracking is pretty dumb. It's the bounteous ads I get because *I just bought the damn thing they're advertising.* Or you google something once and you get obnoxious ads peppering your screen for it for months, until you swear you will never use that service ever (looking at you, zipcar. com)
As for the real substance of this post, I'd say my bias is toward more protection of personal information from our government than less, but I admit that that's not a slam dunk position of mine. Like so much of life, unlike my hatred of zipcar. com ads, it's complicated.
think about how much collective time has been wasted by all of us clicking "Allow Cookies"
like, who exactly are these privacy nazis? Who is it that cares this much?
Hi All. I run a privacy think tank (FPF.org) - a pragmatic, centrist think tank focused on helping support responsible uses of data, working with the Chief Privacy Officers of many data-driven companies, researchers, cities and schools. I am a data optimist and enthusiast, not an overly cautious fanatic. I have closely followed this issue. My informed view:

1) The actual privacy risk of releasing large overlapping sets of data is real. The Census will never be trusted if we release data as in the past and it is used to disclose details people consider confidential.

2) For many years, the Census has added noise to data to support de-identification. Many users of these data sets haven't been aware, or simply relied on the final data. Now the Census is transparently explaining the techniques used.

3) Changing the way data is released is painful, due to the transitions that will need to be made by those relying on the previous techniques, and the tradeoffs required by the protections needed.

4) Differential privacy is currently the most sophisticated way to assess techniques that add some noise to data in a manner that maintains accuracy at aggregated levels.

My interview with one of the "inventors" of differential privacy can be viewed here: https://www.linkedin.com/video/live/urn:li:ugcPost:6783434965402075136/
Here's a partial solution to the issue - as well as a slight correction on Matt’s description of noise at larger geographies versus smaller ones.
The biggest distortions don’t come from differential privacy on its own. The noise added is pretty small, and is normally distributed, which is good.

The worst distortions come from modifications made to correct for problems the normally-distributed noise causes:
(1) Values that look weird – say, a Census Block with negative 3 houses, or a county with 800.354 people – are changed to look normal.
(2) Data for different levels are modified so all levels line up (for example, the population of states might be modified to sum to the population of the country). They have to do this because the noise gets added to summary data at every level, making different levels not cohere with one another. They start modifying at the top and work down, which is why the smallest geographies are the worst - they're affected by all the modifications at the levels above them.
A bunch of researchers have suggested the Bureau also release data with just the original normally distributed noise - not the corrections for weird values and for alignments across levels. Last I heard, the Bureau wasn’t considering this, which is super weird - the data with just the normally distributed noise is still differentially private, and would be *way* more useful to researchers.
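A quick simulation (my own toy numbers, nothing like the Bureau's actual parameters) shows why the corrections, not the noise, do the damage: symmetric noise is unbiased on average, but "fixing" impossible negative counts by clipping them to zero systematically inflates small blocks.

```python
import random
import statistics

rng = random.Random(0)
sigma = 3.0
true_counts = [0] * 5000  # many census blocks genuinely have 0 of some group

noisy = [rng.gauss(t, sigma) for t in true_counts]  # raw normally-distributed noise
clipped = [max(0.0, x) for x in noisy]              # "fix" the impossible negatives

print(abs(statistics.mean(noisy)) < 0.2)  # True: the noise alone is unbiased
print(statistics.mean(clipped) > 1.0)     # True: clipping inflates every zero block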
I'm concerned broadly about privacy. I don't know what's been done with the data in the OPM hack, but the fact that it's never shown up on the dark web for sale concerns me. The fact that we're walking around with devices that track what we're reading and our physical locations, the increasing availability of compute power (and better algorithms) that makes it easier to solve image recognition problems -- I don't think people have caught up in terms of their behavior to this reality, although I'm probably less-concerned about large-scale releases of data and more concerned about foreign governments and other organized groups targeting specific individuals. Anyway, I have privacy concerns.
But I have trouble thinking of a scenario in which I'd go for Census data if I were trying to make money/blackmail people/generally sow chaos. I'm not saying there isn't one, but I'd like to see a possible list on this. Even if there isn't really anything nefarious about it, you could still want to lock it down -- if someone does a successful re-identification attack here, Americans are going to be concerned and that threatens Census, even if there aren't really large-scale negative effects in other ways.

And Census has the advantage of being really big. But the richer data sets like ACS aren't that huge. And if you're putting in the time and computing power, presumably you can go after something like campaign finance donations, house sales, maybe that LinkedIn data that got scraped. Make a website with Trump supporters who work in certain industries or something. Or if you're trying to find rich people to target, there's housing sales and maybe LLC records, and just tons of other public data like who is donating money to various places. If you're trying to find really personal data on people, there's probably a dating app with APIs that aren't as secure as they could be, or something.

Maybe if you're trying to get data on immigrants, or people getting certain public services, you could get that in Census and not a lot of other places. But you can probably do a lot of that with just zip code, and no one seems to be doing that now.
Basically, after decades of abuse, government scientists and administrators know that Republicans are itching to shut them down or destroy their work and that no one will protect them. Being in a permanent defensive crouch makes for bad decisions.
On one side, there’s a bunch of political scientists and economists who think they have the Constitutional right to have the taxpayers do data collection for them, and will now have a slightly harder time publishing papers that “discover” “causal effects”.
On the other side, there’s a bunch of theoretical computer scientists who, in the tradition of their field, have a very paranoid threat model and always focus on the worst possible outcome. It’s pretty rare for theoretical CS to have any real impact and they’re not going to give up on that opportunity.
I’m biased towards the computer scientists here, although the anti-differential-privacy crowd’s arguments seem stronger now than a year or two ago (as the two sides come to understand each other better).
If the non-noisy data is released, someone will do the database reconstruction attack, and it will become available on the Internet. It won’t be hypothetical for long. I don’t actually have a sense of how bad this would be in practice.
I still don’t understand what the bloody use case is. The government has all this data on every 1040 I’ve submitted. Google has or can parse it all together, as can at least Amazon and Netflix.
Probably every one of those “look up people’s public records” sites has 90% of the census data and a bunch more besides.
What is this protecting us from?
Another example of making policy without cost benefit analysis.
My related beef is that medical communication has to be conducted by clunky "portals" rather than old-fashioned email. What is the expected value of the harm of someone hacking my email and learning my PSA at the same time I do? Does it exceed the cost of establishing and using the "portal"? Much communication with the government suffers the same problem. Why does it need to be more difficult to look up something from my Social Security account than to buy something from Amazon?