132 Comments

I get the intellectual privacy concern, but people need to realize we lost that war without a fight over 20 years ago. All your data is easily acquired by anyone who wants it. There is nothing you can do about it, unless you are willing to live in a cave. And even then, wait til the satellite cameras get better resolution....

No doubt, this is bad for people who really do need protections - such as people trying to flee abusive ex spouses, people in witness protection, etc. But being mad about that is like being mad at an asteroid coming at you from space... be mad all you want, but the asteroid doesn't care about your feelings.

Comment deleted

Hi there! European living in the US here. With the exception of American healthcare, I can think of few things more stupid than GDPR. When I’m home, I frequently use an American VPN to browse the internet without all the pop ups. I understand that we wanted to hurt US tech companies, but we should have found a more user friendly way.


I wouldn't hold my breath. They've just about wrangled Microsoft now about bundling, only 10 years after it was 10 years too late. Google and Apple are going to have so much data just from being the mobile providers that it's almost beside the point now.

Comment deleted

Comment deleted

For big tech firms, it's a mixed bag as it makes operating more complex which basically disadvantages small and medium firms. CCPA is more stringent than GDPR in general but neither is incredibly impactful to Facebook or Google. They are already very good at getting their users to agree to [whatever] using features and network leverage.

tl;dr Google and Facebook are not much harmed


On what basis do you view the CCPA as "more stringent" than the GDPR?

Jan 27, 2022·edited Jan 27, 2022

Having led projects to comply with both across multiple products. It's generally been sufficient to focus on CCPA first, and then the product is mostly GDPR compliant as a consequence (gap analysis reveals very little effort needed to bring it in line). That's conventional wisdom AFAIK, too. Do you have counter-examples? I'm sorry, it's not a discussion interesting enough to me to cite sources in the comments about. Basing it on multiple personal experiences and similar anecdotes from professional sources I trust.

Comment deleted

Given even odds, I'd put money on click-wrap licensing enduring in the US.

Also, the environment that allowed that cadre of small-town lawyers to target false advertising has shifted significantly. There's a massive pile of services and experiences that case law suggests should be WCAG 2.0 compliant in order to comply with the ADA but are not. There are some lawsuits filed over it but the quantity and quality of suits and outcomes is, at least for me, unsatisfying.

Comment deleted

If big companies are going to have access to this information anyway, then why not make it available to the people who will do public good with this information too?


The resources, time, etc. for a company to get this data are literally so low that you might as well consider it a rounding error.

I would argue against doing a census due to how inefficient it is, since you could easily get the same data, at higher quality, through brokers. The census is largely a legal and political matter of how you count, not what you can actually know.


Well, it's not free for you, me, or a company. We pay taxes.

(I'm sure we will all be paying less taxes now that we're getting a worse product, right?)


Nope. More. It was more expensive to do this.


I'm not sure where the "nope" part comes in. I didn't make any claims about the magnitude of the price.

Jan 26, 2022·edited Jan 26, 2022

> I'm sure we will all be paying less taxes now that we're getting a worse product, right?

You asked a (likely) sarcastic question. I answered "Nope" and elaborated. I'm not sure how that flow was hard to understand.


Oh, I didn't realize you were responding to a parenthetical joke aside rather than disagreeing with my comment.

Comment deleted

You specifically said it's available to a company for free.

Comment deleted

But it's not free. They all paid taxes for it.

Comment deleted

SSNs are already easy to guess. Probably don't even need to post it.

https://www.science.org/content/article/social-security-numbers-are-easy-guess
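The linked study's core point, as I remember it, is simple arithmetic (the digit counts below are my recollection of the paper, not something I've re-verified): for pre-2011 SSNs, the first five digits were largely predictable from birth state and date, which collapses the search space:

```python
# Pre-2011 SSN structure: 3-digit area + 2-digit group + 4-digit serial.
# If area and group are predictable from birth state and date (the
# study's finding), only the serial is left to guess.
naive_space = 10 ** 9                  # all nine-digit numbers
predictable_digits = 5                 # area (3) + group (2)
remaining = 10 ** (9 - predictable_digits)
print(remaining)  # 10000 candidate serials
```

Ten thousand guesses is small enough to try against any service that uses SSN as an authenticator.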


I’m now officially grateful my mom lost my card when I was a kid and she had to get me a new SSN.


So are you saying it's not true that that info is available? I think it's likely true that it is available, but only available with effort and only available on average. No guarantees that you can get it for any specific person.

Thus I'm not willing to increase the chances to 100 percent that my info is obtainable.

Comment deleted

Good luck with that.


Okay, but you seem to be under the impression that you were contradicting the root comment. Being able to lower information exposure doesn't do so.

Comment deleted

1. Well, if the information is already out there, the damage is already done. Nothing to be done about it, so you *should* stop fighting.

I'm not confident that the information is or is not already out there, but what we have here is a disagreement on an empirical fact. Someone just has to figure it out.

2. I do not think the root comment made this claim.


Association of my real-world identity with my internet opinions != disclosure that an individual with my real-world properties exists.

Comment deleted

There are things we do in private, things we do in public, and things we do with specific groups of people. I think it's bad to invade private behavior, and extremely bad to leak across life-compartments (work vs. friends vs. family vs. internet), but the kind of stuff that's included in the census or the phone book is neither of those. It's just public.


"privacy concerns" are a mind virus that infects all of the most neurotic, low-trust people in this country. they have been a real blight to the discourse on all kinds of issues. never once have I heard an American do a sky-is-falling routine about how some thing is going to destroy their privacy and then have that thing actually turn out to be bad. it's ok for people to know things about you. they won't use it to hurt you. see a therapist. etc.


There are historic cases we should be concerned about, like Uber employees tracking the locations of their ex-girlfriends in real time. Those guys should be in jail and were rightly fired.

The census crap is part of a long tradition of terrible ideas coming out of academic ethics studies, because the reasonable ideas are obvious and won't get anyone tenure. The ideas then got popular in social forums where no cost-benefit analysis is applied and tada!


Do you think Germans have this worse or better? They seem to care a lot more about privacy and do things like ban Google street view and pass GDPR. I can’t quite tell if they’re doing things with better cost/benefit ratio, but they’re doing bigger things.

Comment deleted

Do you understand how many millions of deaths medical privacy is responsible for? I literally can't get my child vaccinated right now because of medical fucking privacy making it so insanely expensive and difficult to do clinical studies.

We have all kinds of information about what works and what doesn't. It's all locked away and useless. Even the British national health service has only in the last year started to actually use their vast stores of data to advance medicine, and it took covid to get them going.

Comment deleted

This feels like I'm stalking your comments, but I swear I'm just reading the comments here and you replied to a lot of other comments!

At what point do you reverse your opinion? Like, if automatic opt-in saves, on net, a thousand lives? 1 million lives? 100 million? 1 billion?

Maybe you question the basis of the question. Maybe automatic opt-in doesn't save any lives! This is another empirical question we should maybe investigate before we commit to anything.

(I will also note that in another comment you said you were fine with opt-out...which is a position I think more people would agree with you on)

Comment deleted

> That being said, I think you're arguing from utilitarianism of a sort, correct?

Utilitarianism is a way to frame it. I'm not specifically arguing from any framework, I'm just saying that no matter the framework you're using, realizing that you're making a tradeoff is important. Once you've acknowledged and internalized that fact, then you can decide what you want to do with it.

Maybe to you privacy is such an absolute right that it doesn't matter the outcomes. Maybe you decide that actually, the outcomes do matter so lets actually figure out what the outcomes are.

It seems to me that knowing the tradeoffs is fundamental to figuring out what actions to take.


I'd be happy to post mine! I take sertraline 75mg daily for depression/anxiety and have minor electric issues with my heart. I lost my sense of smell as a child as a result of a severe infection. That's about it.

I feel absolutely no risk from anyone knowing this information. Now, I understand that some things really are sensitive (abortion history, some surgical procedures, etc.) in that the probability of someone being able to use that information against you in a material way is actually somewhat significant. It makes more sense to enforce privacy protections on a limited, case-by-case basis for those areas than to adopt a wide-ranging ethos of "privacy everywhere all the time even at cost".


This is amazing thank you

Comment deleted

If your goal is to get me to admit that there is some line at which privacy becomes important, I agree! There is some line at which it becomes important, which is long past the line of "a dedicated person can reverse-engineer your census answers". But then again, I do have several cameras in my home which are connected to the internet (webcams) as well as a microphone that is absolutely spying on me (Alexa) and it doesn't really bother me.

The uncertainty you have in believing me is probably around the same as the uncertainty any attacker would have in a database reconstruction attack -- less, in fact. Database reconstruction is not foolproof. Actually, it's pretty bad when it comes to trying to get information about a single individual. Beyond that, like many people do already, I could've simply falsified some of my census responses just like I can falsify my description of my medical history. I didn't, since I don't have the zero-trust mindset. But I could've! In fact, I bet I could relatively easily create doctored medical records as well, if for some reason I wanted to do that.

Point being, you are drastically overestimating the risk from this. You seem to think that if there's any risk of a person's privacy being compromised, it is much the same as them losing all of their privacy in all situations and being exposed to the world. In reality, it's more like with dedicated effort a very well-resourced person could identify your race, address, marital status, and other unimportant information with ~40% accuracy.
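To put a number on "dedicated effort": the reconstruction worry is that many published tables jointly constrain the microdata. Here's a toy brute force I made up (hypothetical numbers, nothing to do with the Bureau's actual tables) showing what even two aggregate statistics, a mean and a median age for a 3-person block, do to ~65,000 possible age combinations:

```python
from itertools import combinations_with_replacement

def consistent_age_sets(n, mean_age, median_age, ages=range(18, 90)):
    """Every non-decreasing n-tuple of ages matching the published
    mean and median -- a brute-force database reconstruction."""
    target = round(mean_age * n)
    return [c for c in combinations_with_replacement(ages, n)
            if sum(c) == target and c[n // 2] == median_age]

# Hypothetical block: 3 residents, published mean age 30, median 28
matches = consistent_age_sets(3, 30, 28)
print(len(matches))  # 11 survivors out of 64,824 possible combinations
```

Note that this cuts both ways: each extra published table intersects the candidate set further, but even an exact reconstruction only yields attributes; linking them to a name still requires an outside database, which is the identity-disclosure step.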

Comment deleted

No, it frustrates me that they expect our government and institutions to be worse at their jobs in deference to their zero-trust personality.


While they're at it, could the Census Bureau also give everyone a free ID, and solve the voter ID controversy once and for all, as well as alleviate a lot of the other hardships people encounter when they don't have a valid ID? Issuing national IDs would be a reasonable method of implementing the Constitutional mandate to maintain an "actual enumeration" of every person in the country.


If you think there are a lot of tinfoil-hat-wearing nutters in the antivax crowd, you just wait to see what happens when the Census Bureau floats the idea of a national ID card.


That's a good comparison. The anti-national-ID folks generally are further around the bend than the anti-vaxx folks, in terms of the quality and honesty of the arguments they make.


"The anti-national-ID folks generally are further around the bend than the anti-vaxx folks..."

They've had a lot more time to hone their arguments.


I used to be anti-national-ID but then SSN basically ended up AS a national ID, but without being treated with the appropriate standards (non-uniqueness, people assume it's "secret" when it clearly isn't) so I've decided that the horse is already out of the barn, so we might as well at least do it _correctly_.

This strikes me as similar to Matt's argument about the census and privacy but I will say _how_ a National ID is implemented does matter to me, and I don't think we should ignore census privacy concerns altogether.


What is the point of having a national ID? What problem is it trying to solve?


Identity theft. To access money or credit in the US, you just need to provide basic biographical facts (SSN, DOB, etc). Perhaps an image of your drivers license. Once those facts/images are disclosed, whoever has them can reuse them to impersonate you. A modern purpose-built identity scheme can do authentication in such a way that proving your identity to someone is not the same as giving them the ability to impersonate you.

In theory all 50 states could agree on some standard for their drivers licenses which would have this functionality; it doesn't exactly have to be national, but it would be much nicer if it were.
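To sketch the mechanism: in a challenge/response scheme the verifier only ever sees the public key and one-time values. Here's a toy Schnorr-style identification with deliberately tiny, insecure numbers (purely illustrative; real ID schemes use standardized smartcard crypto, and every name below is mine):

```python
import random

# Tiny, INSECURE demo parameters: q divides p - 1 (606 = 6 * 101)
p, q = 607, 101
g = pow(2, (p - 1) // q, p)   # generator of the order-q subgroup
assert g != 1

rng = random.Random(7)
x = rng.randrange(1, q)       # private key: never leaves the ID holder
y = pow(g, x, p)              # public key: on the ID / in a registry

# 1. Prover commits to a fresh random value
r = rng.randrange(1, q)
t = pow(g, r, p)
# 2. Verifier (bank, DMV, ...) issues a random challenge
c = rng.randrange(1, q)
# 3. Prover answers; the response reveals neither x nor r
s = (r + c * x) % q
# 4. Verifier checks the answer against the *public* key only
assert pow(g, s, p) == (t * pow(y, c, p)) % p
```

The verifier ends up holding (t, c, s) and y, none of which lets it impersonate the prover later, unlike an SSN or an image of a license.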


That was my thought process.

However...

1) Think of all the places we currently use SSN as a proxy for this - medical insurance, etc. It's obviously pretty handy to have a unique identifier (names aren't unique, although DOB helps make them unique - but names also change, etc.)

2) Think of places where they currently ask for a driver's license number - for at least some services we think that's reasonable to provide - but why should that be tied to a particular state or whether I drive?

3) Voting... If we want to say you need an ID to vote (which is not by itself an unreasonable request), then why not have a national one?


The Baileys don't give a shit about the Census. Changing it just risks drawing their attention to it and scaring them.


No, clearly the Baileys approve of my preferred policy

Jan 26, 2022·edited Jan 26, 2022

I did a lot of research on this for my MPP a few months ago (and was saved a lot of time by the links in Matt's last piece about this, so thank you Matt!). I came away with the suspicion that this is in large part a hobby horse of John Abowd, economist and Chief Scientist for the Census Bureau. That's very speculative on my part - I'm an onlooker, not someone intimately involved in the issue, so everyone reading this should very much take what I say below with a grain of salt - but here's what made me get that impression.

If you listen to how the Bureau talks about differential privacy, it doesn’t present itself as aiming for some legal threshold. Instead, it talks about trying to optimize a bundle of data usability and data privacy based on Americans’ preferences.

John Abowd has published several papers arguing differential privacy allows us to quantify the trade-off between privacy and useability. He thinks this is great, because if we connect it to survey data on how much people value privacy and usability, it lets us find the bundle that maximizes utility.

Most people, including me, think this is BS. Differential privacy only lets you quantify the risk of exact attribute disclosure (the risk of correctly deducing the value of a specific datapoint – say, the age of a specific person in the microdata). It has nothing to say about identity disclosure risk (the risk of correct attributes being connected to a person's PII using a third-party database). And identity disclosure is (1) the only type of privacy people actually care about, and (2) the only type of privacy Title 13 cares about (the law that says the Bureau can't release PII). After all, the Bureau has been releasing exact attributes in microdata for a while now!

But the bureau still uses the optimization framework in its public communications. It gives me the impression that John Abowd is just really caught up in his research angle. Again, that’s very much speculation on my part, but I don’t think it’s unreasonable. Abowd has a weird interpretation of Title 13 - he believes that "re-identification risk is only one part of the Census Bureau’s statutory obligation to protect confidentiality. The statute also requires protection against exact attribute disclosure.” I haven't been able to find *anybody* else who thinks this, so I can't help but wonder if it's motivated reasoning.
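For readers who haven't seen it, the mechanism behind all this is simple to sketch. This is my own toy illustration of a basic Laplace mechanism for a sensitivity-1 count, not the Bureau's actual TopDown code:

```python
import random

def noisy_count(true_count, epsilon, rng=random):
    """Release a count with Laplace noise of scale 1/epsilon.

    Adding or removing one person changes a count by at most 1,
    so this release satisfies epsilon-differential privacy."""
    scale = 1.0 / epsilon
    # A Laplace variable is the difference of two iid exponentials
    noise = rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)
    return true_count + noise
```

Epsilon is exactly the knob in Abowd's trade-off: smaller epsilon means more noise, hence more privacy and less usability.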


Sorry -- I'm like weirdly fired up about defending John Abowd for no particular reason except that he was nice to me once and I like his early research.

Prior to working on this *precise* topic, Abowd got really into the idea of expanding access to data. The Census Bureau collects longitudinal data on businesses that lets researchers observe layoffs and closures by industry, firm size, age, etc. It's important data that could be useful for a bunch of important policy analyses. But it's basically impossible to get access to. So he championed a new product that used artificial, simulated data that closely mirrored the true data and was easy to get access to. There are a bunch of examples of data like this where he's been pushing to expand access: Data that links IRS income information to the Census, data that links individual-level UI employment histories across datasets and over-time. I think he genuinely wants to live in a world where more people are able to use the data that he thinks policy should be based on. But the counterargument that keeps all that stuff restricted is always about privacy. I think that's what led him to pursue data suppression approaches that can preserve privacy. So I think it's not quite right to think about this issue as "How are the datasets we're already used to using affected by differential privacy?" because I think most of what he had in mind is more like "Will universally adopted differential privacy approaches allow us to use other important datasets that could add value to policy debates but which we're not used to having access to?"


Just to add (I thought in response to a JG comment that I can’t find anymore): I think we all get really numb to some of the data limitations that we’re used to. If you look at IPUMS USA, there are tons of restrictions on the PUMS samples from the 1970 census (state samples, MSA samples, other weird and confusing stuff). The ACS individual level data doesn’t disclose MSA for lots of respondents and most MSA’s are partially and non-randomly disclosed. The CPS doesn’t disclose MSA almost ever. There are plenty of examples of stuff like that, and it’s all motivated by protecting privacy. Those are really useful things that would be great to be able to use, but we get used to the fact that it isn’t possible. Differential privacy can, in principle, help with these things. Having inaccurate counts is bad, but hardly the biggest problem with the implementation of Congressional districting. Is that the price to have the ability to use microdata to study MSA’s in a more comprehensive way? I don’t know if that’s worth it, but I think this is why I’m a little disappointed that there’s never a dialog between the two sides. One side gives costs, one side gives benefits, and there isn’t a conversation to compare those in a meaningful way.

Jan 26, 2022·edited Jan 26, 2022

Ah sorry, I deleted my earlier reply because I didn't feel like it contributed anything of value.

But I very much agree that we need some way of actually comparing the costs and benefits of privacy protection. Abowd's production possibilities frontier (in this paper https://ecommons.cornell.edu/bitstream/handle/1813/39081.1/abowd-schmutte-privacy_20150129.pdf?sequence=2&isAllowed=y) is a really elegant way of doing that, so I get the appeal. I just don't think it works, given that when people are asked how much they value privacy, they're thinking of the risk of reidentification - not attribute disclosure.

But here's my own more crude attempt at a cost-benefit analysis.

Census and ACS microdata is extremely important for researchers in policy, public health, social science, etc. It supports important government functions like redistricting and disaster response. So the benefit of accurate Census data is *really* high.

The cost of accurate census and ACS data is dependent on reidentification risk. That doesn't seem very high to me based on the Census Bureau's experiment with reidentification, but I could be wrong about that - I know lots of people disagree here.

If I'm wrong and reidentification risk is much higher, I'm not sure what type of costs we're looking at (ignoring Title 13 requirements here). Most Census data seems pretty harmless to me. I think the most compelling cost I've heard involves detecting occupancy violations, but it doesn't seem to me that the people in charge of that sort of thing are the people who would be able to do database reconstruction and reidentification.

So that's why I wind up against differential privacy in Census and ACS data at the moment. But I could be persuaded otherwise by a compelling case for higher costs.

But if no larger agreement can be reached between researchers and the Census Bureau yet, I think a good starting point could be the Bureau releasing the differentially private data without the TopDown adjustments. I wrote another comment on this - basically, the worst distortions don't actually come from the noise; they come from trying to make the noisy data cohere across multiple geographies. I'm surprised the Bureau hasn't talked about doing this yet (as far as I know). That data is still differentially private, but also more useable. Seems like a win-win.
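To illustrate that the damage can come from post-processing rather than the noise: the noise itself is mean-zero, but consistency constraints like non-negativity are not. A quick toy simulation (my own sketch with a plain Laplace mechanism, not the Bureau's actual TopDown algorithm) shows that clipping negative counts to zero inflates a true count of zero to about 1/(2ε) on average:

```python
import random

def clipped_noisy_count(true_count, epsilon, trials, rng):
    """Average of max(0, count + Laplace noise of scale 1/epsilon).

    The raw noisy count is unbiased; the clipping step is not."""
    total = 0.0
    for _ in range(trials):
        # Laplace(scale = 1/epsilon) as a difference of exponentials
        noise = rng.expovariate(epsilon) - rng.expovariate(epsilon)
        total += max(0.0, true_count + noise)
    return total / trials

avg = clipped_noisy_count(0, 0.5, 50000, random.Random(3))
# True count is 0, but the clipped estimate averages about 1/(2*0.5) = 1
```

That systematic inflation of small counts is exactly the kind of distortion that hits sparsely populated blocks hardest.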


> ...Abowd has a weird interpretation of Title 13...

Speaking of weird things: It's weird to me that just some basically unaccountable dude's own interpretation of a law has such a large effect on a public agency. Of course, this isn't really unique to The Census Bureau, it's just a weird quirk of the way our government works.


I can't really speak to the hierarchy of who made the call and when - I'm not involved enough with the Census for that. Another commenter said they think the decision was made before Abowd came on board, and they may be right. But I do think Abowd's opinions carry a lot of weight in this space regardless.

Jan 26, 2022·edited Jan 26, 2022

Super interesting!

Watching this video, Abowd calls out "Steve Ruggles" as a man who disagrees with him on this. From your MPP, know anything about him, their arguments etc?

https://youtu.be/yUyCYC6rb_4

Edit: oh yeah, at 19 minutes in he has his privacy/accuracy tradeoff graph.

Jan 26, 2022·edited Jan 26, 2022

Ruggles is leading the charge from the social science community against the Census Bureau's changes. He put together this report back in 2018 that is still the best critical response to the Bureau I've been able to find:

https://assets.ipums.org/_files/mpc/wp2018-06.pdf

I find Ruggles's arguments convincing, but clearly they didn't win the day before the last census. I think actual demonstrations of the issues might persuade more people. For example, the fact that the differentially private data underestimate minority populations and throw off redistricting:

https://alarm-redist.github.io/posts/2021-05-28-census-das/

Or that public health statistics might be thrown off:

https://www.pnas.org/content/117/24/13405


I agree this is interesting, and I am also not intimately familiar with what Abowd is thinking, but I think it's important to keep in mind that 1) he had lots of important and influential work on lots of topics before he started working on this and 2) the policy decisions and actual implementation on this stuff came before any of his academic publications on the topic. So I don't think he's pursuing policy changes to ex post justify his research, but I think he actually came to this line of research from the operational policy decisions and debates.

Jan 26, 2022·edited Jan 26, 2022

I think that's fair. To be clear, I'm not trying to imply Abowd has any malicious intent - I think most of us tend to favor the types of solutions connected to our interests and work. What makes me raise my eyebrows is (1) unacknowledged issues with his social welfare function analysis, which I describe in the original post, and (2) his statements around the legal question.

To address the legal question a bit more: Title 13 forbids the Census Bureau from "mak[ing] any publication whereby the data furnished by any particular establishment or individual under this title can be identified." For most of the Bureau's history, they've interpreted that to mean they can't share things like names - but sharing accurate attributes without personal identification info is fine.

Abowd doesn't seem to agree with this:

https://blogs.cornell.edu/abowd/special-materials/245-2/.

Abowd says in point 15 on that link that Title 13 protects not just against reidentification, but also against "exact attribute disclosure." That's really weird to me - if it was the case, the Census Bureau wouldn't have been okay with releasing accurate microdata in the past. But maybe he just means that if attribute disclosure creates high risk of identity disclosure, Title 13 would require protection against release of exact attributes as well?

Well...it's hard to tell, because in point 24 on that link, Abowd also says "You are free to take issue with this risk assessment, but the statutory confidentiality protection obligation is the domain of the Census Bureau, and the protections of Title 13, section 9 are not subject to a “when convenient” exception." My best read of what he's saying here is that regardless of the risk assessment, Title 13 requires differential privacy. That doesn't make any sense - as I discussed above, the question of reidentification risk necessarily precedes the question of protection against attribute disclosure.

So maybe Abowd has a really weird interpretation of Title 13. Or, I guess, he could be doing some circular reasoning. Or, he's being evasive. Or I'm interpreting his statements incorrectly. I'm really not sure.


Really weird or motivated interpretation of Title 13. When you've decided to do something, many things start to look like support for that decision.


This is all very reasonable, but John Abowd at Cornell Economics led the Census Bureau's charge on this, and while he may be wrong on this issue, he's neither evil nor stupid. For all the controversy and criticism about this change, I'm shocked that there's no "John Abowd faces his critics" type interview anywhere. Presumably he has SOME responses to these critiques, but I feel like the conversation never directly engages both sides of the argument and people kind of just talk past each other.


This is a fascinating topic for me to discuss because while I was in applied math grad school at MIT, my research was tangentially related to the academic discipline of differential privacy, which is generally seen as one of the hot up-and-coming fields in computer science. It is genuinely exciting for many of those researchers to be able to see real-world applications on something as high profile as the Census, and while I'm not personally motivated by privacy concerns, I sympathize with the attempt to try to quantitatively balance privacy and accuracy.

One of the things that I would emphasize in this discussion is that all of the various knobs in these algorithms are adjustable. If they're indeed overvaluing privacy as you claim, they can always turn the amount of noise down to produce generally more accurate counts the next time around. If the adjustment algorithms prioritize getting the wrong counts right, they can adjust the algorithm to get the counts that matter more right. Just as you're questioning the value of the privacy offered, one could also question the value of exact accuracy, and perhaps attempt to offer an actual cost-benefit analysis. I fully expect academics to dissect this a thousand ways and be able to come to some sort of general consensus of the different options and their tradeoffs over time.

I'd also just emphasize how new all of this is. DP as an academic discipline really only got kicked off with a 2006 paper establishing how noise injection leads to privacy. There's been a lot of academic work since then, but handling the complexities of applying this to real-world data is just getting started.


Isn't part of the problem that it's hard to prove tight bounds on how well an algorithm preserves privacy? So e.g. if we decide to use an algorithm that allows an at most 5% probability of someone's identity being uncovered, it might actually be much lower than 5%, and maybe we could have achieved 5% with much less noise introduced. (Maybe even the old "swapping" technique was already sufficient to achieve eps-differential privacy, but we just couldn't prove that it works!)
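For reference, my understanding of swapping (a toy sketch, not the Bureau's actual procedure) is that it exchanges a geography field between pairs of records, so aggregate totals stay exact while record-level geography becomes uncertain:

```python
import random
from collections import Counter

def swap_geography(records, swap_rate, rng):
    """Toy record swapping: exchange the 'block' field between random
    pairs of records. Totals over any block are preserved; which
    record belongs to which block gets perturbed."""
    out = [dict(r) for r in records]
    n_pairs = int(len(out) * swap_rate / 2)
    for _ in range(n_pairs):
        i, j = rng.sample(range(len(out)), 2)
        out[i]["block"], out[j]["block"] = out[j]["block"], out[i]["block"]
    return out

# 100 records, each on its own (distinct) block for the demo
records = [{"block": b, "age": 20 + b} for b in range(100)]
swapped = swap_geography(records, 0.1, random.Random(1))
```

The multiset of block values (hence every published count) is identical before and after, which is exactly why proving a formal epsilon for it is awkward: the privacy comes from the attacker's uncertainty about *which* records were swapped, not from calibrated noise.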


I think this answer is true in a general sense -- I'd just shorten it to saying that it's hard to prove things -- but these criticisms don't necessarily hold in this particular case.

First, the definition of differential privacy doesn't reason in terms of a probability of someone's identity being uncovered; it's quite a bit more robust than that. I would explain it as an upper bound on the amount of information anyone (with any amount of time or computing power on their hands) can glean about any individual person from the data released. (Here by "information" I technically mean something like "log odds ratio" for those with more technical background.)

Second, yes, there can often be some slack in the sorts of bounds that can be proven, but computer scientists are also always trying to characterize the other side of the equation with concrete examples. At first they always start with theoretical scenarios, because that's what you can most easily prove things about, but with all of this real world data, I fully expect plenty of people to try to quantify the actual amount of information released by this technique. We know based on the proofs that there's some upper bound, but how close do we get to it in reality? There are lots of interesting (and now, important) questions to be uncovered there, and I fully expect a robust literature to develop to answer that question, if it hasn't already.

Third, while I'm not directly familiar with the swapping technique, I would fully expect that they've already demonstrated that previous techniques had problems. That's like a prerequisite to the field even being interesting in the first place, and speaking from experience hearing various talks about DP, they often have several stories to share about privacy being broken in unexpected ways.
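
To make the log-odds reading above concrete, here's a toy sketch (my own illustration with made-up counts, not the Census Bureau's actual mechanism): a counting query has sensitivity 1, so adding Laplace noise with scale 1/eps bounds the log odds ratio between any two neighboring datasets by eps, for every possible output.

```python
import math
import random

random.seed(42)

def laplace_noise(scale: float) -> float:
    # Inverse-CDF sample from a Laplace(0, scale) distribution.
    u = random.random() - 0.5
    sign = 1.0 if u >= 0 else -1.0
    return -scale * sign * math.log(1 - 2 * abs(u))

def laplace_pdf(x: float, mu: float, scale: float) -> float:
    return math.exp(-abs(x - mu) / scale) / (2 * scale)

# A counting query ("how many people in this block?") has sensitivity 1:
# one person joining or leaving changes the true answer by at most 1.
# Laplace noise with scale 1/eps then gives eps-differential privacy.
eps = 0.5
scale = 1.0 / eps
count_with_me = 100     # dataset D
count_without_me = 99   # neighboring dataset D'

released = count_with_me + laplace_noise(scale)

# The guarantee: for ANY released value x, the log odds ratio between
# the two worlds is bounded by eps, no matter how clever the attacker.
for x in [97.0, float(released), 104.5]:
    log_odds = math.log(laplace_pdf(x, count_with_me, scale)
                        / laplace_pdf(x, count_without_me, scale))
    assert abs(log_odds) <= eps + 1e-9
```

That bound holds for every output, which is why the definition doesn't depend on any particular attacker model.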

Expand full comment
Jan 26, 2022·edited Jan 26, 2022

I think it's time to retire the tired trope that private companies know everything about us when the evidence given is some case of a teenage pregnancy back in 2014. If that's all the evidence we can muster, well, then you know . . . My experience is that company tracking is pretty dumb. It's the bounteous ads I get because *I just bought the damn thing they're advertising.* Or you google something once and you get obnoxious ads peppering your screen for it for months, until you swear you will never use that service ever (looking at you, zipcar. com)

As for the real substance of this post, I'd say my bias is toward more protection of personal information from our government than less, but I admit that that's not a slam dunk position of mine. Like so much of life, unlike my hatred of zipcar. com ads, it's complicated.

Expand full comment

If you go to any website with chat support, not even logged in, and strike up a conversation, the support specialist (or bot) will generally have a profile of who you are: your other emails and screen names, your employer, etc. Not everyone will have all of that data complete in their profile, and it's not that hard to defeat if you're at all tech savvy, but it's a mundane, everyday example. They generally won't know if you're pregnant, but that's because it's not worth knowing in that context compared to the cost of harvesting, curating, and modeling that data.

Here's another example: blood banks aren't governed by HIPAA. My local blood bank, where I'm a donor, gets fussy a few weeks before I can donate again and starts hitting me with ads just about every time. Back when I still used facebook, they would put ads on my facebook page, targeted at *me*, with a message like "The need for B- is at an all-time high!" That was in plain text. So facebook, in some dark database there's no way of knowing if they care enough to mine, knows some interesting things about me: my blood type (which has covariance with other genetic, racial, and health attributes), information about my likely sex life (I think at the time you couldn't give blood if you were homosexual, but now I think it's been changed to you can give blood as long as you haven't had same-sex sexual contact in the last X days), my travel history, etc.

I get the Target example is overused, that doesn't prove the negative, though.

Expand full comment

Yeah, the Red Cross tells me when it's been at least eight weeks since my last donation. Would the Red Cross put an ad on Facebook telling me the same thing? Don't know; not on Facebook.

Anyway, this is pretty weak tea for a panopticon world.

And if companies or bad actors are trying to use blood type to supposedly discover things about you you'd rather not have them know, then I suspect they're being snowed. Blood type? Really?

Expand full comment

I used 60 seconds to come up with top-of-mind examples. The missed point is that these things are mundane and banal and everywhere. The largest barrier to companies "knowing" something about you is that the apparent value that exploiting that knowledge could deliver is lower than the cost of organizing that data. The examples of value are becoming more accessible and the cost to organize is getting lower, both incrementally.

If you read that context as uninteresting, that's okay.

Expand full comment

think about how much collective time has been wasted by all of us clicking "Allow Cookies"

like, who exactly are these privacy nazis? Who is it that cares this much?

Expand full comment

Germans. I think they are technically privacy anti-Nazis and anti-Stasi.

Expand full comment

if there are 500m internet users in the US and Europe (there are more), and they take just one second per day clicking "Allow Cookies", then that means we're wasting 15.85 years of human life every day. Every single day, these privacy nazis are doing the equivalent of locking someone up for a decade and a half. That is unconscionable.
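
For what it's worth, the arithmetic checks out (assuming 365-day years):

```python
# 500 million users, one second each per day, converted to years of human time.
seconds_wasted_per_day = 500_000_000 * 1
years_per_day = seconds_wasted_per_day / (60 * 60 * 24 * 365)
print(round(years_per_day, 2))  # 15.85
```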

Expand full comment

I share your hate for these boxes. I use a Chrome extension that takes care of most of these: https://chrome.google.com/webstore/detail/i-dont-care-about-cookies/fihnjjcciajhdojfnbdddfaoknhalnja?hl=en

Expand full comment

Hi All. I run a privacy think tank (FPF.org) - a pragmatic, centrist think tank focused on helping support responsible uses of data, working with the Chief Privacy Officers of many data-driven companies, researchers, cities, and schools. I am a data optimist and enthusiast, not an overly cautious fanatic. I have closely followed this issue. My informed view:

1) The actual privacy risk of releasing large overlapping sets of data is real. The Census will never be trusted if we release data as in the past and it is used to disclose details people consider confidential.

2) For many years, the Census has added noise to data to support deidentification. Many users of these data sets haven't been aware, or simply relied on the final data. Now the Census is transparently explaining the techniques used.

3) Changing the way data is released is painful, due to the transitions that will need to be made by those relying on the previous techniques and the tradeoffs required by the protections needed.

4) Differential privacy is currently the most sophisticated way to assess techniques that add some noise to data in a manner that maintains accuracy at aggregated levels.

My interview with one of the "inventors" of differential privacy can be viewed here: https://www.linkedin.com/video/live/urn:li:ugcPost:6783434965402075136/

Expand full comment

In that long paragraph, you don't even come close to making an actual argument in favor of your position. What exactly is the "risk," and how does it compare to the massive amounts of data that people are constantly handing over to private companies every second of every day?

Expand full comment

Happy to post links to the extensive debates by technical experts on this...will do so as soon as I can today. With regard to the data I provide to Google and the rest of the companies I interact with - they do not make it public, and I don't have to use those services, or can choose less data-intensive alternatives. This data is published to the world. How many people are in my household is no big deal to some of us - to far more vulnerable people, the census data being identified can be damaging.

Expand full comment

I'm not reading extensive technical debates. If the argument can't be explained clearly and shortly, then it's probably not very good.

Expand full comment
Jan 26, 2022·edited Jan 26, 2022

My impression from the Census Bureau's big reconstruction and reidentification attempt was that (1) it only successfully reidentified a minority of people, and (2) a hypothetical attacker would have no way of knowing how successful their reidentification attempt was, because that attacker would not have access to the original identifying information that the Census Bureau does.

My takeaway from this was that reidentification doesn't seem to be a huge risk yet - am I getting something wrong here?

Also, it's hard for me to imagine what harmful things would befall people if Census data were released, given most of it is pretty mundane (age, race, etc.). What are your thoughts on that?

Expand full comment
Jan 26, 2022·edited Jan 26, 2022

As always, the question I want an answer to is RISKY COMPARED TO WHAT?

Consider that for most of the latter half of the 20th century, AT&T -- as a treasured public service! -- distributed printed books listing the name and address of the vast majority of people in the country. Everyone who lived in the same city got the names and addresses of all of their neighbors, and most major libraries got copies of the whole shebang. They also published or licensed the publishing of "reverse directories" that linked numbers and addresses back to names.

We somehow managed to survive this era without whatever privacy apocalypse is being worried about here, so I'd really like to know what the scenario is with the census that is so much worse that it justifies eliding the raw data.

Expand full comment

Especially in a world where Equifax hacks have already exposed all of the genuinely interesting information about individuals already, right on down to the complete salary histories of 125 million of us.

Expand full comment

yeah so like what's the marginal benefit that's being achieved here?

Expand full comment
Jan 26, 2022·edited Jan 26, 2022

Here's a partial solution to the issue - as well as a slight correction on Matt’s description of noise at larger geographies versus smaller ones.

The biggest distortions don’t come from differential privacy on its own. The noise added is pretty small, and it's normally distributed, which is good.

The worst distortions come from modifications made to correct for problems the normally distributed noise causes:

(1) Values that look weird – say, a Census Block with negative 3 houses, or a county with 800.354 people – are changed to look normal.

(2) Data for different levels are modified so all levels line up (for example, the population of states might be modified to sum to the population of the country). They have to do this because the noise gets added to summary data at every level, making different levels not cohere with one another. They start modifying at the top and work down, which is why smallest geographies are the worst - they're affected by all the modifications at the levels above them.

A bunch of researchers have suggested the Bureau also release data with just the original normally distributed noise - not the corrections for weird values and for alignments across levels. Last I heard, the Bureau wasn’t considering this, which is super weird - the data with just the normally distributed noise is still differentially private, and would be *way* more useful to researchers.
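
A toy sketch of that dynamic, with made-up numbers and a deliberately crude adjustment rule (the Bureau's actual TopDown algorithm is more sophisticated), shows why the post-processing rather than the noise is where the damage concentrates:

```python
import random

random.seed(1)

def noisy(count, scale=5.0):
    # Stand-in for the injected noise; the real mechanism and scale differ.
    return count + random.gauss(0, scale)

true_states = [400, 250, 350]      # toy "state" populations
true_national = sum(true_states)   # 1000

# Noise is added independently at every level of the hierarchy...
noisy_states = [noisy(c) for c in true_states]
noisy_national = noisy(true_national)

# ...so the levels no longer cohere:
gap = noisy_national - sum(noisy_states)

# A crude top-down fix: trust the higher level and push the discrepancy
# down, which is why the smallest geographies absorb the most distortion.
adjusted_states = [c + gap / len(noisy_states) for c in noisy_states]
assert abs(sum(adjusted_states) - noisy_national) < 1e-9
```

Releasing the pre-adjustment noisy counts alongside the corrected ones would let researchers skip the distortions entirely, which is the suggestion above.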

Expand full comment

I'm concerned broadly about privacy. I don't know what's been done with the data in the OPM hack, but the fact that it's never shown up on the dark web for sale concerns me. The fact that we're walking around with devices that track what we're reading and our physical locations, the increasing availability of compute power (and better algorithms) that makes it easier to solve image recognition problems -- I don't think people have caught up in terms of their behavior to this reality, although I'm probably less-concerned about large-scale releases of data and more concerned about foreign governments and other organized groups targeting specific individuals. Anyway, I have privacy concerns.

But I have trouble thinking of a scenario in which I'd go for Census data if I were trying to make money/blackmail people/generally sow chaos. I'm not saying there isn't one, but I'd like to see a possible list on this. Even if there isn't really anything nefarious about it, you could still want to lock it down -- if someone does a successful re-identification attack here, Americans are going to be concerned and that threatens Census, even if there aren't really large-scale negative effects in other ways.

And Census has the advantage of being really big. But the richer data sets like ACS aren't that huge. And if you're putting in the time and computing power, presumably you can go after something like campaign finance donations, house sales, maybe that linkedin data that got scraped. Make a website with Trump supporters who work in certain industries or something. Or if you're trying to find rich people to target, there's housing sales and maybe LLC records, and just tons of other public data like who is donating money to various places. If you're trying to find really personal data on people, there's probably a dating app with APIs that aren't as secure as they could be, or something.

Maybe if you're trying to get data on immigrants, or people getting certain public services, you could get that in Census and not a lot of other places. But you can probably do a lot of that with just zip code, and no one seems to be doing that now.

Expand full comment

A long time ago, Netflix released anonymized user ratings of content. People joined that to IMDB and were *sometimes* (or to a certain limited probability) able to connect their private behavior (Netflix watching) with their public behavior (listing/favoriting movies on IMDB). So, if you went through and favorably listed a unique set of movies on IMDB connected to your main email / real name, you weren't expecting Netflix to release that same "fingerprint" and let everyone know that you also watched the hell out of "Fifty Shades of Gray" and loved it.

It's a pretty niche privacy scenario, but everyone with a midwit understanding of big data and privacy wields that example like a club and sometimes calls it "The Netflix Hack" (if using publicly released data in a novel way can be called a hack...). I think the real lesson from the "Netflix Hack" (and a big lesson if you follow any of the cases where the FBI has caught people who release pirated video and software) is that the slices of your life that you and your vendors put on the internet are only moderately difficult to triangulate into a complete identity of public and private behavior, for anyone with basic data and automation skills and a little bit of a sleuth's mindset. Fuzzing Census data is destructive and useless.
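
The mechanics of that kind of linkage are simple to sketch. Everything below is invented toy data; the real Netflix de-anonymization used approximate matching of ratings and dates across millions of records:

```python
# Toy "anonymized" ratings dump, keyed by an opaque ID.
anon_ratings = {
    "user_8841": {"Movie A", "Movie B", "Movie D"},
    "user_1203": {"Movie A", "Movie C"},
}

# Toy public profile (think IMDB favorites tied to a real identity).
public_profiles = {
    "jane@example.com": {"Movie A", "Movie B"},
}

# If a public fingerprint is contained in exactly one anonymized record,
# the "anonymous" ID resolves to a real person.
for identity, fingerprint in public_profiles.items():
    matches = [uid for uid, titles in anon_ratings.items()
               if fingerprint <= titles]
    if len(matches) == 1:
        print(identity, "is probably", matches[0])
# prints: jane@example.com is probably user_8841
```

The sparser and more distinctive the fingerprint, the more often it's unique, which is the whole trick.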

Expand full comment

Pretty slick.

Even better, many years ago, I worked for a company that -- and this is true -- found out the name, address, and phone number of every person in the community, and delivered that in a big package to everyone once a year. Worse than that, I was responsible for putting together daily changes and delivering that to the company where they provided a phone number you could call and get that latest privacy information!

I do wonder whatever happened to Southern Bell.

Expand full comment

Basically, after decades of abuse, government scientists and administrators know that Republicans are itching to shut them down or destroy their work and that no one will protect them. Being in a permanent defensive crouch makes for bad decisions.

Expand full comment
Jan 26, 2022·edited Jan 26, 2022

On one side, there’s a bunch of political scientists and economists who think they have the Constitutional right to have the taxpayers do data collection for them, and will now have a slightly harder time publishing papers that “discover” “causal effects”.

On the other side, there’s a bunch of theoretical computer scientists who, in the tradition of their field, have a very paranoid threat model and always focus on the worst possible outcome. It’s pretty rare for theoretical CS to have any real impact and they’re not going to give up on that opportunity.

I’m biased towards the computer scientists here, although the anti-differential-privacy crowd’s arguments seem stronger now than a year or two ago (as the two sides come to understand each other better).

If the non-noisy data is released, someone will do the database reconstruction attack, and it will become available on the Internet. It won’t be hypothetical for long. I don’t actually have a sense of how bad this would be in practice.

Expand full comment

I still don’t understand what the bloody use case is. The government has all this data on every 1040 I’ve submitted. Google has it or can piece it all together, as can at least Amazon and Netflix.

Probably every one of those “look up people’s public records” sites has 90% of the census data and a bunch more besides.

What is this protecting us from?

Expand full comment
Jan 26, 2022·edited Jan 26, 2022

It's not about the government having it. You can get person-level Census data, including ACS, as a researcher (or just a person) via ipums. You can't get people's tax data that way. And there's a lot more in ACS than what you can find online about people. (At least, unless you're going to certain places on the web that I am not.)

Expand full comment

> You can get person-level Census data, including ACS, as a researcher

Yes. It's a tightly controlled process.

> You can get person-level Census data, including ACS, as just a person

No. I get that IPUMS's own descriptions of itself use the term "individual level data". It's still aggregated and small crosstabs are suppressed. The smallest area is a unit of 100k population. IPUMS does have access to the individual source data, but that's not what gets distributed.

Can you share an example of supposed individual data you are getting from IPUMS? I'm afraid we're talking past each other here and I think this would clear it up.

Expand full comment
Jan 26, 2022·edited Jan 26, 2022

Sure. But I think we must be talking about the same thing, and just referring to it differently. You go here, you check out some data: https://usa.ipums.org/usa-action/variables/group. The resulting data file is microdata, or "person-level", in that each line represents an individual. But it's not exactly individual-level data, because they do add some noise to it - I don't know if they always did that, or if that's new. It's been years since I worked with this.

Because of privacy issues, they also do some truncation and they remove zip code/block. So you can't actually get the salaries of high earners, and you're looking at the county/PUMA level or whatever for each person.

When I was using this data, we definitely referred to it as "person-level". With the noise added, I'm not sure what I'd call it. It's sort of "fake-person-level," maybe. But I wouldn't really call it aggregated.

Expand full comment
Comment deleted
Expand full comment

Not seeing the logic that led you here?

Expand full comment

Another example of making policy without cost-benefit analysis.

My related beef is that medical communication has to be conducted by clunky "portals" rather than old-fashioned email. What is the expected value of the harm of someone hacking my email and learning my PSA at the same time I do? Does it exceed the cost of establishing and using the "portal"? Much communication with the government suffers the same problem. Why does it need to be more difficult to look up something from my Social Security account than to buy something from Amazon?

Expand full comment

Email is uniquely terrible in a way that makes this necessary. If it were e.g. Whatsapp it would be fine. But email is transparent to too damn many intermediaries.

Expand full comment

Whatsapp would be OK with me, but it's pretty hilarious to imagine MY provider with a Whatsapp account. He's just about as likely to be on TikTok! :)

Expand full comment

I think the medical stuff is more about your spouse or parent who might share your email account getting your private medical records.

Expand full comment

Anyone who shares an email account probably does not value that kind of medical privacy very highly. :)

Expand full comment

That’s one of those too-convenient sour grapes assumptions, like “don’t worry about biased job interviewers because you don’t want to work for any company with biased people anyway”.

I don’t know how many children share email addresses with their parents, but if they start getting STD tests they probably don’t want their parents knowing, even if they didn’t mind sharing stuff with their parents when they were 13.

Expand full comment

Maybe I'm too categorical, but my feeling is that on balance more costs are imposed than benefits created by these privacy rules. Perhaps if one could opt out. Like the easier-to-open and more secure medicine containers. :)

Expand full comment

I definitely believe that there are more costs than benefits here (especially when it comes to things like my doctor saying "we can't e-mail you the x-ray, but I won't object if you take a picture of my computer screen right now") but I just want to be sure that we're not pretending that the benefits are zero.

Expand full comment

And I'm making sure we don't think the costs are zero. We are non-zero guys. :)

Expand full comment