Back on August 16, I noted that the Census Bureau has begun to deliberately insert errors into the block-level data it releases to the public.
This strategy is called “differential privacy,” and it’s supposed to help safeguard against the hypothetical scenario in which someone uses powerful computers to analyze fine-grained Census data to obtain information about individuals. Not to fully recount the previous post, but I think this is a low-value undertaking — unless you are a weirdo living off the grid, basic Census-style information about where you live, your age, marital status, and race and gender identities is widely available to commercial actors. Private companies are already scrutinizing your purchase patterns to determine that you are pregnant and then potentially getting in hot water for notifying relatives you haven’t actually informed.
Given that baseline reality, compromising the usefulness of Census data for the sake of minimal privacy gains seems really bad.
There have been two unfortunate developments since my previous post on this. One is we now have studies showing that some of the specific fears about the inaccurate data were justified — rural populations and minority populations are going to be miscounted and the apportionment of legislative seats is likely to be screwed up.
The other is that the Census Bureau is rolling out new pro-privacy, anti-accuracy ideas that are going to make American Community Survey data worse. They say that this is required by statute. I am not an attorney and their General Counsel is, so they may be correct. But they haven’t lost a lawsuit over this. And I don’t see any particular evidence that Congress is agitating for this outcome. I also think that due to a separate set of Census controversies, none of the Biden appointees at the Commerce Department or the White House are likely to want to get involved in bossing the Census around.
All of which is to say that I hope the career people at Census will reconsider their approach to the statutory issue. And even more so, I hope that members of Congress will urge the Census to reconsider and also work on legislation that could save the Census and make it useful.
Some Census Bureau fundamentals
As an institution, the Census Bureau is a bit of an odd duck. The Constitution rather casually requires an “actual Enumeration” of the population to be “made within three Years after the first Meeting of the Congress of the United States, and within every subsequent Term of ten Years” but otherwise leaves the question of how to do this up to Congress.
In principle, all the constitutional Census requires is an accurate count of how many people live in each state — the text simply says the enumeration should be used to apportion the number of House seats and electoral votes each state gets. But when the Census Bureau was set up, Congress had it gather more detailed geographical information. So even though the constitutional provision doesn’t require that the Census count how many people live in a given county or town, in practice, the Census has been collecting and publishing that information for a long time.
Much later, in a line of cases that began with the 1962 Baker v. Carr decision, the Supreme Court ruled that outside of the special case of the U.S. Senate, all legislative districts need to have roughly equal populations. That applies whether we’re talking about a U.S. House seat, a state senate seat, or a city council district.
That’s a good idea, but it depends on the Census providing accurate small-area population counts, something that it has in fact been doing but that is not a constitutional requirement.
More broadly, over the decades (centuries, really) Congress and the Census Bureau have felt that as long as there is a government agency conducting surveys and counting stuff, it ought to put out a bunch of useful statistical information above and beyond its narrow apportionment role. Some of that has been done over the years as part of the decennial census, but a lot of it is now done as separate surveys. After all, one problem with trying to count every single person in the country is that it’s very expensive and labor-intensive. Another problem is that it basically guarantees an undercount. The Census Bureau has started using statistical sampling to produce counts that are both more accurate and — because the method is less expensive — more frequent. Republicans, however, feel that statistical sampling will disadvantage them in redistricting battles and have successfully propounded the theory that an “actual enumeration” means no sampling.
Regardless of the merits, this means that the Census Bureau does both a decennial census for redistricting purposes and also a more frequent American Community Survey that gives us lots of information based on statistical sampling.
“Differential privacy” is messing up the Census
The idea behind the differential privacy strategy, as I understand it, is that you can basically introduce statistical noise into the block-level results in a way that will make it impossible to reverse-engineer the demographic profile of people who live at specific addresses.
This has, I think, roughly zero real-world privacy value, but it’s what the Census Bureau has decided the law requires them to do. The problem with introducing noise is that sometimes it’s useful to look up block-level information, in which case it would be nice if that information didn’t have deliberate inaccuracies. The Bureau’s view is that this is not so bad because as you aggregate up to higher-level geographies (block groups, tracts, counties), the noise cancels out.
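To make the "noise cancels out as you aggregate" claim concrete, here is a toy sketch. It is an assumption-heavy illustration, not the Bureau's actual procedure: it adds simple Laplace noise to made-up block populations and skips the post-processing the real system applies. The block sizes, the noise scale of 5, and the grouping of 10 blocks into a hypothetical "precinct" are all invented for the example.

```python
import random

def laplace_noise(scale, rng):
    # The difference of two exponential draws follows a Laplace(0, scale) distribution.
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)

def mean_relative_error(true_totals, noisy_totals):
    # Average of |noisy - true| / true across geographic units.
    errors = [abs(n - t) / t for t, n in zip(true_totals, noisy_totals)]
    return sum(errors) / len(errors)

def group_sums(values, size):
    # Add up consecutive runs of `size` blocks into larger units.
    return [sum(values[i:i + size]) for i in range(0, len(values), size)]

rng = random.Random(0)

# Hypothetical block populations (assumption: 10,000 blocks of 20 to 200 people).
blocks = [rng.randint(20, 200) for _ in range(10_000)]

# Toy privacy step (assumption): independent Laplace noise with scale 5 per block.
noisy = [b + laplace_noise(5.0, rng) for b in blocks]

block_err = mean_relative_error(blocks, noisy)
precinct_err = mean_relative_error(group_sums(blocks, 10), group_sums(noisy, 10))
county_err = mean_relative_error([sum(blocks)], [sum(noisy)])

print(f"block-level relative error:    {block_err:.4f}")
print(f"precinct-level relative error: {precinct_err:.4f}")
print(f"county-level relative error:   {county_err:.4f}")
```

In this toy setup the relative error shrinks at each level of aggregation, but the mid-sized groupings still carry noticeably more error than the county total. The sketch only models random noise; it says nothing about systematic bias, which depends on details of the real method that are not reproduced here.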
But Christopher Kenny, Shiro Kuriwaki, Cory McCartan, Evan Rosenman, Tyler Simko, and Kosuke Imai find that you are left with significant errors at the level of electoral precincts. The new Census Bureau method “systematically undercounts the population in mixed-race and mixed-partisan precincts, yielding unpredictable racial and partisan biases.” If you aggregate up the precincts into counties, the errors would probably balance out as designed. But consider the following:
Some counties are so large (Harris County in Texas has 4.7 million people) that they contain multiple House districts, so aggregation can’t save you here.
Lots of electoral districts need to be based on low-level geographies because we’re talking about state and local offices.
The people who actually draw legislative districts don’t respect county boundaries when they aggregate precincts together.
Long story short, even though the text of the Constitution doesn’t require the Census Bureau to produce accurate block-level information, our constitutional system in practice presupposes that it will exist. Eliminating that information “leads to a likely violation of the ‘One Person, One Vote’ standard,” an outcome the authors characterize as “underscor[ing] the difficulty of balancing accuracy and respondent privacy in the Census.”
I will be less polite and say that breaking the concept of fair representation in order to (allegedly) make it harder for someone to find out your age, race, and sex by reverse-engineering Census data is really dumb. Everyone who wants to know this about you can already find out!
Now the other surveys are gonna be wrecked, too
The same basic principle also applies to the American Community Survey and the Current Population Survey.
Right now, the information is anonymous, but it’s also quite detailed. So in theory you could use a lot of computing power to make accurate individual-level inferences. The Census Bureau plans to address this in the ACS by using “synthetic data” and making other changes that compromise accuracy.
One of these proposed changes would round wage data to the nearest 50 cents rather than providing the exact number. That makes it harder to de-anonymize because it eliminates a source of differentiation. But it also completely breaks some forms of analysis. During the post-pandemic recovery, for example, one thing you have to watch for is compositional effects distorting wage data. Back in spring 2020, layoffs disproportionately impacted relatively low-wage workers in the restaurant business. That popped up in naive data sources as a huge surge in average wages. Then, when restaurants started hiring again, average wages fell.
The Atlanta Fed’s wage tracker tries to give you a more accurate view of the situation by paying attention to continuous wage changes experienced by actual people.
But as John Robertson noted on Twitter, the Atlanta Fed relies on the CPS for its data.[1] If it has to use rounded data, its median wage growth metric gets totally broken. Not great!
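A small worked example shows why rounding hurts a measure built on individual wage changes. This is a simplified stand-in for a matched-worker median, not the Atlanta Fed's actual code. The five hourly wages and the uniform 2 percent raise are made up for illustration, and `round_to_half_dollar` is a hypothetical helper.

```python
import statistics

def round_to_half_dollar(wage):
    # Hypothetical helper: round an hourly wage to the nearest 50 cents.
    return round(wage * 2) / 2

def growth(old, new):
    # Fractional change from the old wage to the new wage.
    return (new - old) / old

# Made-up hourly wages for five workers (assumption for illustration).
wages_before = [12.10, 13.40, 15.10, 15.30, 18.20]

# Assumption: every worker receives exactly a 2 percent raise.
wages_after = [w * 1.02 for w in wages_before]

# Growth measured on the exact wages.
true_growth = [growth(b, a) for b, a in zip(wages_before, wages_after)]

# Growth measured after both years are rounded to the nearest 50 cents.
rounded_growth = [
    growth(round_to_half_dollar(b), round_to_half_dollar(a))
    for b, a in zip(wages_before, wages_after)
]

true_median = statistics.median(true_growth)
rounded_median = statistics.median(rounded_growth)
zero_count = sum(1 for g in rounded_growth if g == 0)

print(f"median growth, exact wages:   {true_median:.4f}")
print(f"median growth, rounded wages: {rounded_median:.4f}")
print(f"workers showing no raise after rounding: {zero_count}")
```

With these made-up numbers, every worker truly gets a 2 percent raise, yet the rounded data show two workers with no raise at all and the rest with raises of roughly 2.8 to 4.2 percent, so the measured median lands near 2.8 percent. Individual changes become lumpy jumps of 0 or 50 cents, which is exactly the kind of signal a person-by-person tracker depends on.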
Beyond the rounding, the synthetic data presents a fundamental conceptual problem for research. Here’s how Mike Schneider from the Associated Press describes the concept:
The synthetic data are created by taking variables in the microdata to build models recreating the interrelationships of the variables and then constructing a simulated population based on the models. Scholars would conduct their research using the simulated population — or the synthetic data — and then submit it, if they want, to the Census Bureau for double checking against the real data to make sure their analyses are correct.
This is a fun science experiment. But if researchers are only allowed to research fake (I mean “synthetic”) data based on statistical models, they can’t discover anything new. The model will be based on analysis of the actual data, but then all subsequent analysis will really just be investigations of the properties of the model, not of the underlying reality.
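Here is a deliberately crude sketch of that worry. It assumes a synthesizer that only models a straight-line relationship between two variables, which is far simpler than anything the Census Bureau would actually use. The "real" data, the variable names, and the `tail_gap` statistic are all invented for the example.

```python
import random
import statistics

rng = random.Random(1)
n = 20_000

# "Real" data (made up): y has a U-shaped dependence on x that the modeler did not anticipate.
real_x = [rng.gauss(0, 1) for _ in range(n)]
real_y = [x * x + rng.gauss(0, 0.1) for x in real_x]

# Hypothetical synthesizer: fit a straight line of y on x by least squares.
mean_x, mean_y = statistics.fmean(real_x), statistics.fmean(real_y)
sd_x = statistics.pstdev(real_x)
cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(real_x, real_y)) / n
slope = cov / sd_x ** 2
intercept = mean_y - slope * mean_x
resid_sd = statistics.pstdev(
    [y - (intercept + slope * x) for x, y in zip(real_x, real_y)]
)

# Simulated population drawn from the fitted model rather than from reality.
synth_x = [rng.gauss(mean_x, sd_x) for _ in range(n)]
synth_y = [intercept + slope * x + rng.gauss(0, resid_sd) for x in synth_x]

def tail_gap(xs, ys):
    # Mean of y where |x| > 1 minus mean of y where |x| <= 1.
    tails = [y for x, y in zip(xs, ys) if abs(x) > 1]
    middle = [y for x, y in zip(xs, ys) if abs(x) <= 1]
    return statistics.fmean(tails) - statistics.fmean(middle)

real_gap = tail_gap(real_x, real_y)
synth_gap = tail_gap(synth_x, synth_y)

print(f"gap in real data:      {real_gap:.3f}")
print(f"gap in synthetic data: {synth_gap:.3f}")
```

In the made-up real data, the extremes of x look very different from the middle. In the synthetic data that gap essentially vanishes, because the model that generated it never knew about the pattern. A richer model would preserve more structure, but only structure someone already thought to build in.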
But beyond that, I agree with Margo Anderson, quoted in Schneider’s piece, that there is a question of principle here — the Census Bureau has traditionally been a source of accurate information and they are now planning to dispense inaccurate information:
“The Census Bureau is saying this is in the tradition of what they have always done” in protecting privacy, said historian Margo Anderson, a professor at the University of Wisconsin-Milwaukee. “There’s an increasingly substantial organization of critics saying this is completely different. They say, ‘You have never made the data intentionally inaccurate.'”
In terms of what is and is not an appropriate role for government institutions to play in our lives, I think providing accurate statistical information about the country is very good whereas providing inaccurate statistical information is not very good. And this seems like a much bigger deal than the largely hypothetical and low-stakes privacy concerns.
Someone needs to right this ship
The Census Bureau has a brand new director, Robert Santos, who was just sworn in on January 5. I hope he’ll change course on this. And I wouldn’t be too upset if people at the Department of Commerce or the White House encouraged him to take action.
But it’s a bit complicated. The Deputy Director of the Census, Ron Jarmin, was acting director all last year and also for 18 months of the Trump administration. And we know from recent reporting that Jarmin did some pretty heroic work in stopping the Trump administration from meddling with the main decennial Census. The independence and professionalism of the Census Bureau are really important to the country, and the value of that independence in general — and of Jarmin being a stickler in particular — was just proven.
So I think there will be a reluctance to be seen as meddling with the Census professionals, even though in this case it’s about a fundamentally non-partisan issue. And I think that reluctance is at least somewhat warranted.
The people who really ought to meddle here are members of Congress. The Bureau’s view is that these dramatic compromises to the accuracy and usefulness of Census data are required by Title 13 of the U.S. Code, which was passed in 1954. In other words, it’s not the case that Congress passed a new law mandating the use of these new procedures. Nor was this litigated in a way that ended with a federal judge ordering the Census Bureau to use these new procedures. Instead, the Bureau looks at generic statutory language saying that they “may furnish copies of tabulations and other statistical materials which do not disclose the information reported by, or on behalf of, any particular respondent” and says that because computers are now more powerful, they need to start obscuring the data.
You can follow the logic that led them to that conclusion. But the idea that their hands are tied by the statute is a little silly.
And if members of Congress don’t want to see the Census wrecked, it would be very constructive for them to say so — loudly and clearly, and ideally in a bipartisan manner. And if that doesn’t cause the Bureau to change course, they should write a new law that specifically directs the Bureau to keep producing its traditional products until they are actually told to stop.
[1] Correction: I originally said this was ACS data when it’s actually from the CPS.
I get the intellectual privacy concern, but people need to realize we lost that war without a fight over 20 years ago. All your data is easily acquired by anyone who wants it. There is nothing you can do about it, unless you are willing to live in a cave. And even then, wait til the satellite cameras get better resolution....
No doubt, this is bad for people who really do need protections - such as people trying to flee abusive ex spouses, people in witness protection, etc. But being mad about that is like being mad at an asteroid coming at you from space... be mad all you want, but the asteroid doesn't care about your feelings.
"privacy concerns" are a mind virus that infects all of the most neurotic, low-trust people in this country. they have been a real blight to the discourse on all kinds of issues. never once have I heard an American do a sky-is-falling routine about how some thing is going to destroy their privacy and then have that thing actually turn out to be bad. it's ok for people to know things about you. they won't use it to hurt you. see a therapist. etc.