First off, I had to chuckle at the self-implementation of Betteridge's Law: posing a question in the headline, and then immediately refuting it in the byline.
I do wish this article made a better distinction between legal and illegal immigration. I get how estimating illegal immigration might be challenging, but legal should be extremely straightforward: we scan passports at the border, so at least to get a global count of "how many people entered/left the country on net", it should just be a matter of adding things up, right? This article seems to imply that that's not what we do (why?), and that we instead have to resort to post facto estimations, which feels nuts to me.
Counting like this would probably help with illegal entries/exits too: everyone who scanned out but not back in in N days is an emigrant, everyone who scanned in but not out in N days is an immigrant, even if they did not declare as such (aka, they are here illegally). This obviously does not account for on-foot border crossings, but it seems like such crossings without interdiction are pretty rare nowadays, since most of those folks want to get in the asylum queue and on the "official" rolls anyway.
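To make the N-day rule concrete, here is a minimal sketch of how it could be computed from scan records. Everything in it is an assumption for illustration: the `(traveler_id, direction, day)` record shape, the function name `count_migration`, and the guess that a traveler whose first recorded scan is an exit was already living here. It is not a description of how any agency actually counts.

```python
from collections import defaultdict
from datetime import date, timedelta

def count_migration(events, as_of, n_days):
    """Count (immigrants, emigrants) from border scans using an N-day rule.

    events: iterable of (traveler_id, direction, day), direction is "in" or "out".
    Assumption: a traveler whose first recorded scan is "out" started as a resident.
    """
    scans_by_traveler = defaultdict(list)
    for traveler_id, direction, day in events:
        scans_by_traveler[traveler_id].append((day, direction))

    threshold = timedelta(days=n_days)
    immigrants = emigrants = 0
    for scans in scans_by_traveler.values():
        scans.sort()
        resident = scans[0][1] == "out"
        for i, (day, direction) in enumerate(scans):
            # How long this state lasted: until the next scan, or until today.
            next_day = scans[i + 1][0] if i + 1 < len(scans) else as_of
            stayed_long = next_day - day >= threshold
            if direction == "in" and not resident and stayed_long:
                resident = True
                immigrants += 1
            elif direction == "out" and resident and stayed_long:
                resident = False
                emigrants += 1
    return immigrants, emigrants

# Hypothetical scans: a new arrival, a tourist, a departure, a vacation, a recent arrival.
example_events = [
    ("a", "in", date(2024, 1, 1)),
    ("b", "in", date(2024, 1, 1)), ("b", "out", date(2024, 1, 10)),
    ("c", "out", date(2024, 2, 1)),
    ("d", "out", date(2024, 3, 1)), ("d", "in", date(2024, 3, 15)),
    ("e", "in", date(2024, 6, 20)),
]
result = count_migration(example_events, as_of=date(2024, 7, 1), n_days=90)
```

Tracking a resident/non-resident state per traveler is what keeps a tourist's exit or a resident's return from a vacation from being counted as migration; the rule as literally stated would miscount both.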
Maybe I'm misunderstanding something, but it feels like agencies are working in silos, rather than sharing data and cooperating to come up with a single, canonical "here is the exact number of ins and outs" data set.
As far as I'm aware, we actually don't have complete records of who has left the country—as we don't do exit controls the way that a lot of other countries do. Like you know how a lot of countries do passport control and stamping both when you enter and leave? We only do it when you enter.
I imagine CBP still has access to data on who leaves by plane (despite the lack of exit checkpoints at airports). And perhaps they can ask Canada for data on land entries? And maaaybe Mexico? So that would help get some of the data. Still, I'm not aware of any law saying that foreigners have to enter and leave the U.S. using the same passport. So for example, a dual citizen of Germany and Turkey could enter the U.S. on their German passport and then go to Canada and enter Canada on their Turkish passport, and we would not have records clearly linking this as a single person entering and exiting the country. We could try doing probable matches based on name and DOB. But I guess the point here is just that things are not as straightforward as it may sound.
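For what "probable matches based on name and DOB" might look like, here is a rough sketch using Python's standard-library `difflib`. The record fields (`name`, `dob`, `doc`), the sample records, and the 0.85 threshold are all made up for illustration; real record linkage would need far more care with transliteration, shared birthdays, and false positives.

```python
from difflib import SequenceMatcher

def name_similarity(a, b):
    """Similarity ratio in [0, 1] between two names, ignoring case and outer spaces."""
    return SequenceMatcher(None, a.lower().strip(), b.lower().strip()).ratio()

def probable_matches(entries, exits, threshold=0.85):
    """Pair entry and exit records with identical DOB and similar-enough names."""
    matches = []
    for entry in entries:
        for exit_record in exits:
            if entry["dob"] != exit_record["dob"]:
                continue  # require an exact date-of-birth match
            score = name_similarity(entry["name"], exit_record["name"])
            if score >= threshold:
                matches.append((entry["doc"], exit_record["doc"], score))
    return matches

# Hypothetical records: the same person on two passports with different spellings.
entries = [{"name": "Mehmet Oezdemir", "dob": "1990-05-01", "doc": "DE123"}]
exits = [
    {"name": "Mehmet Ozdemir", "dob": "1990-05-01", "doc": "TR456"},
    {"name": "John Smith", "dob": "1990-05-01", "doc": "CA789"},
]
linked = probable_matches(entries, exits)
```

The threshold is the whole game here: set it too low and unrelated people with the same birthday get merged, too high and spelling variants across passports are missed.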
This does get to your other point of working in silos. Data silos are a huge problem in government (and any organization)! Although it's worth noting that data silos in government are sometimes intentional. For example, you don't want veterans to avoid telling VA health staff about their drug problems for fear that this will get reported to law enforcement. I'm not sure how much re: immigration data is intentional or not.
Exactly! Folks don't appreciate how challenging it can be to define "everything", "crosslinked" and "one big database" in a way that all stakeholders will agree and funding can be found.
NB: That isn't a reason to not engage in the challenge! Ignoring data management issues means they just fester in perpetuity. It's a difficult problem that requires diligent people chipping away at it tirelessly over time.
It may be hard, but it's probably easier than going to the moon or building the interstate highway system, and we did that. If the government says "between 1 and 3 million people immigrated last year, a threefold difference in estimate", and I say, "That's a very bad estimate, have you tried just counting people at this fixed set of already highly bureaucratized checkpoints", and they say "synchronizing that data would be hard, we'll just stick with the comically imprecise estimate, thanks", I will be (justifiably) disappointed in my government.
Yes, that is what I'm asking for. I get that the government is big and unwieldy, but it's a bit sad that its capacity is so low that it cannot functionally work with...itself.
That’s not *low* capacity - it’s a problem of capacity being so *high* within each branch that other branches can’t keep up. When God creates a rock that is too heavy for him to lift, that’s as much a sign of how great his weight creation capacity is as how low his lifting capacity is.
I think you should consider how many other large organizations you interact with (companies or universities or other governments etc) that also have this problem despite much smaller scale and often fewer regulatory constraints than the US government.
One of the estimates described in the article (the higher CBO estimate) is based on counting people at exactly those check points. But you will note from the article that experts do not treat this as a definitive answer. Because this is hard!
Once when I was talking to a friend who had gone to work at Google, he talked about how at scale, even simple things, like counting how often something happened, became an intellectually interesting challenge. Sometimes problems are actually hard to solve.
Having literally worked on large distributed system synchronization at Google, I feel confident in claiming that yes, consistent, replicable, partition-tolerant and fast databases are hard, but also a. there are well-worn off-the-shelf solutions to this problem from literally dozens of vendors (including Google!), b. the requirements of this proposed database are MUCH looser than Google's (ex: there are basically no upper bounds on latency), and c. this is still a simpler project than ex: building the Eisenhower Tunnel (though maybe this is the "stuff I know is easy, other stuff is hard" bias talking).
I think the problem is what I was alluding to above: state capacity generally, but especially as it pertains to IT, is comically low right now. See the healthcare.gov fiasco for one prominent example. Claiming that a thing that thousands of organizations the world over do all the time is too hard and too special for the US government to pull off is just one instance of this worrying trend, and a perfect example of Alon Levy's "incuriosity of American planners" syndrome.
Ironically, a friend of mine who is a dual citizen of the US and Israel (and mostly grew up in the US) found out once when he was entering Israel that they had believed he had been in Israel consistently for a decade and not the US.
I agree there are edge cases that make this hard. But how often does the case you describe (or others we could come up with) really happen? If we figured out a decent way to aggregate four sources of data, we'd be 99% of the way there: 1. ins/outs at (air)ports, 2. the same at the Canadian border and transitively via their ports, 3. same for legal crossings to/from Mexico, and 4. illegal crossings. I concede that 4 has some murkiness, but for the other three, getting very close to the ground truth seems quite achievable.
If overstaying a visa is one of the most common forms of illegal immigration, why haven't people been pushing for exit controls to catch people who have overstayed?
Knowing that someone hasn't left does not qualify as "catching" them. Nor would it, in most cases, really accelerate catching them which would likely only happen when they interact with law enforcement.
Sure, but they have a flight manifest. Again, this wouldn't cover all cases (I could drive to Guatemala to board my flight to Berlin), but doing some reasonable guesswork around the edges would produce a much better estimate than the wild variance this article is addressing. It feels like our current estimate is at the "how many ping pong balls fit in a 747?" level, when we have the ping pong ball receipts just lying around in a filing cabinet.
Who's 'they' here in 'they have a flight manifest'? Does the federal government, or does the airline? Genuinely don't know. But if the federal government, which bit? Like, to give you an example, I work for an agency in the Pacific Northwest. We manage a lot of land for the federal government, so do several other agencies (BLM, USACE, BPA, BOR, FS, Parks)--there is no one 'federally owned land' database. If BPA wants to figure out what rights the BOR has over property, they have to either go to the county, or go to BOR, or both.
Now, this drives me up the wall, because the Real Estate stuff (and survey data/aerials/etc) is basically never confidential, or protectable under FOIA, so I'd really like it to all just be publicly posted in a nicely searchable database, so I don't have to deal with FOIA requests for it.
The problem with that is:
1) Building that database and maintaining it is expensive and not something we're funded for.
2) So long as the files are internal, it's fine that they're PDF scans of 50-year-old records that you have to get real close to and do some guesswork to actually read. But if they're publicly posted, you need to make them 508 (web accessibility for the disabled) compliant, which is not easy generally and really not easy for documents like this. (By the way, 508 compliance is also a main reason 'release to one is release to all, just publicly post almost everything anyone requests under FOIA in one big database to make things easier/actually serve the public's alleged need for this information' doesn't happen, nor does a lot of publicizing of information.) It's a good example of a place where the idea is good and generally makes sense (especially for government websites you need to access to, say, apply for disability), but overbroad application means stuff is just not posted for anyone, rather than posted and accessible to everyone.
But maybe more advanced tech tools will help with that...
And this has been ECD's 'let's make a topic about one thing into me ranting about FOIA issues...'
I'm impressed to the degree that the government seeks to accurately maintain and quickly access the "No Fly List". If they can! But that's relatively simple and with extremely high motivation behind it. The idea that they would somehow use flight manifests to monitor people overstaying visas reflects a view of the governmental panopticon that only exists in movies.
They don't accurately maintain the "No Fly List"! It has all sorts of terrible problems with accuracy, missing people, wrong people, name collisions, etc.
You're telling me those action rooms, where a dozen people with two dozen big-screen TVs have access to all the data and cameras in all the world, aren't real?
As someone who interacts a lot with consumer-facing public lands tools, are you trying to say that sites like the USGS PAD-US are incorrect? Or that they operate at an insufficient level of accuracy around the margins? I’d love some object-level clarification, maybe getting in the weeds here, because this is something much more interesting and applicable to me personally than it likely is for the vast majority of commentators here.
Not particularly familiar with PAD-US, but given that they straight up say they're "the best available aggregation of federal land and marine areas provided directly by managing agencies, coordinated through the Federal Geographic Data Committee Federal Lands Working Group." I don't think they're claiming completeness or complete accuracy. I'm the best basketball player in my family--but that's REALLY not saying much.
However, I will say that I know that my agency has been working on GIS mapping data and has some tools that are reasonably accurate. But in many cases, what you'll see is people pulling from multiple data sets which won't actually align. So, just recently we were dealing with a local county attempting to sell federal property, because their county assessor claimed they owned it and their own records had been lost in a fire ~20 years ago. We were able to produce our deed for it, but if folks are pulling from multiple data sets, how do they resolve discrepancies?
I'm quite confident PAD-US is, at minimum, incomplete, because I happen to know agencies which do not have complete GIS databases of their lands and problems are always being discovered with the existing ones (stuff gets left out, or mapped incorrectly in the system, or labelled incorrectly--nope, we have an access easement over that property, it's not fee property, but someone clicked the wrong checkbox when entering it and now our system thinks we do!).
But, to be clear, my original complaint was actually a bit different. There's two issues, there's 'who owns what property?' and there's 'what document proves that?' My original complaint was about the latter. In most cases it'll be a deed, a judgment, or a declaration of taking and it's those underlying documents that are a medium pain to share, rather than being in a nice database. Now, a REALLY nice database would nicely connect those documents to mapping which showed their physical location and I think most agencies are in fact working on such, just for internal management. In a perfect world, once those are in reasonably good shape, someone will realize 'hey this is a bunch of useful data, let's just make it available to the public.'
Then it will be my job to say 'great idea, just spend ~5-10 million dollars making it all 508 compliant and we can do it!' (Note: cost estimate sourced directly from my posterior). Well, and also 'here are the fifty three caveats to put on the data, so when people discover errors they have a harder time suing us about them.'
If there's more specific questions you've got, I'm always happy to chat about this stuff though!
Yeah, this is fascinating stuff. I’m used to using PAD-US and other, often third-party tools (OnX, CalTopo etc, all of which I believe are based on an Esri dataset with a few tweaks) but all of this is typically for recreation uses — which makes it easier because I imagine NPS/FS/BLM/FWS land types are among the best-attested. The PAD-US goes further in specifically trying to incorporate all land covered by a recreation / conservation easement as well (obviously much more difficult) but I imagine USACE perhaps or BOR land has more issues with data quality? I suppose what I’m really curious about here is which types of federally-owned land you typically have an issue with?
Tracking ins and outs wouldn’t really work for immigration (the vast majority of ins and outs aren’t immigrants) but it would be interesting to know what the annual and longer patterns are in foreigners present in the country. Not obviously worth enough to make people wait in an extra line while leaving the country though.
The CDC Natality Database has a fairly complete database of all births that take place in the US, with very detailed stats on every mother (and father, if known) including country of origin. So one can get some very detailed counts of how many Haitian mothers gave birth each year in Ohio (about 50 in 2020, now closer to 800, split mostly between Franklin County (Columbus) and Clark County (Springfield)).
More to the point of this article, total births to non-native mothers appear to be slightly increasing over the past 3 years: 832k, 843k, 856k, while native-born-mother births are slightly falling: 2.8M, 2.7M, 2.7M. The 2024 number of 856k is about 3% higher than the 2022 number of 832k, which is much more in line with the ACS measurements than the CBO's.
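For anyone who wants to check the arithmetic, here is a small sketch using only the rounded figures quoted above. The 2023 label for the middle number is inferred from "past 3 years", and the helper name `pct_change` is just for illustration.

```python
# Births to foreign-born mothers, as quoted above (rounded to the nearest thousand).
foreign_born_births = {2022: 832_000, 2023: 843_000, 2024: 856_000}

def pct_change(old, new):
    """Percent change from old to new."""
    return (new - old) / old * 100

change_2022_to_2024 = pct_change(foreign_born_births[2022], foreign_born_births[2024])
print(f"{change_2022_to_2024:.1f}%")  # prints 2.9%
```

Because the inputs are rounded to the nearest thousand, the result is only good to about a tenth of a percentage point.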
Of course counting births is an odd way of trying to count immigrants, but it has some pros depending on your use case, especially if you were interested in future demographics, or immigrants who are likely to stay, or even crime as men are violent around the same ages that women give birth. And compared to survey data, it's very complete and detailed. Nearly every birth is captured.
I also made an interactive tool using this natality data that reports a lot of the information the CDC captures in various ways:
Really interesting! But the kinds of annual changes that make big shifts in continent of origin of immigrants over three years might also cause major changes in the gender composition, and age, and natality intentions of those immigrants too, so I would be very careful about estimating changes in total immigration based on changes in births to foreign born mothers.
It's just another data point. I think it has value for several reasons and one of those is that it's just a very different data source. That independence can add value, but also adds risks as you're pointing out.
If I was tasked with getting a more accurate answer next week, my overall strategy would be to try to gather several more independent data sources and triangulate or average or blend them together. If one source is way out of line with the others, then something might be wrong with that source, like the shifts you're talking about. But you have to collect a lot of sources to really know which are out of line :)
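A toy version of that "blend and flag the odd one out" idea, just to show the shape of it. The source names and numbers are placeholders, not real estimates, and the 50% tolerance around the median is an arbitrary assumption.

```python
from statistics import median

def blend_estimates(estimates, tolerance=0.5):
    """Average the estimates, setting aside any that sit far from the median.

    estimates: dict of source name -> estimate (same units for all sources).
    tolerance: allowed relative distance from the median (0.5 means 50%).
    Returns (blended_value, set_of_flagged_sources).
    """
    center = median(estimates.values())
    flagged = {
        name for name, value in estimates.items()
        if abs(value - center) / center > tolerance
    }
    kept = [value for name, value in estimates.items() if name not in flagged]
    return sum(kept) / len(kept), flagged

# Placeholder numbers (in millions), purely hypothetical.
hypothetical = {"source_a": 1.0, "source_b": 1.2, "source_c": 1.1, "source_d": 3.0}
blended, flagged = blend_estimates(hypothetical)
```

With only two or three sources the median itself is shaky, which is the point about needing a lot of sources before you can say which one is out of line. A flagged source is a prompt to investigate, not proof that it is the wrong one.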
One need not be a nativist or anti-immigration to think that it is damning that federal agencies have such wildly different estimates of annual immigration that they differ by three times. That is an implicit confession from the federal government that we have poor control over our borders. If the GOP weren't the stupid party, it would be talking about that one factoid around the clock.
It is also objectively nuts that "Questions about birthplace and citizenship are not currently asked on the decennial Census." (Yes, I understand why -- in theory not asking about that stuff drives compliance -- but still, c'mon, man.)
I think you're confused about the direction of the error. The administrative data, from the Border Patrol and Immigration, has the high count, while the survey data, which could capture people who come in illegally and are not caught, has the low count.
That just happened in the last year or so. Prior to that the estimates were a lot closer. So let's not be too quick to "damn" them but, as Jed says, recognize that something went out of kilter recently and we need to dive into that and figure out why.
It's data. Sometimes they're just messy and work malevolently to fool you.
This reflects a misunderstanding of the purpose of these data products. They're not border security report cards, and if you try to use them as such, you'll find they're not well suited.
You bring up the decennial Census not asking questions about birthplace or citizenship and seem to understand the reason why. Yet you object, because you want the decennial Census to be something it isn't. If you want to add language about quantifying levels of illegal immigration to their mission statement, go for it. But at some point, when you're creating data products as large and complex as the ones we're talking about, you have to make hard decisions about priorities and goals. Do you want to support constitutionally required Congressional redistricting by getting as accurate a count as possible of the number of people in different places? Or do you want to try to count the immigrants? These goals are, to a degree, at odds, and leaders must prioritize.
You are wrong. I just read the relevant text in the Constitution, Article 1, Section 2:
" The actual Enumeration shall be made within three Years after the first Meeting
of the Congress of the United States, and within every subsequent Term of ten Years, in
such Manner as they shall by Law direct, with the primary aim of counting criminals and mental asylum patients entering these United States from Mexico."
I am not sure that's the only reason they don't ask about citizenship on the census. There was a ton of litigation on this during the Trump administration. Liberal groups vehemently opposed asking and I assume that vehemence was based on the political implications of such a question in allocating seats in Congress, not merely a concern about compliance with the census.
Does it need to be this hard? Like, my takeaway is that we have the full might of the world's wealthiest government with its most powerful and far-reaching intelligence collection agencies, and we are just sort of the Spider-Man pointing at each other meme when it comes to collecting immigration information?
This kind of stuff makes me want to run screaming into the woods.
I think this is an excellent example of something to defer to the experts on. If the well-intentioned economist who was an under secretary of commerce says it's hard, it's probably pretty hard!
The "if we really wanted to do that" is the key. We do want and need to know things like how many people are in the country, and what is the size and shape of the labor force. But we want to know those things as tools to accomplish other tasks, not as ends in themselves.
Having ever sharper drill bits helps but is neither necessary nor sufficient for the Slow Boring of Hard Boards which is really the ultimate task the government needs to accomplish.
Is it harder or easier than developing a vaccine, building a high speed rail system, making everyone pay their taxes, or convincing people not to vote for a lying, cheating, sexually assaulting maniac?
Anything involving millions of people is hard, especially if you want it done accurately, and without violating a lot of basic assumptions about rights.
Experts making forward-looking predictions might easily differ by 2.5x, especially when exponential growth is involved. However, when experts differ by 2.5x on a current headcount, I'm pretty confident I could do better. Give me a staff and a budget and I could definitely do better.
These government experts don’t have much skin in the game. They don’t profit from being right. They aren’t really punished for just sticking to the old methodology on autopilot. They just put in their 40 hours and get their step increases. There is no incentive to hit it out of the park, and strong incentives to keep your head down.
"Give me a staff and a budget and I could definitely do better." ... "There is no incentive"
Do you mean to say you would do better because of your talents or you would do better because someone would give you incentives?
If the issue is incentives, then David Abbott is not the key piece of the puzzle. If the issue is you're more talented then why would you have incentives that the staff experts don't have?
In either case - the experts don't have the staff and the budgets. They're using data that happens to be collected for other purposes and applying it to a more niche problem that it wasn't designed for. If you give me a staff and a budget that was specifically designed to count immigrants I guess I could also do better, but so could anyone.
I think the positions of "this is a difficult issue with confounding incentives and perennially small budgets" and "even still, estimates that differ from each other by a factor of three are a bit ridiculous, and we should do better" are perfectly compatible positions to hold at once.
I wonder how peer countries are doing? I have little data, but I would bet their estimates are much more accurate, and obtained for similar costs or less. As is the case with high speed rail. Or health care. Or automated tax filing. Or...
It seems to often be the case that we have local experts confidently claiming that X can't be done, while every other OECD country has been outperforming us at X for decades.
If you want a “peer country” (there aren’t real peer countries to the US), Sweden’s system is so bad that there is disagreement as to whether or not the net emigration this year is real or a statistical artifact of investigating suspected welfare fraud and the state has been unknowingly giving benefits to people living outside of the country (eg the Iraqi defense minister: https://en.wikipedia.org/wiki/Najah_al-Shammari).
I find the definitions of "peer countries" to usually be quite arbitrary and not well thought out. I wouldn't call Australia, Japan or NZ "peer countries" here, because their immigration situation is entirely different as island nations.
Countries in Asia also have very different takes on civil liberties, and, in common with Europe, as usual, have been severely underperforming the US in terms of assimilating new immigrants for decades.
Another reason I wouldn't expect Europe to be any better, if they can even be considered peers, is they, as usual, have far stricter privacy laws, which for decades have made much of their civic and economic progress ludicrously difficult.
In terms of response rate, this is one where the peer countries would seem to have much in common with the origin country of immigrants and the ancestral origin of US citizens. Response rates drive the accuracy of census data, and response rates in the US are highly linked to ancestries, ie, citizens and immigrants linked to W Europe have census response rates similar to W Europe, citizens linked to ancestry or origin in Mexico have response rates similar to Mexico. So if we're talking about the SW US, in many ways Latin America is our closest peer.
Wait until you get into the weeds of the Dictionary of Occupational Titles, what’s supposed to be a comprehensive look at our labor force for purposes like disability claims, but which hasn’t been updated since 1991 despite endless federal committees dedicated to updating it, is full of outdated jobs like “carnival weight-guesser” and “telegraph operator”, and which causes interminable appeals as two experts who rely on it can confidently disagree on whether 5,000 or 500,000 of a particular job currently exist.
They better be *rewarded* for sticking to the old methodology on autopilot, and *punished* for making changes in methodology that aren’t amply tested and validated and understood in a way that minimizes incomparability of data from year to year.
You clearly haven’t worked with data collected at a large scale before. All these problems are not unique to government and are things that data scientists and data engineers deal with to varying degrees at private companies large and small. Data collection and analysis is just way more complicated to actually do than it is to talk about!
Different silos are trying to measure different things, and care about different types of errors differently. There’s no way that reconciled numbers will serve any of these purposes as well as the different estimates do, even though those of us outside the agencies might have interests that are better served by reconciled numbers.
I remember when the CIA World Factbook was one of everyone's go-to resources for global and economic data and how lacking that was in retrospect. And it's not like the private sector necessarily had something better sitting around, especially considering how many private sector economists rely on census data and other government sources.
It's completely unsurprising. The "full might of the world's wealthiest government" also is incapable of building high speed rail, balancing its budget, winning a war against a much smaller Third World country, controlling the number of people who cross our border, and keeping health care costs to less than 1.5X per capita what anyone else in this world pays.
Also, "the full might of the world's wealthiest government" isn't working on this. The Census Bureau, for example, is around 0.2% of the Federal Budget. And it's subject to all sorts of restrictions and limitations, many (all?) of which you wouldn't at all want lifted.
Trying to estimate immigration levels using government survey data when a majority of new arrivals are illegal is insane. It’s like trying to estimate drug dealing by giving 50 million people a survey and asking “how many times did you deal crack last year? how about meth?”
All of the technical “refinements” built on these idiotic assumptions are deck chairs on the titanic. An honest estimate would look for places that hire illegal immigrants and covertly document employment levels and layoff patterns. You would need other data points, but relying on people who are here illegally to fill out a survey and not trying really hard to capture non responders is not intellectually honest.
"An honest estimate would look for places that hire illegal immigrants and covertly document employment levels and layoff patterns"
This would pretty obviously also have its own flaws. In any case, there's no obvious way to solve the problem of counting people that are undocumented, short of ramping up or improving on existing counting methods. Commenters are acting like this is easy.
I never said this is easy, far from it. I’m not sure it’s possible to consistently come within 10 or even 15% of the true figure. However, for two estimates to be off by 2.5x is like Rasmussen saying Harris has 25% of the vote and the NYTimes saying she has 62%. Someone’s methodology is crazy.
It’s not like that, because a Rasmussen survey is measuring percentages, not absolute numbers. Estimates of percentages that vary by 37 percentage points are very different from estimates of totals that differ by a factor of 3.
Both could result from sampling mechanisms that happen to have very large bias for the quantity in question.
But the fact that this occurred just in the past year or two suggests that the methodologies were working pretty well until recently.
You are the one claiming to know what proportion of immigrants are in the country legally. I’m saying it’s hard to measure.
Fwiw, there was a huge increase in 2022/23 in the number of unauthorized immigrants apprehended near the border. That certainly suggests the proportion of new immigrants who are unauthorized has increased.
You're the one claiming that "a majority of new arrivals are illegal." Either back that up or don't. Saying that there's been a year-to-year increase in unauthorized border crossings in no way actually backs up what you claimed.
To the specific point, it's pretty clear that most immigration over the last couple of years has not been primarily planned and approved immigration. It's mostly been a mix of illegal immigration and asylum claims. The latter is in a grey zone where it's not illegal exactly, but given the abuse of asylum claims it's not really what was intended either.
See here for a visualization and more detail using the CBO estimates:
I'm a bit of an immigration maximalist, so would solve for this by opening up legal immigration dramatically, but we should deal with the facts that are, not the facts we might want.
Asking people how often they engage in a behavior is an extremely well-established way to estimate the prevalence of said behavior. It's not perfect, but it's extremely widely used.
You’d also want to just randomly survey sewage for metabolites. If overall concentrations are too low to measure, you could even check the sewage of high use places like hotels and night clubs. This would be expensive but if you really want to get the number right, you just find as many plausibly important parameters as possible, see if they provide meaningful data, and then argue over what coefficient each should have.
Whether one parameter gets 20% or 25% of the weight generally won’t affect the estimates by much.
The conceit of survey data is that it’s very good at pretending to be comprehensive and subject only to random sampling error. You can easily ask people anything. You can’t conveniently measure the sewage at all the night clubs; it’s expensive and it’s obviously only one piece of the puzzle. That sort of data is very valuable, but only once you give up on an exact answer and commit to looking at multiple parameters.
I’d start by looking at overdose and DUI deaths. I’d also look at arrest data and interview narcotics detectives in major metro areas, then mash them together to come up with some sort of aggregate.
Ever since the 2020 census revised NYC’s population upward by around 5% (along with many other urban core areas) I have been relatively skeptical of how the ACS weights work around inner cities; not coincidentally many immigrants tend to concentrate around the urban core. I was hoping the CB had done some work to unbias ACS estimates in the future but…maybe not?
Needless to say the rollout of DP has really shaken my confidence in census bureau decision making.
There are a fair number of commenters who seem disturbed at the idea that we don't do this much better, and wonder why we can't/don't.
But that overlooks why we're trying to do this at all. We do want and need to know things like how many people are in the country, and what is the size and shape of the labor force. But we want to know those things as tools to accomplish other tasks, not as ends in themselves.
Having ever sharper drill bits helps, but is neither necessary nor sufficient for the Slow Boring of Hard Boards, which is really the ultimate task the government needs to accomplish. You could spend an enormous amount of time sharpening your drill bits. That doesn't actually get you a hole in the wood.
I was genuinely confused about whether illegal/unofficial immigration was included in this explanation ("illegal", "undocumented", and "asylum" do not appear at all in the piece) until I saw a mention of border encounters.
For the last twenty years at least, the estimate of illegal immigrants in the country has been 11 million. This number is invariably trotted out and seems to be carved in stone.
It looks like there might have been some major change in the 1980s leading to a greater rate of influx. It then took 20 years for that to build up until the rate of outflow roughly equalized influx. There have been smaller changes since then leading to increases and decreases in what the rough steady state would be, but it seems likely that the numbers in the 1990s were so low only because we hadn’t yet reached steady state.
Count me as another person who is surprised that this is so hard.
I get that estimating the number of illegal/undocumented immigrants is hard. But legal immigrants? Legal immigrants are documented to within an inch of their lives. Before I could enter the US as an F-1 student, I had to submit the proper paperwork. My passport was stamped when I entered (in Blaine, Washington), and then, as a condition of remaining an F-1 student in good standing, I had to inform the USCIS (US Citizenship and Immigration Services) within 30 days if I changed my address. The USCIS knew who I was and where I lived. Then, when I applied for my green card, that meant a bunch more paperwork, and when I applied for and eventually received my citizenship, there was a bunch of paperwork too (including a shiny new naturalization certificate).
How hard is it to keep track of all this info in some database? If you want to anonymize it for research purposes, changing "drosophilist became a US citizen on July 12, 2018" to "A39815Z22GJ09 became a US citizen on July 12, 2018" should be pretty trivial, no?
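For what it's worth, the name-swapping step described above can be sketched in a few lines. This is only an illustration: the function name `pseudonymize`, the secret key, and the 13-character code length are all made up here, and nothing suggests any agency does it this way. It uses a keyed hash so the same person always maps to the same code without the code revealing the name.

```python
import hashlib
import hmac


def pseudonymize(identifier: str, secret_key: bytes, length: int = 13) -> str:
    """Map an identifier to a stable, opaque code using a keyed hash (HMAC-SHA256).

    The secret key and the code length are illustrative assumptions.
    """
    digest = hmac.new(secret_key, identifier.encode("utf-8"), hashlib.sha256)
    return digest.hexdigest()[:length].upper()


key = b"hypothetical-secret-key"  # assumption: kept private by the data holder
code = pseudonymize("drosophilist", key)
print(f"{code} became a US citizen on July 12, 2018")
```

Swapping the name for a code is the easy part. The attached details (an exact naturalization date, an address history) can still point to one person, so this alone probably does not make a research file anonymous.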
One part of the answer may be that the use cases are most interested in how many residents we have, regardless of the various statuses. So we want to know how many US citizens currently live here, and not accidentally count those who have left (to work in another country or just died). Ditto for legal immigrants who are in various stages of the visa, permanent residency or citizenship pathway.
It's not even all that easy to track which native-born citizens have died or left the country, let alone these other categories. Maybe fully legal, naturalized citizens would be some of the easiest to track, but it still could be tough.
Great choice, thank you for running it. I hope readers keep this column in mind when Matt writes about macroeconomics. If the inputs to macroeconomic calculations are demonstrably very error-prone ...
Except they’re not! In one or two years out of several decades, there have been major differences in two estimates. But most of the time the two are pretty close. Maybe it’s more error-prone than you imagined, but it’s less error-prone than people who actually do data collection might have guessed!
Say I'm here in the US illegally and someone from the "ACS" (whatever *that* is) is in front of me asking me if I'm a citizen and where I come from. How would I answer?
Trick question! I'm doing my roadrunner imitation and beating a hasty retreat before he opens his mouth.
And yet, we know that people do in fact answer. It's not like the survey designers don't think ahead of time about the challenges they face and try to counteract them.
There is a world in which an accurate count of migrant flows for the US has been tried and found hard. There is also a world in which it has been found inconvenient and left largely untried.
How do I, as a data nerd, tell the difference between these two worlds?
Accessible-but-dry, a few too many weeds (had trouble keeping all the acronyms straight, for example); lot of ink to answer "no" to the headline question. Avoids the obvious followup of "yes, correctly controlling the labour demand knob is important, but what about the labour supply knob?"...which is certainly one way to assign the dependent variable, I guess. But I do appreciate hearing from a SME. James Scott would have loved this post, RIP. *pours one out*
From one nerd to another, I think I speak for the community when I say, “Thank you for your service”.
SB is probably the only place in the country where an economist would get that kind of welcome! 🤣
Author called the census microdata release a “real goldmine.” What a nerd, cool hearing from a subject matter expert on this issue. Cool read!
A little deep in the weeds for me, but I appreciate the author for contributing the piece and SB for publishing it!
First off, I had to chuckle at the self-implementation of Betteridge's Law: posing a question in the headline, and then immediately refuting it in the byline.
I do wish this article made a better distinction between legal and illegal immigration. I get how estimating illegal immigration might be challenging, but legal should be extremely straightforward: we scan passports at the border, so at least to get a global count of "how many people entered/left the country on net", it should just be a matter of adding things up, right? This article seems to imply that that's not what we do (why?), and that we instead have to resort to post facto estimations, which feels nuts to me.
Counting like this would probably help with illegal entries/exits too: everyone who scanned out but not back in in N days is an emigrant, everyone who scanned in but not out in N days is an immigrant, even if they did not declare as such (aka, they are here illegally). This obviously does not account for on-foot border crossings, but it seems like such crossings without interdiction are pretty rare nowadays, since most of those folks want to get in the asylum queue and on the "official" rolls anyway.
Maybe I'm misunderstanding something, but it feels like agencies are working in silos, rather than sharing data and cooperating to come up with a single, canonical "here is the exact number of ins and outs" data set.
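To make the "scanned in but not out in N days" rule above concrete, here is a minimal sketch. It assumes a hypothetical event log of `(traveler_id, direction, date)` tuples and a made-up function name `classify_travelers`; it is not a description of any real CBP data set.

```python
from datetime import date, timedelta


def classify_travelers(events, as_of, n_days=365):
    """Label each traveler by their most recent border scan.

    events: iterable of (traveler_id, "in" or "out", date) tuples (hypothetical format).
    A last scan newer than n_days is left undetermined.
    """
    last_scan = {}
    for traveler_id, direction, when in sorted(events, key=lambda e: e[2]):
        last_scan[traveler_id] = (direction, when)  # later scans overwrite earlier ones

    cutoff = as_of - timedelta(days=n_days)
    labels = {}
    for traveler_id, (direction, when) in last_scan.items():
        if when > cutoff:
            labels[traveler_id] = "undetermined"
        elif direction == "in":
            labels[traveler_id] = "presumed immigrant"
        else:
            labels[traveler_id] = "presumed emigrant"
    return labels


events = [
    ("A", "in", date(2022, 1, 10)),   # entered, never scanned out
    ("B", "in", date(2022, 3, 1)),
    ("B", "out", date(2022, 3, 15)),  # left, never scanned back in
    ("C", "in", date(2024, 5, 20)),   # too recent to call
]
print(classify_travelers(events, as_of=date(2024, 6, 1)))
```

Applied literally, the rule also labels a long-time resident returning from vacation as a "presumed immigrant", so a real count would need to know each person's starting status, and it depends on every exit actually being scanned.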
Re: "we scan passports at the border"
As far as I'm aware, we actually don't have complete records of who has left the country—as we don't do exit controls the way that a lot of other countries do. Like you know how a lot of countries do passport control and stamping both when you enter and leave? We only do it when you enter.
I imagine CBP still has access to data on who leaves by plane (despite the lack of exit checkpoints at airports). And perhaps they can ask Canada for data on land entries? And maaaybe Mexico? So that would help get some of the data. Still, I'm not aware of any law saying that foreigners have to enter and leave the U.S. using the same passport. So for example, a dual citizen of Germany and Turkey could enter the U.S. on their German passport and then go to Canada and enter Canada on their Turkish passport, and we would not have records clearly linking this as a single person entering and exiting the country. We could try doing probable matches based on name and DOB. But I guess the point here is just that things are not as straightforward as it may sound.
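A rough sketch of what "probable matches based on name and DOB" could look like, using only Python's standard library. The record layout, the function names, and the 0.85 similarity threshold are all assumptions for illustration; real record linkage is considerably more involved.

```python
from difflib import SequenceMatcher


def normalize(name: str) -> str:
    """Lowercase and collapse whitespace so trivial differences don't count."""
    return " ".join(name.lower().split())


def probable_match(entry: dict, exit_: dict, threshold: float = 0.85) -> bool:
    """Treat two records as probably the same person if the date of birth is
    identical and the names are similar enough. The threshold is an assumption."""
    if entry["dob"] != exit_["dob"]:
        return False
    score = SequenceMatcher(None, normalize(entry["name"]), normalize(exit_["name"])).ratio()
    return score >= threshold


# Hypothetical records: the same person travelling on two different passports.
us_entry = {"name": "Anna Schmidt", "dob": "1990-04-02", "passport": "DE"}
canada_entry = {"name": "Ana  Schmidt", "dob": "1990-04-02", "passport": "TR"}
print(probable_match(us_entry, canada_entry))
```

Even this toy version shows the trade-off: loosen the threshold and unrelated people with the same birthday start to merge; tighten it and transliteration differences between passports split one person in two.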
This does get to your other point of working in silos. Data silos are a huge problem in government (and any organization)! Although it's worth noting that data silos in government are sometimes intentional. For example, you don't want veterans to avoid telling VA health staff about their drug problems for fear that this will get reported to law enforcement. I'm not sure how much re: immigration data is intentional or not.
Also, "everything should be crosslinked in one big database" is easy to say in a comment section, hard to implement in practice.
Exactly! Folks don't appreciate how challenging it can be to define "everything", "crosslinked" and "one big database" in a way that all stakeholders will agree and funding can be found.
NB: That isn't a reason to not engage in the challenge! Ignoring data management issues means they just fester in perpetuity. It's a difficult problem that requires diligent people chipping away at it tirelessly over time.
It may be hard, but it's probably easier than going to the moon or building the interstate highway system, and we did that. If the government says "between 1 and 3 million people immigrated last year, a threefold spread between estimates", and I say, "That's a very bad estimate, have you tried just counting people at this fixed set of already highly bureaucratized checkpoints", and they say "synchronizing that data would be hard, we'll just stick with the comically imprecise estimate, thanks", I will be (justifiably) disappointed in my government.
Except that you’re asking not for a one-time project, but for ongoing interactions between different agencies.
Yes, that is what I'm asking for. I get that the government is big and unwieldy, but it's a bit sad that its capacity is so low that it cannot functionally work with...itself.
That’s not *low* capacity - it’s a problem of capacity being so *high* within each branch that other branches can’t keep up. When God creates a rock that is too heavy for him to lift, that’s as much a sign of how great his weight creation capacity is as how low his lifting capacity is.
I think you should consider how many other large organizations you interact with (companies or universities or other governments etc) that also have this problem despite much smaller scale and often fewer regulatory constraints than the US government.
One of the estimates described in the article (the higher CBO estimate) is based on counting people at exactly those check points. But you will note from the article that experts do not treat this as a definitive answer. Because this is hard!
Once when I was talking to a friend who had gone to work at Google, he talked about how at scale, even simple things, like counting how often something happened, became an intellectually interesting challenge. Sometimes problems are actually hard to solve.
Having literally worked on large distributed system synchronization at Google, I feel confident in claiming that yes, consistent, replicable, partition-tolerant and fast databases are hard, but also a. there are well-worn off-the-shelf solutions to this problem from literally dozens of vendors (including Google!), b. the requirements of this proposed database are MUCH looser than Google's (ex: there are basically no upper bounds on latency), and c. this is still a simpler project than ex: building the Eisenhower tunnel (though maybe this is the "stuff I know is easy, other stuff is hard" bias talking).
I think the problem is what I was alluding to above: state capacity generally, but especially as it pertains to IT, is comically low right now. See the healthcare.gov fiasco for one prominent example. Claiming that a thing that thousands of organizations the world over do all the time is too hard and too special for the US government to pull off is just one instance of this worrying trend, and a perfect example of Alon Levy's "incuriosity of American planners" syndrome.
Ironically, a friend of mine who is a dual citizen of the US and Israel (and mostly grew up in the US) found out once when he was entering Israel that they had believed he had been in Israel consistently for a decade and not the US.
I agree there are edge cases that make this hard. But how often does the case you describe (or others we could come up with) really happen? If we figured out a decent way to aggregate four sources of data, we'd be 99% of the way there: 1. ins/outs at (air)ports, 2. the same at the Canadian border and transitively via their ports, 3. same for legal crossings to/from Mexico, and 4. illegal crossings. I concede that 4 has some murkiness, but for the other three, getting very close to the ground truth seems quite achievable.
If overstaying a visa is one of the most common forms of illegal immigration, why haven't people been pushing for exit controls to catch people who have overstayed?
Knowing that someone hasn't left does not qualify as "catching" them. Nor would it, in most cases, really accelerate catching them which would likely only happen when they interact with law enforcement.
I think Matt S was suggesting nabbing or penalizing them as they leave?
Which might make sense or might introduce some fairly perverse incentives.
Ah...possibly...
Have you ever crossed the border? Hint: they don’t scan you on the way out.
Sure, but they have a flight manifest. Again, this wouldn't cover all cases (I could drive to Guatemala to board my flight to Berlin), but doing some reasonable guesswork around the edges would produce a much better estimate than the wild variance this article is addressing. It feels like our current estimate is at the "how many ping pong balls fit in a 747?" level, when we have the ping pong ball receipts just lying around in a filing cabinet.
Who's 'they' here in 'they have a flight manifest'? Does the federal government, or does the airline? Genuinely don't know. But if the federal government, which bit? Like, to give you an example, I work for an agency in the Pacific Northwest. We manage a lot of land for the federal government, so do several other agencies (BLM, USACE, BPA, BOR, FS, Parks)--there is no one 'federally owned land' database. If BPA wants to figure out what rights the BOR has over property, they have to either go to the county, or go to BOR, or both.
Now, this drives me up the wall, because the Real Estate stuff (and survey data/aerials/etc) is basically never confidential, or protectable under FOIA, so I'd really like it to all just be publicly posted in a nicely searchable database, so I don't have to deal with FOIA requests for it.
The problem with that is:
1) Building that database and maintaining it is expensive and not something we're funded for.
2) So long as the files are internal, it's fine that they're PDF scans of 50 year old records that you have to get real close and do some guesswork to actually read, but if they're publicly posted, then you need to make them 508 (web accessibility for the disabled) compliant, which is not easy generally and really not easy for documents like this (by the way, 508 compliance is also a main reason 'release to one is release to all, just publicly post almost everything anyone requests under FOIA in one big database to make things easier/actually serve the public's alleged need for this information' doesn't happen, nor does a lot of publicizing of information), good example of a place where the idea is good and generally makes sense (especially for government websites you need to access to, say, apply for disability) but overbroad application causes it to make it so stuff is just not posted for anyone, rather than posted and accessible to everyone.
But maybe more advanced tech tools will help with that...
And this has been ECD's 'let's make a topic about one thing into me ranting about FOIA issues...'
Well said.
I'm impressed to the degree that the government seeks to accurately maintain and quickly access the "No Fly List". If they can! But that's relatively simple and with extremely high motivation behind it. The idea that they would somehow use flight manifests to monitor people overstaying visas reflects a view of the governmental panopticon that only exists in movies.
They don't accurately maintain the "No Fly List"! It has all sorts of terrible problems with accuracy, missing people, wrong people, name collisions, etc.
You're telling me those action rooms, where a dozen people with two dozen big-screen TVs have access to all the data and cameras in all the world, aren't real?
I'll never see a spy thriller the same way again.
As someone who interacts a lot with consumer-facing public lands tools, are you trying to say that sites like the USGS PAD-US are incorrect? Or that they operate at an insufficient level of accuracy around the margins? I’d love some object-level clarification, maybe getting in the weeds here, because this is something much more interesting and applicable to me personally than it likely is for the vast majority of commentators here.
Not particularly familiar with PAD-US, but given that they straight up say they're "the best available aggregation of federal land and marine areas provided directly by managing agencies, coordinated through the Federal Geographic Data Committee Federal Lands Working Group." I don't think they're claiming completeness or complete accuracy. I'm the best basketball player in my family--but that's REALLY not saying much.
However, I will say that I know that my agency has been working on GIS mapping data and has some tools that are reasonably accurate. But in many cases, what you'll see is people pulling from multiple data sets which won't actually align. So, just recently we were dealing with a local county attempting to sell federal property, because their county assessor claimed they owned it and their own records had been lost in a fire ~20 years ago. We were able to produce our deed for it, but if folks are pulling from multiple data sets, how do they resolve discrepancies?
I'm quite confident PAD-US is, at minimum, incomplete, because I happen to know agencies which do not have complete GIS databases of their lands and problems are always being discovered with the existing ones (stuff gets left out, or mapped incorrectly in the system, or labelled incorrectly--nope, we have an access easement over that property, it's not fee property, but someone clicked the wrong checkbox when entering it and now our system thinks we do!).
But, to be clear, my original complaint was actually a bit different. There's two issues, there's 'who owns what property?' and there's 'what document proves that?' My original complaint was about the latter. In most cases it'll be a deed, a judgment, or a declaration of taking and it's those underlying documents that are a medium pain to share, rather than being in a nice database. Now, a REALLY nice database would nicely connect those documents to mapping which showed their physical location and I think most agencies are in fact working on such, just for internal management. In a perfect world, once those are in reasonably good shape, someone will realize 'hey this is a bunch of useful data, let's just make it available to the public.'
Then it will be my job to say 'great idea, just spend ~5-10 million dollars making it all 508 compliant and we can do it!' (Note: cost estimate sourced directly from my posterior). Well, and also 'here are the fifty three caveats to put on the data, so when people discover errors they have a harder time suing us about them.'
If there's more specific questions you've got, I'm always happy to chat about this stuff though!
Yeah, this is fascinating stuff. I’m used to using PAD-US and other, often third-party tools (OnX, CalTopo etc, all of which I believe are based on an Esri dataset with a few tweaks) but all of this is typically for recreation uses — which makes it easier because I imagine NPS/FS/BLM/FWS land types are among the best-attested. The PAD-US goes further in specifically trying to incorporate all land covered by a recreation / conservation easement as well (obviously much more difficult) but I imagine USACE perhaps or BOR land has more issues with data quality? I suppose what I’m really curious about here is which types of federally-owned land you typically have an issue with?
I hadn’t heard about this 508 compliance issue - really interesting!
Tracking ins and outs wouldn’t really work for immigration (the vast majority of ins and outs aren’t immigrants) but it would be interesting to know what the annual and longer patterns are in foreigners present in the country. Not obviously worth enough to make people wait in an extra line while leaving the country though.
Here's another data source that might be of interest, although it has its own strengths and weaknesses: https://wonder.cdc.gov/natality.html
The CDC Natality Database has a fairly complete database of all births that take place in the US, with very detailed stats on every mother (and father, if known) including country of origin. So one can get some very detailed counts of how many Haitian mothers gave birth each year in Ohio (about 50 in 2020, now closer to 800, split mostly between Franklin County (Columbus) and Clark County (Springfield)).
More to the point of this article, total births to non-native mothers appear to be slightly increasing over the past 3 years: 832k, 843k, 856k, while native-born-mother births are slightly falling: 2.8M, 2.7M, 2.7M. The 2024 number of 856k is about 3% higher than the 2022 number of 832k, which is much more in line with the ACS measurements than the CBO's.
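The "about 3%" figure can be checked directly from the three rounded counts quoted above. The helper name `pct_change` is just for illustration.

```python
def pct_change(old: float, new: float) -> float:
    """Percentage change from old to new."""
    return (new - old) / old * 100


# Births to non-native mothers, as quoted above (rounded to the nearest thousand).
foreign_born_births = [832_000, 843_000, 856_000]

change = pct_change(foreign_born_births[0], foreign_born_births[-1])
print(f"{change:.1f}%")  # prints "2.9%"
```

Because the inputs are rounded to the nearest thousand, the result is only good to about a tenth of a percentage point.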
Of course counting births is an odd way of trying to count immigrants, but it has some pros depending on your use case, especially if you were interested in future demographics, or immigrants who are likely to stay, or even crime as men are violent around the same ages that women give birth. And compared to survey data, it's very complete and detailed. Nearly every birth is captured.
I also made an interactive tool using this natality data that reports a lot of the information the CDC captures in various ways:
This one has a ton of details on which immigrants live where and their health inputs and outcomes: https://theusaindata.pythonanywhere.com/enclave_health
And this one aggregates a lot of that same data up to the sending-country level: https://theusaindata.pythonanywhere.com/immigrant_paradox
Really interesting! But the kinds of annual changes that make big shifts in continent of origin of immigrants over three years might also cause major changes in the gender composition, and age, and natality intentions of those immigrants too, so I would be very careful about estimating changes in total immigration based on changes in births to foreign born mothers.
It's just another data point. I think it has value for several reasons and one of those is that it's just a very different data source. That independence can add value, but also adds risks as you're pointing out.
If I was tasked with getting a more accurate answer next week, my overall strategy would be to try to gather several more independent data sources and triangulate or average or blend them together. If one source is way out of line with the others, then something might be wrong with that source, like the shifts you're talking about. But you have to collect a lot of sources to really know which are out of line :)
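As a toy version of that triangulation idea: take the median of several independent estimates and flag any source that sits far from it. The source names, the numbers (in millions), and the 50% tolerance below are all invented for illustration and are not real estimates.

```python
from statistics import median


def blend_estimates(estimates: dict, tolerance: float = 0.5):
    """Return the median estimate and any sources that deviate from it
    by more than `tolerance` (as a fraction of the median)."""
    center = median(estimates.values())
    outliers = {
        name: value
        for name, value in estimates.items()
        if abs(value - center) / center > tolerance
    }
    return center, outliers


# Made-up figures, in millions of people per year.
hypothetical = {"source_a": 1.1, "source_b": 1.3, "source_c": 1.2, "source_d": 3.0}
center, outliers = blend_estimates(hypothetical)
print(center, outliers)
```

The median is used rather than the mean so that a single wild source does not drag the blended figure toward itself. With only two sources, though, there is no way to tell which one is the outlier, which is the point about needing a lot of them.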
These are really cool tools, great job!
One need not be a nativist or anti-immigration to think that it is damning that federal agencies have such wildly different estimates of annual immigration that they differ by three times. That is an implicit confession from the federal government that we have poor control over our borders. If the GOP weren't the stupid party, it would be talking about that one factoid around the clock.
It is also objectively nuts that "Questions about birthplace and citizenship are not currently asked on the decennial Census." (Yes, I understand why -- in theory not asking about that stuff drives compliance -- but still, c'mon, man.)
I think you're confused about the direction of the error. The administrative data, from the Border Patrol and Immigration, has the high count, while the survey data, which could capture people who come in illegally and are not caught, has the low count.
That just happened in the last year or so. Prior to that the estimates were a lot closer. So let's not be too quick to "damn" them but, as Jed says, recognize that something went out of kilter recently and we need to dive into that and figure out why.
It's data. Sometimes they're just messy and work malevolently to fool you.
This reflects a misunderstanding of the purpose of these data products. They're not border security report cards, and if you try to use them as such, you'll find they're not well suited.
You bring up the decennial Census not asking questions about birthplace or citizenship and seem to understand the reason why. Yet you object, because you want the decennial Census to be something it isn't. If you want to add language about quantifying levels of illegal immigration to their mission statement, go for it. But at some point, when you're creating data products as large and complex as the ones we're talking about, you have to make hard decisions about priorities and goals. Do you want to support constitutionally required Congressional redistricting by getting as accurate a count as possible of the number of people in different places? Or do you want to try to count the immigrants? These goals are, to a degree, at odds, and leaders must prioritize.
You are wrong. I just read the relevant text in the Constitution, Article 1, Section 2:
" The actual Enumeration shall be made within three Years after the first Meeting
of the Congress of the United States, and within every subsequent Term of ten Years, in
such Manner as they shall by Law direct, with the primary aim of counting criminals and mental asylum patients entering these United States from Mexico."
I am not sure that's the only reason they don't ask about citizenship on the census. There was a ton of litigation on this during the Trump administration. Liberal groups vehemently opposed asking and I assume that vehemence was based on the political implications of such a question in allocating seats in Congress, not merely a concern about compliance with the census.
Well, the two connect. Less compliance with the census leads to the (potential) change in representation.
"That is an implicit confession from the federal government that we have poor control over our borders"
You're reifying "the federal government." That conceals more than it reveals.
Does it need to be this hard? Like, my takeaway is that we have the full might of the world's wealthiest government with its most powerful and far reaching intelligence collection agencies and we are just sort of the Spider-Man pointing at each other meme when it comes to collecting immigration information?
This kind of stuff makes me want to run screaming into the woods.
I think this is an excellent example of something to defer to the experts on. If the well-intentioned economist who was an under secretary of commerce says it's hard, it's probably pretty hard!
I’m not disputing that it’s hard, I’m saying it seems like something we have the intelligence and resources to fix if we really wanted to do that.
The "if we really wanted to do that" is the key. We do want and need to know things like how many people are in the country, and what is the size and shape of the labor force. But we want to know those things as tools to accomplish other tasks, not as ends in themselves.
Having ever sharper drill bits helps but is neither necessary nor sufficient for the Slow Boring of Hard Boards which is really the ultimate task the government needs to accomplish.
Is it harder or easier than developing a vaccine, building a high speed rail system, making everyone pay their taxes, or convincing people not to vote for a lying, cheating, sexually assaulting maniac?
Anything involving millions of people is hard, especially if you want it done accurately, and without violating a lot of basic assumptions about rights.
Experts making forward-looking predictions might easily differ by 2.5x, especially when exponential growth is involved. However, when experts differ by 2.5x on a current headcount, I'm pretty confident I could do better. Give me a staff and a budget and I could definitely do better.
These government experts don’t have much skin in the game. They don’t profit for being right. They aren’t really punished for just sticking to the old methodology on auto pilot. They just put in their 40 hours and get their step increases. There is no incentive to hit it out of the park, and strong incentives to keep your head down.
"Give me a staff and a budget and I could definitely do better." ... "There is no incentive"
Do you mean to say you would do better because of your talents or you would do better because someone would give you incentives?
If the issue is incentives, then David Abbott is not the key piece of the puzzle. If the issue is you're more talented then why would you have incentives that the staff experts don't have?
In either case - the experts don't have the staff and the budgets. They're using data that happens to be collected for other purposes and applying it to a more niche problem that it wasn't designed for. If you give me a staff and a budget that was specifically designed to count immigrants I guess I could also do better, but so could anyone.
I think the positions of "this is a difficult issue with confounding incentives and perennially small budgets" and "even still, estimates that differ from each other by a factor of three are a bit ridiculous, and we should do better" are perfectly compatible positions to hold at once.
I wonder how peer countries are doing? I have little data, but I would bet their estimates are much more accurate, and obtained for similar costs or less. As is the case with high speed rail. Or health care. Or automated tax filing. Or...
It seems to often be the case that we have local experts confidently claiming that X can't be done, while every other OECD country has been outperforming us at X for decades.
If you want a “peer country” (there aren’t real peer countries to the US), Sweden’s system is so bad that there is disagreement as to whether or not the net emigration this year is real or a statistical artifact of investigating suspected welfare fraud and the state has been unknowingly giving benefits to people living outside of the country (eg the Iraqi defense minister: https://en.wikipedia.org/wiki/Najah_al-Shammari).
I find the definitions of "peer countries" to usually be quite arbitrary and not well thought out. I wouldn't call Australia, Japan or NZ "peer countries" here, because their immigration situation is entirely different as island nations.
Countries in Asia also have very different takes on civil liberties, and, in common with Europe, as usual, have been severely underperforming the US in terms of assimilating new immigrants for decades.
Another reason I wouldn't expect Europe to be any better, if they can even be considered peers, is they, as usual, have far stricter privacy laws, which for decades have made much of their civic and economic progress ludicrously difficult.
In terms of response rate, this is one where the peer countries would seem to have much in common with the origin country of immigrants and the ancestral origin of US citizens. Response rates drive the accuracy of census data, and response rates in the US are highly linked to ancestries, ie, citizens and immigrants linked to W Europe have census response rates similar to W Europe, citizens linked to ancestry or origin in Mexico have response rates similar to Mexico. So if we're talking about the SW US, in many ways Latin America is our closest peer.
Wait until you get into the weeds of the Dictionary of Occupational Titles, what’s supposed to be a comprehensive look at our labor force for purposes like disability claims, but which hasn’t been updated since 1991 despite endless federal committees dedicated to updating it, is full of outdated jobs like “carnival weight-guesser” and “telegraph operator”, and which causes interminable appeals as two experts who rely on it can confidently disagree on whether 5,000 or 500,000 of a particular job currently exist.
They better be *rewarded* for sticking to the old methodology on autopilot, and *punished* for making changes in methodology that aren’t amply tested and validated and understood in a way that minimizes incomparability of data from year to year.
You clearly haven’t worked with data collected at a large scale before. All these problems are not unique to government and are things that data scientists and data engineers deal with to varying degrees at private companies large and small. Data collection and analysis is just way more complicated to actually do than it is to talk about!
But it doesn’t seem like we are actually trying to collect it in an organized way? We are purposely keeping things siloed.
I understand this is a hard problem and there are smart people dedicated to making the best of an imperfect system here.
Different silos are trying to measure different things, and care about different types of errors differently. There’s no way that reconciled numbers will serve any of these purposes as well as the different estimates do, even though those of us outside the agencies might have interests that are better served by reconciled numbers.
I remember when the CIA World Factbook was one of everyone's go-to resources for global and economic data and how lacking that was in retrospect. And it's not like the private sector necessarily had something better sitting around, especially considering how many private sector economists rely on census data and other government sources.
It's completely unsurprising. The "full might of the world's wealthiest government" also is incapable of building high speed rail, balancing its budget, winning a war against a much smaller Third World country, controlling the number of people who cross our border, and keeping health care costs to less than 1.5X per capita what anyone else in this world pays.
"....the full might of the worlds wealthiest government with its most powerful and far reaching intelligence collection agencies..."
We also have other priorities.
Turns out, hard things are hard.
Also, "the full might of the world's wealthiest government" isn't working on this. The Census Bureau, for example, is around 0.2% of the Federal Budget. And it's subject to all sorts of restrictions and limitations, many (all?) of which you wouldn't at all want lifted.
Trying to estimate immigration levels using government survey data when a majority of new arrivals are illegal is insane. It’s like trying to estimate drug dealing by giving 50 million people a survey and asking “how many times did you deal crack last year? how about meth?”
All of the technical “refinements” built on these idiotic assumptions are deck chairs on the Titanic. An honest estimate would look for places that hire illegal immigrants and covertly document employment levels and layoff patterns. You would need other data points, but relying on people who are here illegally to fill out a survey and not trying really hard to capture non-responders is not intellectually honest.
"An honest estimate would look for places that hire illegal immigrants and covertly document employment levels and layoff patterns"
This would pretty obviously also have its own flaws. In any case, there's no obvious way to solve the problem of counting people that are undocumented, short of ramping up or improving on existing counting methods. Commenters are acting like this is easy.
I never said this is easy, far from it. I’m not sure it’s possible to consistently come within 10 or even 15% of the true figure. However, for two estimates to be off by 2.5x is like Rasmussen saying Harris has 25% of the vote and the NYTimes saying she has 62%. Someone’s methodology is crazy.
It’s not like that, because a Rasmussen survey is measuring percentages, not absolute numbers. Estimates of percentages that vary by 37 percentage points are very different from estimates of totals that differ by a factor of 3.
Both could result from sampling mechanisms that happen to have very large bias for the quantity in question.
But the fact that this occurred just in the past year or two suggests that the methodologies were working pretty well until recently.
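One way to read the percentages-versus-totals distinction above is with a little arithmetic. The sketch below uses only the two poll shares from the analogy (25% and 62%); the variable names are made up for illustration, and this is one interpretation of the point, not necessarily the commenter's exact argument.

```python
# The two vote shares from the polling analogy above.
low_share, high_share = 0.25, 0.62

# Gap measured the way poll errors usually are: in percentage points.
gap_points = (high_share - low_share) * 100

# Gap measured the way the immigration totals were compared: as a ratio.
ratio = high_share / low_share

# A share can never exceed 100%, so the ratio to a 25% estimate is capped at 4x.
# A head count has no such ceiling, so a 2.5x gap in a count is not the same
# kind of error as a 2.5x gap in a share.
max_ratio_for_share = 1.0 / low_share

print(round(gap_points), round(ratio, 2), max_ratio_for_share)
```

The same pair of numbers looks like a 37-point miss on one scale and roughly a 2.5x miss on the other, which is why the two kinds of error are hard to compare directly.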
1. Most people who arrive in the US are not illegal immigrants.
2. People who come to the US and apply for asylum are not illegal immigrants.
3. Survey data is by far the best method for estimating drug use.
"Most immigrants (77%) are in the country legally. As of 2022:
49% were naturalized U.S. citizens.
24% were lawful permanent residents.
4% were legal temporary residents.
23% were unauthorized immigrants."
https://www.pewresearch.org/short-reads/2024/07/22/key-findings-about-us-immigrants/
That doesn’t imply most new arrivals are here legally. People without visas probably stay less than people with visas.
You are the one claiming to know what proportion of immigrants are in the country legally. I’m saying it’s hard to measure.
Fwiw, there was a huge increase in 2022/23 in the number of unauthorized immigrants apprehended near the border. That certainly suggests the proportion of new immigrants who are unauthorized has increased.
https://www.cbp.gov/newsroom/stats/southwest-land-border-encounters
You're the one claiming that "a majority of new arrivals are illegal." Either back that up or don't. Saying that there's been a year-to-year increase in unauthorized border crossings in no way actually backs up what you claimed.
To the specific point, it's pretty clear that most immigration over the last couple of years has not been primarily planned and approved immigration. It's mostly been a mix of illegal and asylum claims. The latter is in a grey zone where it's not illegal exactly, but given the abuse of asylum claims it's not really what was intended either.
See here for a visualization and more detail using the CBO estimates:
https://www.wsj.com/economy/how-immigration-remade-the-u-s-labor-force-716c18ee?st=4RtbLW&reflink=desktopwebshare_permalink
I'm a bit of an immigration maximalist, so would solve for this by opening up legal immigration dramatically, but we should deal with the facts that are, not the facts we might want.
that is a selective and tendentious demand for rigor
“a majority” is much more humble than “77%”
Fine, show your data to back up your claims.
> trying to estimate drug dealing by giving 50 million people a survey and asking “how many times did you deal crack last year? how about meth?”
This is an… entirely valid way to collect information about drug dealing.
Can you elaborate?
Asking people how often they engage in a behavior is an extremely well-established way to estimate the prevalence of said behavior. It's not perfect, but it's extremely widely used.
Do you have a better method for estimating drug dealing? In particular, a method that would yield an upper bound rather than a lower bound?
You’d also want to just randomly survey sewage for metabolites. If overall concentrations are too low to measure, you could even check the sewage of high use places like hotels and night clubs. This would be expensive but if you really want to get the number right, you just find as many plausibly important parameters as possible, see if they provide meaningful data, and then argue over what coefficient each should have.
whether one parameter gets 20% or 25% of the weight generally won’t affect the estimates by much.
the conceit of survey data is it's very good at pretending to be comprehensive and subject only to random sampling error. you can easily ask people anything. you can’t conveniently measure the sewage at all the night clubs, it’s expensive and it’s obviously only one piece of the puzzle. that sort of data is very valuable, but only once you give up on an exact answer and commit to looking at multiple parameters
i’d start by looking at overdose and dui deaths. i’d also look at arrest data, interview narcotics detectives in major metro areas and mash them together to come up with some sort of aggregate
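The "many parameters, argue over the coefficients" idea above can be sketched as a weighted average. Everything in this example is hypothetical: the indicator names, the index values, and the weights are invented purely to show the mechanics, not drawn from any real data set.

```python
def composite_estimate(indicators, weights):
    """Weighted average of indicator values; weights are normalized to sum to 1."""
    total_weight = sum(weights[name] for name in indicators)
    return sum(indicators[name] * weights[name] for name in indicators) / total_weight

# Hypothetical index values (100 = baseline) from four imperfect sources.
indicators = {"survey": 100.0, "sewage": 130.0, "overdoses": 120.0, "arrests": 90.0}

# Two weighting schemes that differ only in whether the survey gets 20% or 25%.
weights_a = {"survey": 0.20, "sewage": 0.30, "overdoses": 0.30, "arrests": 0.20}
weights_b = {"survey": 0.25, "sewage": 0.25, "overdoses": 0.30, "arrests": 0.20}

estimate_a = composite_estimate(indicators, weights_a)
estimate_b = composite_estimate(indicators, weights_b)
relative_change = abs(estimate_a - estimate_b) / estimate_a

print(estimate_a, estimate_b, relative_change)
```

With these made-up inputs the two schemes land within a couple of percent of each other. How much a weight shift matters depends on how far apart the indicators are: the more the sources disagree, the more the choice of coefficients moves the result.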
Ever since the 2020 census revised NYC’s population upward by around 5% (along with many other urban core areas) I have been relatively skeptical of how the ACS weights work around inner cities; not coincidentally many immigrants tend to concentrate around the urban core. I was hoping the CB had done some work to unbias ACS estimates in the future but…maybe not?
Needless to say the rollout of DP has really shaken my confidence in census bureau decision making.
DP?
Differential privacy (that's the privacy rules reducing accuracy mentioned in the article).
There are a fair number of commenters that seem disturbed at the idea that we don't do this much better, and wonder why we can't/don't.
But that overlooks why we're trying to do this at all. We do want and need to know things like how many people are in the country, and what is the size and shape of the labor force. But we want to know those things as tools to accomplish other tasks, not as ends in themselves.
Having ever sharper drill bits helps, but is neither necessary nor sufficient for the Slow Boring of Hard Boards, which is really the ultimate task the government needs to accomplish. You could spend an enormous amount of time sharpening your drill bits. That doesn't actually get you a hole in the wood.
I was genuinely confused about whether illegal/unofficial immigration was included in this explanation ("illegal", "undocumented", and "asylum" do not appear at all in the piece) until I saw a mention of border encounters.
For the last twenty years at least the estimate of illegal immigrants in the country has been 11 million. This number is invariably trotted out and seems to be carved in stone.
Well, it has fluctuated between just over 10M and just under 12M, but, yeah, it's been 11-ish.
Not really: https://www.pewresearch.org/short-reads/2024/07/22/what-we-know-about-unauthorized-immigrants-living-in-the-us/
It looks like there might have been some major change in the 1980s leading to a greater rate of influx. It then took 20 years for that to build up until the rate of outflow roughly equalized influx. There have been smaller changes since then leading to increases and decreases in what the rough steady state would be, but it seems likely that the numbers in the 1990s were so low only because we hadn’t yet reached steady state.
Count me as another person who is surprised that this is so hard.
I get that estimating the number of illegal/undocumented immigrants is hard. But legal immigrants? Legal immigrants are documented to within an inch of their lives. Before I could enter the US as an F-1 student, I had to submit the proper paperwork. My passport was stamped when I entered (in Blaine, Washington), and then, as a condition of remaining an F-1 student in good standing, I had to inform the USCIS (US Citizenship and Immigration Services) within 30 days if I changed my address. The USCIS knew who I was and where I lived. Then, when I applied for my green card, that meant a bunch more paperwork, and when I applied for and eventually received my citizenship, there was a bunch of paperwork too (including a shiny new naturalization certificate).
How hard is it to keep track of all this info in some database? If you want to anonymize it for research purposes, changing "drosophilist became a US citizen on July 12, 2018" to "A39815Z22GJ09 became a US citizen on July 12, 2018" should be pretty trivial, no?
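The name-to-opaque-ID swap described above could be sketched with a keyed hash, as below. This is a minimal illustration, not a description of what USCIS or the Census Bureau actually does: the function name `pseudonymize`, the key, the ID length, and the record layout are all assumptions made for the example.

```python
import hashlib
import hmac

def pseudonymize(identifier: str, secret_key: bytes, length: int = 13) -> str:
    """Map an identifier to a stable opaque ID using HMAC-SHA256 (hypothetical helper)."""
    digest = hmac.new(secret_key, identifier.encode("utf-8"), hashlib.sha256).hexdigest()
    return digest[:length].upper()

# Hypothetical key; in practice it would be kept secret by the data holder.
key = b"hypothetical-secret-key"

record = {"person": "drosophilist", "event": "became a US citizen", "date": "2018-07-12"}

# Replace the name with its pseudonym and leave the other fields untouched.
anonymized = {**record, "person": pseudonymize(record["person"], key)}

print(anonymized)
```

The same person always maps to the same ID, so records can still be linked across files. The swap itself is the easy part; the remaining fields (an exact date, a location) can still single someone out, which is part of why statistical agencies add protections beyond renaming.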
Good question:
One part of the answer may be that the use cases are most interested in how many residents we have, regardless of the various statuses. So we want to know how many US citizens currently live here, and not accidentally count those who have left (to work in another country or just died). Ditto for legal immigrants who are in various stages of the visa, permanent residency or citizenship pathway.
It's not even all that easy to track which native-born citizens have died or left the country, let alone these other categories. Maybe fully legal, naturalized citizens would be some of the easiest to track, but it still could be tough.
Great choice, thank you for running it. I hope readers keep this column in mind when Matt writes about macroeconomics. If the inputs to macroeconomic calculations are demonstrably very error-prone ...
Except they’re not! In one or two years out of several decades, there have been major differences in two estimates. But most of the time the two are pretty close. Maybe it’s more error-prone than you imagined, but it’s less error-prone than people who actually do data collection might have guessed!
Say I'm here in the US illegally and someone from the "ACS" (whatever *that* is) is in front of me asking me if I'm a citizen and where I come from. How would I answer?
Trick question! I'm doing my roadrunner imitation and beating a hasty retreat before he opens his mouth.
And yet, we know that people do in fact answer. It's not like the survey designers don't think ahead of time about the challenges they face and try to counteract them.
There is a world in which an accurate count of migrant flows for the US has been tried and found hard. There is also a world in which it has been found inconvenient and left largely untried.
How do I, as a data nerd, tell the difference between these two worlds?
Accessible-but-dry, a few too many weeds (had trouble keeping all the acronyms straight, for example); lot of ink to answer "no" to the headline question. Avoids the obvious followup of "yes, correctly controlling the labour demand knob is important, but what about the labour supply knob?"...which is certainly one way to assign the dependent variable, I guess. But I do appreciate hearing from a SME. James Scott would have loved this post, RIP. *pours one out*