Every ten years, the U.S. Census Bureau counts Americans, trying to strike a balance between accurate information and privacy. But current technology can reveal a person's transgender identity by linking seemingly anonymous details, such as age and neighborhood, to determine whether their sex was reported differently in successive censuses. Trans people and their families living in states that criminalize them could be harmed if gender or other census data can be deanonymized.
In places like Texas, where trans families may be seeking medical treatment for their children, the state needs to know who those teenagers are in order to conduct its investigations. We were concerned that census data could be used for exactly this kind of investigation and punishment: Could a weakness in the anonymization of publicly available data be exploited to find trans children and punish their families? This concern is similar to the one that underscored the public outcry in 2018, when the census proposed asking people to reveal their citizenship: that the data would be used to find and punish people living in the U.S. illegally.
Using our expertise in data science and data ethics, we created simulated data to imitate the Census Bureau's data sets and attempted to identify trans teenagers. With the data-anonymization approach the Census Bureau used in 2010, we were able to identify 605 trans kids. The Census Bureau is now rolling out a new approach based on differential privacy, which improves privacy overall. When we reviewed the most recently released data, we found that the bureau's new approach cuts the identification rate by 70 percent: a big improvement, but one that still leaves room for more.
We are researchers who use census data to answer questions about life in the U.S., and we strongly believe that privacy matters. That is why the bureau's current public comment period on the design of the 2030 census is so important.
The federal government uses census data to determine the size and shape of voting districts and to decide how to distribute funding. And the data aren't used only by government agencies: researchers in many fields, including economics and public health, use them to analyze the state of the nation and make policy recommendations.
But the risks of deanonymizing data are real, and they are not limited to trans children. In a world of ever-growing private data collection and easy access to powerful computing, it may be possible to strip away the privacy protections the Census Bureau has built into its data. Perhaps most famously, computer scientist Latanya Sweeney showed that almost 90 percent of U.S. citizens could be reidentified from just their ZIP code, date of birth and sex.
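To see why those three fields are so revealing, here is a minimal sketch of the underlying idea: measuring what fraction of people in a table are uniquely pinned down by ZIP code, date of birth and sex. The records and field names below are entirely invented, and this is an illustration of the concept, not Sweeney's actual data or method.

```python
import pandas as pd

# A toy, entirely synthetic table of six people (made-up records).
people = pd.DataFrame({
    "zip": ["02139", "02139", "02139", "02139", "10001", "10001"],
    "dob": ["1990-03-01", "1990-03-01", "1990-03-01",
            "1985-07-12", "1990-03-01", "1972-11-30"],
    "sex": ["F", "F", "M", "F", "F", "M"],
})

# For each person, count how many people share their (zip, dob, sex) combo.
combo_size = people.groupby(["zip", "dob", "sex"])["sex"].transform("size")

# A person is reidentifiable when their combination is unique in the table.
reidentifiable_share = (combo_size == 1).mean()
print(reidentifiable_share)  # 4 of the 6 people here are unique
```

In real population data, ZIP-code areas are small and birth dates are fine-grained, which is why the unique share turns out to be so high.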
In August 2021 the Census Bureau responded, adopting differential privacy, the approach preferred by cryptographers, to protect its redistricting data. Mathematicians and computer scientists have been drawn to the mathematical elegance of this approach, which involves intentionally introducing a controlled amount of error into key census counts and then cleaning up the results so that they remain internally consistent. For example, if the census counted exactly 16,147 people who identified as Native American in a particular county, it might report a close but different number, such as 16,171. That sounds simple, but counties are made up of census tracts, which are in turn made up of census blocks. So to get close to the original count, the bureau must also adjust the number of Native Americans reported in each tract and block, and its method ensures that all of these close-but-different numbers still add up consistently.
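A rough sketch of how noisy-but-consistent counts can be produced follows. This is our own toy code, not the bureau's actual TopDown algorithm; the block counts and the privacy parameter epsilon are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(42)

def laplace_count(true_count, epsilon):
    # A count query has sensitivity 1, so adding Laplace noise with
    # scale 1/epsilon gives epsilon-differential privacy for that count.
    return true_count + rng.laplace(scale=1.0 / epsilon)

# Hypothetical block counts for one county; they sum to the article's
# example figure of 16,147.
blocks = np.array([412, 95, 1307, 28, 6640, 7665])

epsilon = 0.5
noisy_blocks = np.array([laplace_count(b, epsilon) for b in blocks])
noisy_county = laplace_count(blocks.sum(), epsilon)

# Post-processing for internal consistency: spread the gap between the
# noisy county total and the noisy block sum evenly across the blocks,
# then round and clip to nonnegative integers. (The bureau's TopDown
# algorithm instead solves a constrained optimization; this is a toy
# stand-in for that cleanup step.)
adjusted = noisy_blocks + (noisy_county - noisy_blocks.sum()) / len(blocks)
published_blocks = np.clip(np.round(adjusted), 0, None).astype(int)
published_county = int(published_blocks.sum())
print(published_county)  # close to, but generally not exactly, 16,147
```

The published block counts now add up exactly to the published county count, even though every number differs slightly from the truth.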
One might expect that protecting privacy would be uncontroversial. Some researchers, primarily those whose work depends on the existing approach to data privacy, think otherwise. They argue that these changes will make researchers' jobs harder in practice, while the privacy risks the Census Bureau is protecting against are largely theoretical.
But as we have shown, the risk is not just theoretical. Here is how we did it.
We reconstructed a complete list of the people under age 18 in each census block, learning their age, sex, race and ethnicity as reported in 2010. We then matched this list against the analogous list from 2020 to find people who were now 10 years older and whose reported sex had changed. This method, called a reconstruction-abetted linkage attack, requires only publicly released data sets. We presented it to the Census Bureau in writing, and the bureau reviewed it. The results were compelling enough that researchers from Harvard University and Boston University contacted us to learn more about our work.
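The linkage step of such an attack can be sketched with toy data. The records and field names below are invented, and the reconstruction of per-person records from published tables is far more involved in reality; this shows only the matching logic.

```python
import pandas as pd

# Invented "reconstructed" per-person records for two censuses.
rec_2010 = pd.DataFrame({
    "block": ["A1", "A1", "B2"],
    "age":   [7, 12, 9],
    "sex":   ["F", "M", "F"],
    "race":  ["white", "black", "asian"],
})
rec_2020 = pd.DataFrame({
    "block": ["A1", "A1", "B2"],
    "age":   [17, 22, 19],
    "sex":   ["F", "M", "M"],
    "race":  ["white", "black", "asian"],
})

# Link records a decade apart: same block and race, age exactly 10 higher.
linked = rec_2010.assign(age_plus_10=rec_2010["age"] + 10).merge(
    rec_2020,
    left_on=["block", "age_plus_10", "race"],
    right_on=["block", "age", "race"],
    suffixes=("_2010", "_2020"),
)

# Flag linked records whose reported sex differs between the two censuses.
flagged = linked[linked["sex_2010"] != linked["sex_2020"]]
print(len(flagged))  # one match: the child in block B2
```

Nothing in this pipeline requires private data; the danger is that sufficiently detailed public tables can make such per-person lists recoverable.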
We simulated what a bad actor might do; how can we ensure that attacks like these don't actually occur? The Census Bureau is taking this privacy risk seriously, and researchers who use these data should not put that effort in jeopardy.
The census is collected with great labor and at great expense, and we all stand to benefit from the data it produces. But these data can also cause harm. The Census Bureau's efforts to protect privacy go a long way toward mitigating that risk, and we must encourage them.
This is an opinion and analysis article, and the views expressed by the author or authors are not necessarily those of Scientific American.