Data has become a treasure for public agencies. The more data, the better: more precise and efficient services should follow, at least according to the current theory powering many government technology upgrades. But at the same time, data and the privacy of individuals must be protected.
These dual needs have created an opening for what’s called synthetic data, which in the simplest terms is algorithm-generated data meant to copy the statistical properties of actual data.
Researchers, businesses and others are already using synthetic data to protect personal privacy, compensate for low supplies of real data, control costs and even guard against bias. And synthetic data has found uses in fraud prevention, census research, finance, marketing, retail and autonomous vehicle development. Synthetic data also helps train machine learning and AI models, boosts quality in manufacturing and improves supply chains. Clinical trials in health care increasingly rely on artificial data sets as well to protect privacy while advancing science.
Though state and local governments don’t yet use as much synthetic data as other sectors do, the rise of AI, intelligent transit systems and other technology will bolster its use in the public sector, some experts say.
“Everything government does takes data,” said Christopher Bramwell, chief privacy officer for Utah. “Synthetic data is a new frontier. It will explode.”
For that to happen, there must first be education about synthetic data, as well as its role in privacy and its potential limitations. Shining the spotlight on how it has been used in the government space could also advance its case and spark more projects anchored to those particular treasure chests of data.
WHAT IS SYNTHETIC DATA?
One of the most succinct definitions of synthetic data comes from IBM: “Synthetic data is artificial data designed to mimic real-world data.”
That form of data still “retains the underlying statistical properties of the original data that it is based on,” IBM explains. “As such, synthetic datasets can supplement or even replace real datasets.”
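To make that definition concrete, here is a minimal sketch in Python. It is illustrative only, not IBM’s or any vendor’s actual method, and the column names and figures are invented: it fits the means and correlations of a small toy “real” table, then samples entirely new records from that fitted distribution.

```python
# Illustrative only: generate synthetic records that retain the means and
# correlations of a toy "real" dataset by fitting a multivariate normal
# distribution and sampling new rows from it. Columns and numbers are invented.
import numpy as np

rng = np.random.default_rng(seed=42)

# Stand-in for real data: 1,000 records with two correlated attributes.
income = rng.normal(60_000, 15_000, size=1_000)
transit_trips = 200 - 0.001 * income + rng.normal(0, 20, size=1_000)
real = np.column_stack([income, transit_trips])

# Fit the statistical properties we want to retain.
mean = real.mean(axis=0)
cov = np.cov(real, rowvar=False)

# Draw brand-new records from the fitted distribution.
synthetic = rng.multivariate_normal(mean, cov, size=1_000)

print("real means:     ", real.mean(axis=0).round(1))
print("synthetic means:", synthetic.mean(axis=0).round(1))
print("real corr:      ", np.corrcoef(real, rowvar=False)[0, 1].round(3))
print("synthetic corr: ", np.corrcoef(synthetic, rowvar=False)[0, 1].round(3))
```

The synthetic table supports the same aggregate analysis, but no row belongs to a real person, which is the property that lets such data sets supplement or even replace the originals.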
A real-world example from the government technology space also helps shine a light on synthetic data. It comes from Replica, a traffic management data company spun off from Google-affiliated Sidewalk Labs (now defunct).
Replica uses data that has been scrubbed clean of properties that can identify locations, and it runs that information through various models, which are based on what’s called “synthetic populations.” Despite all that apparent fakery — all the replication — the combined information can help public-sector clients to study traffic patterns and gain other insights.
EV charging patterns, commuting habits of local residents, biking habits: synthetic data can help with analysis of all those issues, and do so without invading anyone’s privacy, according to Replica and other data experts. Privacy, after all, is more than one’s Social Security and banking numbers. Without proper guardrails (or, backers say, without responsible use of synthetic data), sifting through the patterns offered by data sets can lead to privacy invasions.
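As a rough illustration of the “synthetic population” idea, here is a hypothetical toy example in Python. It is not Replica’s actual pipeline, and the share numbers are invented: fictional residents are sampled from published aggregate shares, and commute patterns can then be tallied without any record describing a real person.

```python
# A hypothetical toy example, not Replica's pipeline: build a tiny "synthetic
# population" by sampling fictional residents from aggregate shares (the
# shares below are invented), then tally commute patterns for analysis.
import random
from collections import Counter

random.seed(7)

age_shares = {"18-34": 0.35, "35-64": 0.50, "65+": 0.15}
mode_shares = {"car": 0.70, "transit": 0.20, "bike": 0.10}

def sample(shares):
    """Pick one category in proportion to its share."""
    return random.choices(list(shares), weights=list(shares.values()), k=1)[0]

# Each synthetic resident is drawn from the aggregates, not copied from anyone.
population = [
    {"age_group": sample(age_shares), "commute_mode": sample(mode_shares)}
    for _ in range(10_000)
]

print(Counter(p["commute_mode"] for p in population))
```

Real synthetic-population models go much further, preserving relationships between attributes and calibrating against observed travel data, but the privacy logic is the same: the analysis runs on fictional people.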
There is significant money involved in synthetic data, too.
Replica, founded in 2019, has raised at least $52 million, according to Crunchbase, including a $41 million Series B funding round. More capital is sure to find its way to synthetic data, too. Research and advisory firm Gartner, for instance, predicts that by 2026, 75 percent of businesses will use generative AI to craft synthetic data.
Such data might seem brand new — it’s barely starting to break through to mainstream awareness, after all — but, much like AI, the origins of synthetic data stretch back longer than one might assume.
In fact, synthetic data has been around since the early 1990s, according to Krish Muralidhar, a professor at the University of Oklahoma known for his expertise in data privacy.
One of the most famous — and controversial — uses of synthetic data involves the U.S. Census Bureau and what the AP called “customized tables tailored to … research” centered around the last census, in 2020. Critics of that type of data worried about errors and manipulation.
The Census Bureau, which still uses synthetic data, saw the situation differently. Synthetic data protected the privacy of individuals while giving the bureau a more precise look at certain trends, including those around income and poverty. The bureau also offered its own definition of this quickly emerging data form.
“Synthetic data can mean many different things depending upon the way they are used,” the Census Bureau said. “Sometimes, as in computer programming, the term means data that are completely simulated for testing purposes. Other times, as in statistics, the term means combining data, often from multiple sources, to produce estimates for more granular populations than any one source can support.”
Even so, suspicion of synthetic data persists.
“Government agencies abhor the notion of synthetic data,” Muralidhar told Government Technology. “To them it was not real.”
To get around that resistance, he said, some backers of synthetic data resorted to euphemism. Instead of using the term “synthetic data,” he said, “we called it data shuffling. We could shuffle data around but preserve statistical relationships.”
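As a rough illustration of the shuffling idea, here is a simplified sketch in Python. It is not Muralidhar’s published data-shuffling algorithm, and the toy data is made up: a sensitive column is permuted within small groups of similar rank, so most records end up with someone else’s value while the overall distribution and its relationship to other attributes stay roughly intact.

```python
# A simplified sketch of the shuffling idea, not Muralidhar's published
# data-shuffling algorithm: permute a sensitive column within small groups
# of similar rank, so most records end up with someone else's value while
# the distribution and its link to other attributes stay roughly intact.
import numpy as np

rng = np.random.default_rng(0)

# Toy data: a non-sensitive attribute and a correlated sensitive one.
age = rng.integers(20, 70, size=500)
salary = 1_000 * age + rng.normal(0, 10_000, size=500)

order = np.argsort(salary)   # record positions, sorted by the sensitive value
shuffled = salary.copy()
block = 20                   # permute within blocks of similar rank

for start in range(0, len(order), block):
    idx = order[start:start + block]
    shuffled[idx] = salary[rng.permutation(idx)]

print("corr(age, salary) before:", np.corrcoef(age, salary)[0, 1].round(3))
print("corr(age, salary) after: ", np.corrcoef(age, shuffled)[0, 1].round(3))
```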
No matter what one calls it, synthetic data is all but certain to play a bigger role in the world in the coming years.
“A lot of AI learning models are based on synthetic data because [researchers] don’t necessarily have access to real data per se,” Muralidhar said. “Take health data, which is very restricted. It’s very difficult for health organizations to share their data.”
PURSUING PRIVACY IN UTAH
The health data restrictions point to the privacy issues around synthetic data — more specifically, the view that such data can go a long way toward keeping our most personal information safe from digital predators.
Bramwell, the chief privacy officer for Utah, is an advocate for that view. He has teamed with Utah state Rep. Lisa Shepherd to help advance it among lawmakers and constituents.
“I am very bullish about synthetic data,” Bramwell said. “We need data to improve government services, and synthetic data can do it in a way that is privacy preserving.”
Making data untraceable to individual people, a difficult but achievable task, carries many implications, not the least of which involve transparency and trust. After all, he pointed out, personally identifiable information collected by the Census Bureau helped the U.S. government during World War II identify people who were sent to internment camps, including in Utah.
“Synthetic data has the potential to be the best tool to protect privacy with what’s coming,” he said.
Utah — a state trying to become more attractive to technology companies — actually has a synthetic data policy, part of a recent law regarding consumer privacy. Among other things, the Utah Consumer Privacy Act requires disclosure that “synthetic data generated by generative AI is not personal data,” according to one summary.
“We are one of the only states with a definition about synthetic data,” Bramwell said. “The first step is you have to define it.”
The use of synthetic data among local and state public agencies generally remains much closer to that first step than the second one, though. As Bramwell said, “We are very early in the journey.”
But he can see the outlines of what’s to come. He anticipates that synthetic data, at least for now, will find less use in policy than in testing use cases, including in the health and human services area.
For her part, Shepherd, a rookie state representative, said she plans to keep pushing for clearer standards on digital privacy and synthetic data in the Utah Legislature — another sign that those artificial data sets could soon take up more of the tech spotlight in government.
WHAT'S NEXT?
Earning more backing for synthetic data from officials will take serious work.
That’s according to a 2024 survey from Coleman Parkes and the data and AI firm SAS, which found that “32 percent of government decision-makers worldwide said they would not consider using synthetic data.” The finding was highlighted in a blog post by John Gottula, a principal adviser for AI and biostatistics at SAS and a professor at North Carolina A&T University with expertise in agricultural technology.
By contrast, that survey found that 23 percent of respondents from other industries had the same view.
“This reluctance highlights a concerning gap in readiness, as public sector agencies risk falling behind in leveraging AI’s transformative potential,” Gottula wrote.
In an interview with Government Technology, Gottula said that because synthetic data is so new — or seemingly new — its benefits have yet to catch on among state and local officials, to say nothing of the public.
But more tests and use cases could serve as useful education.
For instance, Gottula says generalized, anonymous data can help politicians and educators — and, presumably, voters and taxpayers — get a better handle on school performance and the factors that help determine how well students do in class. That, in turn, can help shape funding. Synthetic data also has uses in agriculture, with Gottula describing a project that is seeking “to understand the landscape of animal antimicrobial resistance,” work that can benefit from having state labs share data in “private ways.”
Europe also offers examples of how to move forward with synthetic data, with smart city and traffic management efforts using those types of data sets.
But like any other technology or data tool, synthetic data requires ethical oversight. Active testing for synthetic data “look-alikes” can serve as a defense against the type of reverse engineering that can uncover personally identifiable information, he said.
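One hypothetical form such a test could take is sketched below in Python. The article does not describe SAS’s actual tooling, and the function name, data and threshold are invented for illustration: a nearest-neighbor check flags synthetic records that sit suspiciously close to a real one.

```python
# A hypothetical look-alike check, invented for illustration: flag synthetic
# rows whose nearest real row (Euclidean distance on normalized numeric
# attributes) falls under a chosen threshold, since near-duplicates could
# aid re-identification.
import numpy as np

def flag_lookalikes(real: np.ndarray, synthetic: np.ndarray, threshold: float):
    """Return indices of synthetic rows that sit within `threshold` of a real row."""
    flagged = []
    for i, row in enumerate(synthetic):
        nearest = np.min(np.linalg.norm(real - row, axis=1))
        if nearest < threshold:
            flagged.append(i)
    return flagged

rng = np.random.default_rng(1)
real = rng.normal(size=(200, 3))
synthetic = rng.normal(size=(200, 3))
synthetic[0] = real[0] + 0.001   # plant a near-copy to show the check firing

print("suspicious synthetic rows:", flag_lookalikes(real, synthetic, threshold=0.01))
```

Flagged rows would then be dropped or regenerated before the synthetic data set is released.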
The risk for public agencies that don’t eventually embrace synthetic data, Gottula said, could prove significant: Training and testing AI systems might suffer, as would deeper analysis of existing and proposed policies.