When you go to the hospital and have your blood tested, the results become part of a dataset that can be set against population data and other patients' results. This allows doctors to compare you (your blood work, age, gender, health history, scans and so on) with other patients' results and histories in order to predict, manage and develop new treatments.
For centuries, this has been the foundation of scientific research: identify a problem, collect data, look for patterns, and build a model to solve it. The hope is that a type of artificial intelligence (AI) called machine learning, which creates models from data, will be able to do this much faster, more effectively, and more accurately than humans can.
However, training these AI models requires large amounts of data, some of which must be synthetic: data that reproduces the patterns found in real data, rather than coming from real people. Most synthetic datasets are themselves generated by machine learning AI.
The wild inaccuracies of image generators and chatbots are easy to spot, but synthetic data produces hallucinations too: results that are unlikely, biased or even outright impossible. Hallucinated images and text can be entertaining, but the widespread use of these systems in all areas of public life means they have a huge potential for harm.
What is synthetic data?
AI models need much more data than the real world can provide. Synthetic data offers the solution. Generative AI looks at the statistical distribution of real datasets and creates new synthetic data to train other AI models.
This synthesized "pseudo" data is similar, but not identical, to the original data, which means it can protect privacy, circumvent data regulations and even be freely shared or distributed.
Synthetic data can supplement real datasets, making them large enough to train AI systems. It can also rebalance a biased dataset: if the real data contains too few women, for example, or too many cardigans and not enough pullovers, synthetic data can even things out. There is an ongoing debate about how far synthetic data can stray from the original.
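To make that idea concrete, here is a minimal, hypothetical sketch of the "learn the distribution, then sample from it" step. It is not the method used in the research described below (which used a GAN); a simple Gaussian mixture model stands in as the generator, and the patient table and its column names are invented for illustration.

```python
# A minimal, hypothetical sketch of synthetic data generation.
# Real systems typically use GANs or other deep generative models;
# here a Gaussian mixture model stands in as the generator, and the
# "patient" table and its columns are invented for illustration.
import numpy as np
import pandas as pd
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)

# Pretend "real" patient data: two numeric columns.
real = pd.DataFrame({
    "age": rng.normal(55, 12, size=1000).clip(18, 95),
    "blood_marker": rng.normal(4.5, 1.1, size=1000),
})

# Learn the joint statistical distribution of the real data...
model = GaussianMixture(n_components=5, random_state=0).fit(real.values)

# ...and sample brand-new rows that follow a similar distribution.
samples, _ = model.sample(n_samples=1000)
synthetic = pd.DataFrame(samples, columns=real.columns)

print(real.describe())       # statistics of the original data
print(synthetic.describe())  # similar, but not identical
```

The sampled rows follow roughly the same statistics as the originals without corresponding to any real patient, which is what makes synthetic data attractive for privacy.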
Disappearing edge cases
Without proper curation, tools that create synthetic data will always over-represent what is already dominant in the dataset, and under-represent (or omit) less common “edge cases.”
This is how my interest in synthetic data began. Women and other minorities are already underrepresented in medical research, and I was concerned that synthetic data would exacerbate this problem. So I teamed up with machine learning scientist Dr. Sagi Hajisharif to investigate the phenomenon of disappearing edge cases.
In our study, we used a type of AI called a GAN (generative adversarial network) to create a synthetic version of the 1990 U.S. Adult Census data. As expected, the synthetic dataset was missing edge cases: the original data included 40 countries of origin, but the synthetic version had only 31. The synthetic data had omitted immigrants from nine countries.
Once we realized what was missing, we were able to tweak our methodology so that a new synthetic dataset included these edge cases. It was possible, but it required careful curation.
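For readers who want to run this kind of check themselves, here is a minimal sketch in Python. The DataFrames and the column name "native_country" are hypothetical placeholders rather than the exact variables from our study.

```python
# Sketch: checking which categories "disappear" in a synthetic dataset.
# `original` and `synthetic` are assumed to be pandas DataFrames that
# share a categorical column; both the frames and the column name
# "native_country" are hypothetical placeholders.
import pandas as pd

def missing_categories(original: pd.DataFrame,
                       synthetic: pd.DataFrame,
                       column: str) -> set:
    """Return the categories present in the original data but absent
    from the synthetic data (the vanished edge cases)."""
    original_values = set(original[column].dropna().unique())
    synthetic_values = set(synthetic[column].dropna().unique())
    return original_values - synthetic_values

# Example usage (assuming the DataFrames exist):
# lost = missing_categories(original, synthetic, "native_country")
# print(f"{len(lost)} countries of origin were dropped:", sorted(lost))
```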
“Cross Hallucinations” – AI creates impossible data
Then we started noticing something else in the data: cross hallucinations.
Intersectionality is a concept from gender studies. It describes how power relations produce discrimination and privilege for different people in different ways. It looks not only at gender, but also at age, race, class, disability and other factors, and at the situations in which these factors "intersect."
This way of thinking can inform how we analyze synthetic data of any kind, not just population data, because the intersecting aspects of a dataset generate complex combinations of whatever the data describes.
In our synthetic dataset, the statistical representation of distinct categories was very good. For example, the age distribution was similar in the synthetic and original data. Not identical, but close. This is a good thing, since synthetic data should be similar to the original, not an exact reproduction.
We then analyzed the intersections in the synthetic data. More complex intersections were reproduced well, too. For example, the intersection of age, income, and gender was reproduced quite accurately in the synthetic dataset. We call this kind of accuracy "intersection fidelity."
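One rough way to quantify this, sketched below, is to compare the joint distribution of a few columns in the original and synthetic data using total variation distance. This is an illustrative metric rather than the exact one used in our study, and the column names are invented.

```python
# Sketch: a rough measure of "intersection fidelity" for a set of columns,
# based on the total variation distance between the joint category
# distributions of the original and synthetic data. A score of 1 means
# the joint distribution is reproduced exactly; 0 means no overlap.
# Illustrative only; column names are hypothetical.
import pandas as pd

def intersection_fidelity(original: pd.DataFrame,
                          synthetic: pd.DataFrame,
                          columns: list) -> float:
    # Joint distribution of the chosen columns in each dataset.
    p = original.groupby(columns).size() / len(original)
    q = synthetic.groupby(columns).size() / len(synthetic)
    # Align so combinations missing on one side count as probability 0.
    p, q = p.align(q, fill_value=0.0)
    total_variation = 0.5 * (p - q).abs().sum()
    return 1.0 - total_variation

# Example usage:
# print(intersection_fidelity(original, synthetic, ["age_group", "income", "gender"]))
```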
However, we also noticed 333 data points in the synthetic data labeled "husband/wife and single": a cross hallucination. The AI had not learned (or been taught) that this combination is impossible. More than 100 of these data points were "unmarried husbands earning less than $50,000 a year," a cross hallucination that did not exist anywhere in the original data.
Meanwhile, the original data set contained multiple “widowed women working in tech support” who were completely absent from the synthetic version.
That is, our synthetic dataset can be used to study questions about age, income, and gender (where there is intersection fidelity), but not if you are interested in "widowed women who work in tech support." And you would need to check carefully whether your results include hallucinated "unmarried husbands."
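A simple way to surface both problems, hallucinated combinations and vanished ones, is to compare the sets of combinations that occur in each dataset. The sketch below is a hypothetical illustration; the DataFrames and column names are placeholders.

```python
# Sketch: finding cross hallucinations (combinations that appear in the
# synthetic data but never in the original) and vanished combinations
# (present in the original but missing from the synthetic data).
# The DataFrames and column names are hypothetical placeholders.
import pandas as pd

def compare_intersections(original: pd.DataFrame,
                          synthetic: pd.DataFrame,
                          columns: list):
    orig_combos = set(map(tuple, original[columns].drop_duplicates().itertuples(index=False)))
    synth_combos = set(map(tuple, synthetic[columns].drop_duplicates().itertuples(index=False)))
    hallucinated = synth_combos - orig_combos   # e.g. "unmarried husbands"
    vanished = orig_combos - synth_combos       # e.g. widowed women in tech support
    return hallucinated, vanished

# Example usage:
# fake, lost = compare_intersections(original, synthetic,
#                                    ["relationship", "marital_status", "occupation"])
# print(len(fake), "hallucinated combinations;", len(lost), "vanished combinations")
```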
The big question is: where does this stop? The hallucinations above involve two-way and three-way intersections, but what about four-way intersections? Or five-way? At what point (and for which purposes) does synthetic data become irrelevant, misleading, useless, or dangerous?
Embracing cross hallucinations
Structured datasets exist because the relationships between columns in a spreadsheet give us useful information. Think of blood tests: doctors want to know how a patient's blood differs from normal values, and how it relates to other diseases and to treatment outcomes. This is why we organize data in the first place, and why we have done so for centuries.
But when using synthetic data, cross hallucinations always occur: the synthetic data needs to be slightly different from the original data, or it will just be a copy of the original. So synthetic data needs hallucinations, but only the right kind of hallucinations: ones that amplify or extend the dataset, and don't create something impossible, misleading, or biased.
The existence of cross hallucinations means that one synthetic dataset cannot serve many different applications: each use case requires a bespoke synthetic dataset with its hallucinations labeled, and that in turn requires systems for identifying and labeling them.
Building a reliable AI system
For an AI to be trustworthy, we need to know what cross hallucinations are present in its training data, especially if it is being used to predict people's behavior or to regulate, govern, treat, or police us. We need to ensure that the AI is not trained on dangerous or misleading cross hallucinations, such as a 6-year-old doctor receiving a pension.
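One modest safeguard, sketched below, is a rule-based validator that flags obviously impossible rows before a synthetic dataset is used for training. The rules and column names are hypothetical examples; real checks would need to be designed for each dataset and use case.

```python
# Sketch: a rule-based validator that flags obviously impossible rows
# before a synthetic dataset is used for training. The rules and column
# names are hypothetical examples, not an exhaustive validator.
import pandas as pd

RULES = {
    "child_receiving_pension": lambda df: (df["age"] < 16) & (df["income_source"] == "pension"),
    "child_doctor":            lambda df: (df["age"] < 18) & (df["occupation"] == "doctor"),
    "unmarried_husband":       lambda df: (df["relationship"] == "husband")
                                          & (df["marital_status"] == "single"),
}

def flag_impossible_rows(synthetic: pd.DataFrame) -> pd.DataFrame:
    """Return the rows that violate at least one rule, with an extra
    column listing which rules each row breaks."""
    violations = pd.DataFrame({name: rule(synthetic) for name, rule in RULES.items()})
    any_violation = violations.any(axis=1)
    flagged = synthetic[any_violation].copy()
    flagged["violated_rules"] = violations[any_violation].apply(
        lambda row: [name for name, hit in row.items() if hit], axis=1)
    return flagged
```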
But what happens when synthetic datasets are used carelessly? Currently, there is no standard way to identify synthetic datasets, and they are easily confused with real data. Once a dataset has been shared for others to use, it is impossible to know whether it can be trusted: which parts are hallucinated and which are not. A clear, universally recognizable way to identify synthetic data is needed.
Cross hallucinations may not be as interesting as hands with 15 fingers or recommendations to put glue on pizza. They are boring and unappealing numbers and statistics, but they affect us all. Sooner or later, synthetic data will be everywhere, and by its very nature, cross hallucinations will inevitably be included. Some we want, some we don't, but the problem is distinguishing between them. We need to make this possible before it's too late.
This article is republished from The Conversation under a Creative Commons license. Read the original article.