Building a large AI model today can cost hundreds of millions of dollars, with predictions that it could reach billions of dollars within a few years. Much of that cost goes to the computing power of specialized chips (typically Nvidia GPUs); companies may need tens of thousands of them, at roughly $30,000 each.
But companies training AI models, or tweaking existing models to improve performance on specific tasks, also struggle with another often-overlooked cost: data labeling, the painstaking process of tagging training data so that AI models can recognize and interpret patterns.
For example, data labeling has been used for years to develop AI models for self-driving cars: cameras capture images of pedestrians, road signs, cars, and traffic lights, and human annotators label the images with words like “pedestrian,” “truck,” and “stop sign.” This labor-intensive process has also raised ethical concerns. After releasing ChatGPT in 2022, OpenAI was widely criticized for outsourcing data labeling work that helped make the chatbot less harmful to Kenyan workers earning less than $2 an hour.
Today's general-purpose large language models (LLMs) go through a data labeling effort called reinforcement learning from human feedback (RLHF), in which humans provide qualitative feedback and rankings on what the model produces. This is one of the major drivers of rising costs, as is the effort required to label any proprietary data that companies want to incorporate into AI models, such as customer information or internal company data.
Additionally, highly technical and specialized data labeling is driving up costs in fields like law, finance, and healthcare, as some companies hire expensive doctors, lawyers, PhDs, and scientists to label certain data, or outsource the work to third-party companies like Scale AI, which recently secured a staggering $1 billion in funding after its CEO predicted significant revenue growth by the end of the year.
“Labeling requires lawyers, which is a waste of legal time,” said William Falcon, CEO of AI development platform Lightning AI. “Anything that poses significant risk” requires expert-level labeling, he explained. “Chatting with your 'virtual best friend' doesn't pose significant risk, but providing legal advice does.”
Alex Ratner, CEO of data-labeling startup Snorkel AI, said enterprise customers can spend millions of dollars on data labeling and related data tasks, which can consume 80% of their AI time and budgets. He added that over time, data also needs to be re-labeled to keep it up to date.
Matt Schumer, CEO and co-founder of AI assistant startup Otherside AI, agreed that fine-tuning LLMs has become costly. “Over the last few years, we've gone from middle school level data being sufficient to high school, college, and now professional level data,” he said. “And it's certainly not cheap.”
That can be a budget headache for tech startups operating in a critical sector like healthcare. Neil Shah, CEO of elderly caregiving platform CareYaya, which received a grant from Johns Hopkins University to develop “the world's first AI caregiver trainer for dementia patients,” said the cost of data labeling “has been killing us.” He said costs have skyrocketed 40% in the past year because the work requires expert input from gerontologists, dementia specialists, and veteran caregivers. He's working to reduce the cost by asking healthcare students and professors to label the data.
Bob Rogers, CEO of Oii.ai, a data science company that specializes in supply chain modeling, said he has seen data labeling projects that cost millions of dollars. He said a platform like BeeKeeper AI can reduce costs by allowing multiple companies to share experts, data and algorithms without exposing their private data to others.
Kjell Carlsson, head of AI strategy at Domino Data Lab, added that some companies are reducing costs by at least partially automating data collection and labeling using “synthetic” data, i.e. data generated by AI itself. In some cases, models can fully automate data labeling. For example, biopharmaceutical companies are training generative AI models to design synthetic proteins for conditions such as colon cancer, diabetes, and heart disease; they then automatically run experiments on the models' outputs, producing new training data that comes with its own labels.
But in the end, while data labeling may be costly and time-consuming, it's well worth it. “Data labeling is hard work,” says CareYaya's Shah, “but the payoff is huge.”
Sharon Goldman
This story originally appeared on Fortune.com.