Building a large AI model today can cost hundreds of millions of dollars, with predictions that the price could climb into the billions within a few years. Much of that cost is computing power: training can require tens of thousands of specialized chips (typically Nvidia GPUs) that cost around $30,000 each.
But companies training AI models, or fine-tuning existing ones to improve performance on specific tasks, also wrestle with another often-overlooked cost: data labeling, the painstaking process of tagging the data used to train generative AI models so they can recognize and interpret patterns.
For example, data labeling has been used for years to develop AI models for self-driving cars: cameras capture images of pedestrians, road signs, cars, and traffic lights, and human annotators label the images with words like “pedestrian,” “truck,” and “stop sign.” This labor-intensive process has also raised ethical concerns. After releasing ChatGPT in 2022, OpenAI was widely criticized for outsourcing the data labeling work that helped make its chatbot less harmful to Kenyan workers earning less than $2 an hour.
Today's general-purpose large language models (LLMs) go through a data labeling effort called reinforcement learning from human feedback (RLHF), in which humans provide qualitative feedback and rankings on what the model produces. This is one of the major drivers of rising costs, as is the effort required to label the proprietary data that companies want to incorporate into AI models, such as customer information or internal company data.
Additionally, highly technical and specialized data labeling is driving up costs in fields like law, finance, and healthcare. Some companies hire expensive doctors, lawyers, PhDs, and scientists to label data, or outsource the work to third-party firms like Scale AI, which recently secured a staggering $1 billion in funding after its CEO predicted significant revenue growth by the end of the year.
“Labeling requires lawyers, which is a waste of legal time,” said William Falcon, CEO of AI development platform Lightning AI. “Anything that poses significant risk” requires expert-level labeling, he explained. “Chatting with your 'virtual best friend' doesn't pose significant risk, but providing legal advice does.”
Alex Ratner, CEO of data-labeling startup Snorkel AI, said enterprise customers can spend millions of dollars on data labeling, with data-related tasks consuming as much as 80% of their AI budgets and development time. Over time, he added, data also needs to be relabeled to stay up to date.
Matt Shumer, CEO and co-founder of AI assistant startup OthersideAI, agreed that fine-tuning LLMs has become costly. “Over the last few years, we've gone from middle school level data being sufficient to high school, college, and now professional level data,” he said. “And it's certainly not cheap.”
That can be a budget headache for tech startups operating in a critical sector like healthcare. Neil Shah, CEO of elder-care platform CareYaYa, which received a grant from Johns Hopkins University to develop “the world's first AI caregiver trainer for dementia patients,” said the cost of data labeling “has been killing us.” The cost has skyrocketed 40% in the past year, he said, because the work requires expert input from gerontologists, dementia specialists, and veteran caregivers. He's working to reduce the cost by recruiting healthcare students and professors to label the data.
Bob Rogers, CEO of Oii.ai, a data science company that specializes in supply chain modeling, said he has seen data labeling projects that cost millions of dollars. He said a platform like BeeKeeper AI can reduce costs by allowing multiple companies to share experts, data and algorithms without exposing their private data to others.
Kjell Carlsson, head of AI strategy at Domino Data Lab, added that some companies are reducing costs by at least partially automating data collection and labeling using “synthetic” data, that is, data generated by AI itself. In some cases, models can fully automate data labeling. For example, biopharmaceutical companies are training generative AI models to design synthetic proteins for conditions such as colon cancer, diabetes, and heart disease; they then automatically run experiments on the models' outputs, and the results feed back in as new training data that arrives already labeled.
In the end, costly and time-consuming as data labeling may be, many consider it well worth it. “Data labeling is hard work,” said CareYaYa's Shah, “but the payoff is huge.”
Sharon Goldman
Have comments or suggestions about this newsletter? Send them here.
Newsworthy
DeepMind staffers protest military work. According to Time, around 200 DeepMind employees are calling for Google's AI division to stop working with militaries. Their letter to management says that Google's cloud business violates the company's own AI principles by selling AI to armed forces engaged in warfare. The letter names no countries, but it links to reports about Google's dealings with the Israeli military and, allegedly, Israeli arms companies. Google says that only Israeli government ministries use its cloud services, and not for “military work related to weapons or intelligence.”
China's Amazon route. Reuters reports that Chinese government agencies are using Amazon's cloud services to access the kind of advanced chips and AI capabilities that U.S. export controls are designed to keep out of China. U.S. regulations prohibit exporting or transferring advanced chips and certain AI software to Chinese companies, but they don't bar access via the cloud. Amazon Web Services says it isn't violating any rules.
Cruise + Uber. GM's Cruise robotaxi unit, which is trying to recover from a serious setback, has signed a deal with Uber to provide self-driving rides in an as-yet-unnamed U.S. city, the Financial Times reports. Uber already has a similar deal with Alphabet's Waymo for robotaxi service in Phoenix. Cruise doesn't currently offer a driverless service, however; it is still testing with human safety drivers after a long hiatus that followed an incident in which one of its vehicles ran over and dragged a pedestrian in San Francisco.
Our Feed
“There are some interesting use cases, but overall it seems like a lot of caution is needed on this front, especially in larger enterprises with complex permissions like SharePoint and Office 365. Copilot basically proactively summarises information that you may technically have access to but shouldn't have access to.”
—Jack Berkowitz, chief data officer at Securiti, told The Register that half of the peers he surveyed had paused their deployments of Microsoft's Copilot, an AI assistant he claims is accessing internal company data it shouldn't.
In case you missed it
AI is enabling self-driving cars, so why is the industry holding back? by Sage Lazzaro
Alibaba is upgrading its Hong Kong listing to primary status, which could bring in billions of dollars in new investment, by Lionel Lim
The stranded Boeing Starliner astronauts will fly home on a SpaceX flight, but their spacesuits won't work on Elon Musk's spacecraft, by Marco Quiroz-Gutierrez
A California woman outwitted two alleged mail thieves by sending herself an AirTag, reports the Associated Press.
I sold a $1.4 billion big data startup to IBM and then founded a nature preserve. The perils of AI's energy consumption, by Chris Gladwin (Commentary)
Before you leave
Jelly Pong. Scientists have managed to teach a “soft, fluffy, water-rich gel” to play the classic video game Pong, reports The Guardian. The hydrogel isn't sentient, but it does have a kind of memory, which means it actually gets better at the game over time, the British researchers say. The jelly-like material isn't as good at Pong, though, as another system unveiled a few years ago that's based on a bunch of neurons in a dish. That system is, delightfully, named DishBrain.