In June, Mustafa Suleyman, CEO of Microsoft's new AI division, made a startling claim: he told CNBC's Andrew Ross Sorkin that anything you publish on the internet becomes “freeware” that can be copied and used to train AI models. In recent weeks, there has been growing scrutiny and reporting on generative AI companies pulling videos and transcripts from YouTube and using the work of independent creators to train their models. In July, online publication 404 Media reported that generative AI video company Runway had trained its models on thousands of YouTube videos without permission.
In recent months, the use of YouTube content to train generative AI models has been the subject of heated debate in the creator community. It's a complex issue that raises serious questions of consent, compensation, and creator rights. In this article, we explore the issue, what big tech companies have to say, and how training AI models on YouTube content affects creators.
Why is this such a hotly debated issue among creatives?
The field of generative AI is evolving rapidly, and companies need access to vast amounts of data to build more capable and efficient models. A major concern in the creator community is that their videos are being used to train these large-scale AI models without their explicit permission.
Several recent investigative reports suggest that AI companies are using large amounts of content from YouTube, including audio, video, and transcripts, to develop their models. While none of the major tech companies have publicly admitted to this, such practices raise serious ethical, legal, and financial questions, and many creators feel uneasy and, in some cases, exploited. This month, YouTuber David Millett filed a lawsuit against chipmaker Nvidia, alleging that the company scraped YouTube content to build video models without any permission from creators.
Similarly, a July study by Proof News, a data-driven journalism outlet, revealed that subtitles from 173,536 YouTube videos across more than 48,000 channels were used to train models by tech companies such as Nvidia, Apple, Anthropic and Salesforce. According to the report, these subtitles include transcripts of videos from educational institutions such as Harvard and MIT and online learning platforms such as Khan Academy. The outlet also created a tool for content creators to check whether their work is included in the dataset. According to the report, videos from popular creators such as Marques Brownlee, MrBeast and PewDiePie were also used to train AI models.
What are the main issues?
For many YouTubers, the biggest concern is that their content is being used to train AI models without their explicit permission. When creators upload videos to YouTube, they agree to the platform's Terms of Service, which grant YouTube a broad license to use their content: YouTube can reproduce, distribute, and even create derivative works from it. However, nowhere do the terms state that content can also be used to train AI models, a use case that did not exist when they were first drafted.
“By submitting content to the Service, you grant YouTube a worldwide, non-exclusive, royalty-free, transferable, sublicensable license to use (including copying, distributing, creating derivative works from, displaying and performing) that content. YouTube may use that content solely in connection with the Service and YouTube's (and its successors' and affiliates') business, including for promoting and redistributing part or all of the Service,” reads an excerpt from YouTube's current Terms of Service.
While this license is broad, it says nothing specific about AI training, and that lack of clarity has many creators feeling uneasy. According to news reports and social media posts, many creators believe that if their content is valuable enough to train AI models that cost billions of dollars to build, they should be compensated accordingly. At a time when platforms and publishers are signing large licensing deals for AI training data, smaller creators appear to be left behind, with no recognition or compensation for their content.
What are technology leaders saying?
Asked whether YouTube content was being used to train OpenAI's Sora, and whether that would violate the platform's policies, YouTube CEO Neal Mohan said that some creators' contracts with the platform do allow their content to be used in this way.
“When creators upload their hard work to our platform, they have certain expectations, and one of those expectations is that our terms of service will be upheld. In our terms of service, we do allow for the removal of some YouTube content, like video titles, channel names, and creator names, because that's how we achieve an open web. But we don't allow for the downloading of transcripts or parts of videos, which is a clear violation of our terms of service,” Mohan told Bloomberg's Emily Chang in an interview in May.
Similarly, Suleyman said in a CNBC interview, “When it comes to content that's already on the open web, I think the social contract for that content since the '90s has been fair use. Anyone can copy it, recreate it, reproduce it, they've been free to do whatever they want with it.” Meanwhile, OpenAI CTO Mira Murati looked puzzled when asked the same question in a WSJ interview in March. When pressed, she would only say, “I won't go into the details of the data that was used, but it was publicly available or licensed data.”
What is the legal position?
When it comes to training AI models, the legal situation is murky. Companies like Google may argue that the broad licenses in their terms of service allow them to use YouTube content for AI training, but that interpretation is untested and legally debatable. Numerous lawsuits are currently challenging the legality of using copyrighted content for AI training without explicit permission from the creator.
Beyond the legal issues, there are ethical concerns: creators care about their work, and most are uncomfortable with the idea of their content being used in ways they never imagined. To many, the idea of an AI system generating new content from their original work without consent feels like an appropriation of their creativity and skill.
Rapid advances in AI mean that ever-larger datasets will be needed to power new models. This puts creators in a tricky position: if video-sharing platforms like YouTube allow their content to be used for AI training without consent, individual creators could lose control over their work. It also points to a broader power imbalance between big corporations and individuals: big tech companies can navigate legal complexities with ease, while independent creators have far fewer resources to protect their rights.
As this issue gains momentum, YouTube creators need to stay informed and voice their concerns. They should demand more transparency from the platform about how their content is being used, especially for training AI models. Elon Musk's X, for instance, lets users opt out of having their interactions with its Grok chatbot used for AI training. A similar opt-out option for YouTube creators would go a long way toward transparency.