There’s been a lot of talk about Sora lately, and for good reason. It’s causing a lot of controversy on many fronts. Some are worried about the jobs it will make obsolete, while others are concerned about how it’s being trained.
OpenAI says they “aren’t sure” if Sora is being trained on YouTube videos. But in an interview with Bloomberg, YouTube CEO Neal Mohan has gone on record. He says that using YouTube videos for training AI models would be a violation of its Terms of Service.
OpenAI’s Sora vs YouTube
Sora seems to be the most talked about thing on the Internet right now, especially since OpenAI dropped its latest batch of examples. It’s even been used to make a music video. The focus has moved on from large language models (LLMs) like ChatGPT and single-image generation tools like Midjourney and DALL-E to video.
But how is Sora being trained? OpenAI is a company that’s been plagued with controversy for a while now. The company’s had lawsuits filed against it for allegedly training its various models on private data, as well as stolen photos.
Now, it seems, the controversy is back once again with Sora. This is the company’s AI tool for generating video. And it’s come a very long way in a very short space of time – at least publicly. But has it been training its models on YouTube video content?
OpenAI CTO has no idea what they’re doing
In an interview last month, OpenAI CTO Mira Murati said that Sora will be available to the general public at some point during 2024. When specifically asked what data the model was trained on, the Wall Street Journal reports that she became evasive and didn’t go into details.
I’m not going to go into the details of the data that was used, but it was publicly available or licensed data
Mira Murati
She did confirm that they have used content from Shutterstock, with whom they have a partnership. But when pressed, she said she did not know whether it was trained on video content from YouTube, Facebook and Instagram.
Now, call me an old cynic, but to me, that sounds like she absolutely knows the answer, and they absolutely did use videos from YouTube, Facebook and Instagram to train their models. Of course, that’s just my opinion.
YouTube CEO says it’s a 100% breach of their TOS
In an interview with Bloomberg published yesterday, YouTube CEO Neal Mohan was asked if he could confirm whether or not OpenAI has been using YouTube content to train its models. He says that he also doesn’t know.
Giving him the benefit of the doubt, how could he? He doesn’t work for OpenAI, and if OpenAI wanted to hide their activity on YouTube, it wouldn’t be difficult with the assistance of some VPNs to download the video data anonymously. I’m not saying that OpenAI has done this, just that hypothetically, it’s possible.
What he did say, though, was that he’s seen reports stating that it may or may not have been used. He also said that if OpenAI has used YouTube video content to train their models, it’s a breach of YouTube’s terms of service (TOS).
We have a clear terms of service that um, uh, when a… you know… again, from a creator’s perspective, when a creator uploads their, you know, their hard work to our platform, they have certain expectations. One of those expectations is that the terms of service is going to be abided by.
Neal Mohan
He was clearly a little unprepared for the question and unsure how to answer it. When asked if Google trains their own Gemini (formerly Bard) AI on YouTube data, things were a little more vague. He says that they are bound by the same terms of service as OpenAI or any other YouTube user, although they have trained on some YouTube data.
Google says that this data has been collected through individual contracts with certain creators on the platform or within the terms of the YouTube TOS – which are different for YouTube/Google than they are for everyone else. Let’s take a look at that YouTube TOS, or at least the parts that may be relevant here.
Rights you Grant
You retain all of your ownership rights in your Content. In short, what belongs to you stays yours. However, we do require you to grant certain rights to YouTube and other users of the Service, as described below.
Licence to YouTube
By providing Content to the Service, you grant to YouTube a worldwide, non-exclusive, royalty-free, transferable, sublicensable licence to use that Content (including to reproduce, distribute, modify, display and perform it) for the purpose of operating, promoting, and improving the Service.
It says, essentially, that when you upload content to YouTube, YouTube is allowed to do whatever it wants with it, as long as it’s in service of operating, promoting or improving the service that YouTube offers.
Now, again, hypothetically speaking, Google might argue that Gemini – or any other AI it’s working on – is being created to help improve the service that YouTube offers. This would need to be tested in court, but it means that, yes, you’re allowing them to use your videos to train Google Gemini if they choose to do so.
TL;DR – OpenAI can’t, YouTube can
In short, if OpenAI is training its models on YouTube data, OpenAI is breaking the YouTube TOS. If Google is using it to train Gemini, Google/YouTube aren’t – because of what we grant them when we upload.
Either way, the copyright situation doesn’t look like it’s going to get any clearer in the near future. And by the time the lawsuits have been filed and new laws come into effect, it’ll be far too late to do anything about it anyway.
Each of their respective models will have more than enough data to do what it needs to do. We’ll likely see more class action suits. We may even see some tech-heavyweight battles in the courtroom, too.
But if you don’t want your video content used to train AI models, don’t post it to the Internet.
[via The Verge]