OpenAI and Google trained their AI models on text transcribed from YouTube videos, potentially violating creators’ copyrights, according to The New York Times. The report, which describes the lengths OpenAI, Google and Meta have gone to in order to maximize the amount of data they can feed to their AIs, cites numerous people with knowledge of the companies’ practices. It comes just days after YouTube CEO Neal Mohan said in an interview with Bloomberg Originals that OpenAI’s alleged use of YouTube videos to train its new text-to-video generator, Sora, would go against the platform’s policies.
According to the NYT, OpenAI used its Whisper speech recognition tool to transcribe more than one million hours of YouTube videos, which were then used to train GPT-4. The Information previously reported that OpenAI had used YouTube videos and podcasts to train the two AI systems. OpenAI president Greg Brockman was reportedly among the people on the team that collected those videos. Per Google’s rules, “unauthorized scraping or downloading of YouTube content” is not allowed, Google spokesperson Matt Bryant told the NYT, adding that the company was unaware of any such use by OpenAI.
The report, however, claims there were people at Google who knew but did not take action against OpenAI because Google was using YouTube videos to train its own AI models. Google told NYT it only does so with videos from creators who have agreed to take part in an experimental program. Engadget has reached out to Google and OpenAI for comment.
The NYT report also claims Google tweaked its privacy policy in June 2022 to more broadly cover its use of publicly available content, including Google Docs and Google Sheets, to train its AI models and products. Bryant told NYT that this is only done with the permission of users who opt into Google’s experimental features, and that the company “did not start training on additional types of data based on this language change.”
This article originally appeared on Engadget at https://www.engadget.com/openai-and-google-reportedly-used-transcriptions-of-youtube-videos-to-train-their-ai-models-163531073.html?src=rss