US AI Giants Accused of Secretly “Seizing” Data

Apr 19, 2024

The rapid development of artificial intelligence (AI) relies heavily on data for training models. However, the scarcity of high-quality data and the closed data ecosystems in some fields appear to be hindering that development.

According to reports from several foreign media outlets, companies like OpenAI, Google, and Meta are seeking online information to train their latest AI systems. However, they are disregarding established policies, deliberately changing rules, and attempting to circumvent copyright laws.

Shortcutting Data Collection

A recent report in The New York Times pointed out that tech giants have been “cutting corners” to collect training data for their AI systems. According to the report, OpenAI developed a speech recognition tool called Whisper to transcribe the audio of YouTube videos into plain text, creating a source of conversational data for training its GPT-4 model.

Business Insider reported that YouTube explicitly prohibits applications “independent” of its platform from using its video content. However, OpenAI’s data collection was not accidental.

In fact, OpenAI employees knew that this would put the company in a legal gray area. OpenAI President Greg Brockman even personally helped collect the videos. Still, OpenAI considered the practice reasonable and ultimately transcribed more than a million hours of YouTube video.

The biggest mystery is how OpenAI accessed enough YouTube videos to complete this work.

When asked if the company used YouTube videos to train Sora, OpenAI’s Chief Technology Officer, Mira Murati, said she was not sure. When asked again about the source of the training data, she declined to disclose details.

The New York Times reported that, like OpenAI, Google also transcribes YouTube videos to collect text for its AI models, potentially infringing on the copyrights of video creators. Last year, Google also changed its terms of service. The motivation behind this move is clear: it allows Google to train its AI on publicly available documents in Google Docs and on other material such as restaurant reviews uploaded to Google Maps.

Facing “Data Bottlenecks”

For tech companies, vast amounts of data are the core nutrients for generative AI and the battleground for the development of large models. Only with sufficient data can the technology generate text, images, sound, and video similar to human creations in real time, achieving system innovation.

However, as AI develops, the scarcity of existing internet information, the lack of high-quality textual data, and the monopoly of high-quality data by tech giants may lead to “nutrient deficiencies” in AI. Even though Google and Meta have billions of users generating search queries and social media posts every day, these data are largely restricted by privacy laws and their own policies, preventing AI from leveraging this content.

These tech companies seem to be in dire straits. According to the AI research firm Epoch, tech companies are expected to exhaust high-quality data on the internet as soon as 2026. These companies are using data faster than it is being generated.

Meta also faces limitations in the availability of training data. The company plans to take some measures, such as paying for book licenses or even directly acquiring a large publishing house. Meta has also made privacy-centered changes, so the way it uses consumer data is evidently restricted.

In the face of the data shortage, many companies are even trying to “feed” AI with AI. Companies including Microsoft and OpenAI are feeding the results generated by large models, also known as “synthetic data,” to smaller models. However, some research suggests that synthetic data will eventually lead AI to “eat its own tail.”

Facing Multiple Lawsuits for Copyright Infringement

Last year, The New York Times sued OpenAI and Microsoft, claiming that they used copyrighted news articles to train AI chatbots without permission. OpenAI and Microsoft responded that this was “fair use,” or permissible under copyright law, as they had transformed the works for different purposes.

Last year, over 10,000 trade groups, authors, companies, and others submitted opinions to the US Copyright Office regarding the use of creative works by AI models.

The rapid rise of generative AI has sparked a global competition for high-quality data. However, in this new field, there are no clear regulations on what is legal or ethical.

Business Insider reported that currently, Google, OpenAI, and other tech companies are arguing that using copyrighted content for AI model training is legal, but regulators and courts have not yet ruled on this matter.

Justine Bateman, a US filmmaker, former actress, and writer, told the Copyright Office that AI models obtained the content of her works without permission or payment. She described it as “the biggest theft in America.”
