Saturday , 25 May 2024

US AI Giants Accused of Secretly “Seizing” Data

The rapid development of artificial intelligence (AI) relies heavily on training models. However, the scarcity of high-quality data and the closed data ecosystems in some fields appear to be hindering AI development.

According to reports from several foreign media outlets, companies such as OpenAI, Google, and Meta are scouring the internet for information to train their latest AI systems. In doing so, they are reportedly disregarding established policies, deliberately changing their own rules, and attempting to circumvent copyright law.

Shortcutting Data Collection

A recent article in The New York Times pointed out that tech giants have been “cutting corners” to collect training data for their AI systems. OpenAI developed a speech recognition tool called Whisper and used it to transcribe the audio of YouTube videos into plain text, thereby creating a source of conversational data to train its next-generation text-based model, GPT-4.

Business Insider reported that YouTube explicitly prohibits applications “independent” of its platform from using its video content. OpenAI’s data collection, however, was not accidental.

In fact, OpenAI employees knew that this action fell into a legal gray area. OpenAI President Greg Brockman even personally took part in collecting the videos. The company nevertheless considered the practice reasonable and ultimately transcribed more than a million hours of video.

The biggest mystery is how OpenAI accessed enough YouTube videos to complete this work.

When asked whether the company used YouTube videos to train Sora, its video generation model, OpenAI’s Chief Technology Officer, Mira Murati, said she was not sure. When asked again about the source of the training data, she declined to disclose details.

The New York Times reported that, like OpenAI, Google also transcribes YouTube videos to collect text for its AI models, potentially infringing on the copyrights of video creators. Last year, Google also changed its terms of service. The motivation behind this move is clear: it allows AI to train on data from publicly available documents in Google Docs and other materials such as restaurant reviews uploaded to Google Maps.

Facing “Data Bottlenecks”

For tech companies, vast amounts of data are the core nutrients for generative AI and the battleground for the development of large models. Only with sufficient data can the technology generate text, images, audio, and video resembling human creations in real time and continue to advance.

However, as AI develops, the scarcity of existing internet information, the lack of high-quality textual data, and the monopoly of high-quality data by tech giants may lead to “nutrient deficiencies” in AI. Even though Google and Meta have billions of users generating search queries and social media posts every day, these data are largely restricted by privacy laws and their own policies, preventing AI from leveraging this content.

These tech companies seem to be in dire straits. According to the AI research firm Epoch, tech companies are expected to exhaust high-quality data on the internet as soon as 2026. These companies are using data faster than it is being generated.

Meta also faces limitations in the availability of training data. The company plans to take some measures, such as paying for book licenses or even directly acquiring a large publishing house. Meta has also made privacy-centered changes, so the way it uses consumer data is evidently restricted.

In the face of the data shortage, many companies are even trying to “feed” AI with AI. Companies including Microsoft and OpenAI are feeding the results generated by large models, also known as “synthetic data,” to smaller models. However, some research suggests that synthetic data will eventually lead AI to “eat its own tail.”

Facing Multiple Lawsuits for Copyright Infringement

Last year, The New York Times sued OpenAI and Microsoft, claiming that they used copyrighted news articles to train AI chatbots without permission. OpenAI and Microsoft responded that this was “fair use,” or permissible under copyright law, as they had transformed the works for different purposes.

Last year, over 10,000 trade groups, authors, companies, and others submitted opinions to the US Copyright Office regarding the use of creative works by AI models.

The rapid rise of generative AI has sparked a global competition for high-quality data. However, in this new field, there are no clear regulations on what is legal or ethical.

Business Insider reported that currently, Google, OpenAI, and other tech companies are arguing that using copyrighted content for AI model training is legal, but regulators and courts have not yet ruled on this matter.

Justine Bateman, a US filmmaker, former actress, and writer, told the Copyright Office that AI models had taken the content of her works without permission or payment. She described it as “the biggest theft in America.”
