Tuesday, 8 October 2024

US AI Giants Accused of Secretly “Seizing” Data

The rapid development of artificial intelligence (AI) relies heavily on large volumes of data for training models. However, the scarcity of high-quality data and the closed data ecosystems in some fields appear to be hindering that development.

According to reports from several foreign media outlets, companies such as OpenAI, Google, and Meta are scouring the internet for information to train their latest AI systems. In doing so, they are reportedly disregarding established policies, deliberately changing their own rules, and attempting to circumvent copyright law.

Shortcutting Data Collection

A recent article in The New York Times reported that tech giants have been “cutting corners” to collect training data for their AI systems. OpenAI developed a speech recognition tool called Whisper to transcribe the audio of YouTube videos into plain text, creating a source of conversational data for training its next-generation text model, GPT-4.

Business Insider reported that YouTube explicitly prohibits applications “independent” of its platform from using its video content. However, OpenAI’s data collection was not accidental.

In fact, OpenAI employees knew that the practice fell into a legal gray area, and OpenAI President Greg Brockman personally helped collect the videos that were used. OpenAI nonetheless considered the practice reasonable and ultimately transcribed more than one million hours of YouTube video.

The biggest mystery is how OpenAI accessed enough YouTube videos to complete this work.

When asked whether the company used YouTube videos to train Sora, its video generation model, OpenAI’s Chief Technology Officer, Mira Murati, said she was not sure. When asked again about the source of the training data, she declined to disclose details.

The New York Times reported that Google, like OpenAI, transcribes YouTube videos to collect text for its AI models, potentially infringing the copyrights of video creators. Last year, Google also changed its terms of service. The motivation behind the move appears clear: the new terms allow Google to train its AI on publicly available documents in Google Docs and on other material, such as restaurant reviews uploaded to Google Maps.

Facing “Data Bottlenecks”

For tech companies, vast amounts of data are the core nutrient of generative AI and the main battleground in the development of large models. Only with sufficient data can these systems generate, in real time, text, images, sound, and video that resemble human creations.

However, as AI develops, the limited supply of information on the internet, the lack of high-quality text data, and the monopoly that tech giants hold over high-quality data may lead to “nutrient deficiencies” in AI. Even though Google and Meta have billions of users generating search queries and social media posts every day, this data is largely restricted by privacy laws and by the companies’ own policies, which prevents their AI systems from using it.

These tech companies seem to be in dire straits. According to the AI research firm Epoch, they could exhaust the high-quality data on the internet as soon as 2026, because they are consuming data faster than it is being produced.

Meta also faces limits on the training data available to it. The company has considered measures such as paying for book licenses or even acquiring a large publishing house outright. Meta has also made privacy-centered changes that restrict how it can use consumer data.

In the face of the data shortage, many companies are even trying to “feed” AI with AI. Companies including Microsoft and OpenAI are feeding the results generated by large models, also known as “synthetic data,” to smaller models. However, some research suggests that synthetic data will eventually lead AI to “eat its own tail.”

Facing Multiple Lawsuits for Copyright Infringement

Last year, The New York Times sued OpenAI and Microsoft, claiming that they used copyrighted news articles to train AI chatbots without permission. OpenAI and Microsoft responded that this was “fair use,” meaning it is permitted under copyright law, because they had transformed the works for a different purpose.

Last year, over 10,000 trade groups, authors, companies, and others submitted opinions to the US Copyright Office regarding the use of creative works by AI models.

The rapid rise of generative AI has sparked a global competition for high-quality data. However, in this new field, there are no clear regulations on what is legal or ethical.

Business Insider reported that currently, Google, OpenAI, and other tech companies are arguing that using copyrighted content for AI model training is legal, but regulators and courts have not yet ruled on this matter.

Justine Bateman, a US filmmaker, former actress, and author, told the Copyright Office that AI models were taking her work without permission or payment. She described it as “the biggest theft in America.”
