Meta’s lagging AI efforts are making news again. Microsoft CEO Satya Nadella recently admitted that OpenAI had a two-year runway in the AI race to work uncontested and build ChatGPT. While other top AI labs, such as Anthropic and Google, are swiftly closing the gap, Meta is seemingly struggling to keep up.
According to internal Meta communications disclosed during a major copyright lawsuit, the company allegedly used copyrighted content to train its AI models and seemingly tried to cover its tracks to avoid copyright infringement-related issues (via The Verge).
Interestingly, the company’s deceitful tactics were aimed at catching up more quickly with OpenAI’s rapid progress in the AI landscape. An email sent to Meta AI researcher Hugo Touvron by the company’s VP of gen AI revealed that the company’s goal “needs to be GPT4,” which would involve learning “how to build frontier and win this race.”
However, the intricate details of the Facebook maker’s plans to achieve these goals reportedly involved using the book piracy site Library Genesis (LibGen) to train its models.
The Verge’s damning report further revealed another email from Meta’s Director of Product, Sony Theakanath, to Joelle Pineau, VP of AI Research, seeking clarity on whether to use LibGen’s data only internally, for benchmarks included in a blog post, or to train a model. In the email, Theakanath indicated Gen AI had been approved to use LibGen for Llama3 but with several mitigations, including removing data labeled as pirated or stolen and not disclosing that the model was trained using data from the site.
According to Theakanath, “Libgen is essential to meet SOTA [state-of-the-art] numbers.” He further indicated that “it is known that OpenAI and Mistral are using the library for their models (through word of mouth)” after escalating the issue to an executive within the organization under MZ, presumably Meta CEO Mark Zuckerberg.
The email also highlighted potential policy risks of training the AI models on copyrighted content, including regulatory response and intervention following media coverage highlighting Meta’s copyright infringement practices. “This may undermine our negotiating position with regulators on these issues,” added Theakanath.
Meta reportedly turned to crafty measures to cover its tracks after using LibGen’s data to train its AI models, including removing copyright headers and document identifiers such as the copyright symbol. The document also disclosed employee comments about further obscuring the data’s origin, including removing metadata “to avoid potential legal complications.”
Copyright infringement is seemingly crucial for AI model training
Microsoft and OpenAI have been embroiled in numerous copyright infringement lawsuits. And while some of these cases are still in court, OpenAI CEO Sam Altman admitted that training AI models without copyrighted content is virtually impossible. He further indicated that almost everything on the internet is copyrighted and characterized the use of copyrighted content to train AI models as fair use. He argued that copyright law doesn’t categorically prohibit training AI models on copyrighted content.
More recently, reports indicated that top AI labs, including OpenAI and Anthropic, are struggling to develop advanced AI systems due to a lack of high-quality content. However, leaders in the AI landscape, including Sam Altman and former Google CEO Eric Schmidt, have disputed the claims, saying there is no evidence that scaling laws have begun to stop: “there’s no wall.”