Chinese AI startup DeepSeek burst onto the AI scene earlier this year with its ultra-cost-effective, V3-powered R1 AI model. Consequently, it raised concerns among investors, especially after it surpassed OpenAI’s o1 reasoning model across a wide range of benchmarks, including math, science, and coding, at a fraction of the cost.
While DeepSeek researchers claimed the company spent approximately $6 million to train its cost-effective model, multiple reports suggest that it cut corners by using Microsoft and OpenAI’s copyrighted content to train its model.
Another report claimed that the Chinese AI startup spent up to $1.6 billion on hardware, including 50,000 NVIDIA Hopper GPUs. OpenAI also lodged a complaint, alleging that DeepSeek used OpenAI’s models to train its own cost-effective AI model.
The ChatGPT maker claimed DeepSeek used “distillation” to train its R1 model. For context, distillation is the process whereby a company (in this case, DeepSeek) leverages the output of a preexisting model (in this case, OpenAI’s) to train a new model.
As such, the company avoids much of the exorbitant cost of developing and training an AI model from scratch. And as it now seems, OpenAI’s accusations may hold some water.
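To make the idea concrete, here is a minimal Python sketch of output-based distillation as described above: a "teacher" model answers a set of prompts, and the resulting prompt/answer pairs become training data for a "student." Everything in it is an illustrative assumption rather than anything DeepSeek or OpenAI is known to have run: `build_distillation_set`, `LookupStudent`, and `toy_teacher` are hypothetical names, and the student simply memorizes answers where a real system would fine-tune a neural network.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical type: a "model" is any function mapping a prompt to a text answer.
Model = Callable[[str], str]


def build_distillation_set(prompts: List[str], teacher: Model) -> List[Tuple[str, str]]:
    """Query the teacher once per prompt and keep (prompt, answer) pairs."""
    return [(prompt, teacher(prompt)) for prompt in prompts]


class LookupStudent:
    """Toy stand-in for a trainable model: it only memorizes the teacher's answers.

    A real student would be fine-tuned on the pairs; that step is omitted here.
    """

    def __init__(self) -> None:
        self.memory: Dict[str, str] = {}

    def fit(self, pairs: List[Tuple[str, str]]) -> None:
        for prompt, answer in pairs:
            self.memory[prompt] = answer

    def generate(self, prompt: str) -> str:
        # Unknown prompts yield an empty string in this toy version.
        return self.memory.get(prompt, "")


def toy_teacher(prompt: str) -> str:
    """Assumed placeholder for an expensive, preexisting model."""
    return prompt.upper()


pairs = build_distillation_set(["hello", "world"], toy_teacher)
student = LookupStudent()
student.fit(pairs)
```

The cost saving comes from the first step: the student's developer pays only for queries to the teacher, not for producing the teacher's knowledge in the first place.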
A new study by AI detection firm Copyleaks reveals that DeepSeek’s AI-generated outputs are reminiscent of OpenAI’s ChatGPT. Perhaps more concerning, the study’s findings revealed a 74.2% resemblance (via Forbes).
Did DeepSeek train its AI model using OpenAI’s copyrighted content? The tell-tale signs suggest as much
Copyleaks uses screening tech and algorithmic classifiers to identify text generated by AI models. For this specific study, the classifiers unanimously voted that DeepSeek’s outputs were generated using OpenAI’s models.
Interestingly, the AI detection firm has used this approach to identify text generated by other AI models, including those from OpenAI, Claude, Gemini, and Llama, each of which it distinguished as having a unique style. The classifiers use unanimous voting as standard practice to reduce false positives.
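The unanimous-voting idea can be sketched in a few lines of Python. This is an illustration under stated assumptions, not Copyleaks' actual system: the `unanimous_vote` function and the keyword-based toy classifiers are hypothetical, and real classifiers would be trained models rather than string checks.

```python
from typing import Callable, Optional, Sequence

# Hypothetical type: a classifier maps a text to the label of its suspected source model.
Classifier = Callable[[str], str]


def unanimous_vote(text: str, classifiers: Sequence[Classifier]) -> Optional[str]:
    """Return a label only when every classifier agrees; otherwise return None."""
    if not classifiers:
        return None
    labels = {classify(text) for classify in classifiers}
    # A single distinct label means the "jury" was unanimous.
    return labels.pop() if len(labels) == 1 else None


# Toy stand-ins for trained classifiers (assumptions for illustration only).
def by_phrase(text: str) -> str:
    return "openai" if "delve" in text else "other"


def by_length(text: str) -> str:
    return "openai" if len(text) > 20 else "other"


def by_punctuation(text: str) -> str:
    return "openai" if text.endswith(".") else "other"


jury = [by_phrase, by_length, by_punctuation]
agreed = unanimous_vote("Let us delve into the details.", jury)
split = unanimous_vote("delve", jury)
```

Requiring full agreement trades recall for precision: a text gets no attribution when even one classifier disagrees, which is how this kind of scheme keeps false positives down.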
Shai Nisan, head of data science at Copyleaks, indicated:
“Our research utilized a ‘unanimous jury’ approach and identified a strong stylistic similarity between DeepSeek and OpenAI’s models, which wasn’t found with other inspected models.”
While investors had already begun raising concerns about the large amounts invested in developing and training AI models, the study’s findings raise questions about how DeepSeek trained and developed its AI model, and whether its approach was truly cost-effective.
As highlighted by Nisan:
“While this similarity doesn’t definitively prove or declare DeepSeek as a derivative, it does raise questions about its development. Our research specifically focuses on writing style; within that domain, the similarity to OpenAI is significant. Considering OpenAI’s market lead, our findings suggest that further investigation into DeepSeek’s architecture, training data and development process is necessary.”
What’s next for DeepSeek if found guilty of copyright infringement?
While the study’s findings suggest DeepSeek’s AI-generated texts resemble OpenAI’s ChatGPT by 74.2%, that doesn’t necessarily make the AI model a carbon copy. However, it could brew more trouble for the AI startup, saddling it with IP rights and copyright infringement issues.
The fact that DeepSeek hasn’t explicitly stated that it used OpenAI’s models to train its own makes the situation worse and could lead to significant legal and financial setbacks.
According to Copyleaks’ Head of Data Science:
“The research strongly suggests that transparency and strong IP protections are paramount in the future of AI development and regulation. Regulators are likely to consider requiring companies to disclose detailed information about the datasets and model outputs used in training their models.”
OpenAI has multiple copyright infringement skeletons in its closet
As you may know, OpenAI and Microsoft are no strangers to the corridors of justice, especially when it comes to copyright infringement issues stemming from their AI efforts. For instance, eight news publishers filed copyright infringement lawsuits against Microsoft and OpenAI in May 2024.
OpenAI CEO Sam Altman argued that copyright law doesn’t categorically prohibit the use of copyrighted content for training AI models. However, the executive admitted that developing ChatGPT-like tools without copyrighted content is virtually impossible.
To that end, with the rapid emergence of AI-powered tools, copyright infringement is seemingly trapped in a grey area, making it difficult to draw the line on when AI firms are outright stealing content from publishers and other internet sources.