DeepSeek’s Latest AI Model Sparks Concerns Over Training Data Sources
Last week, DeepSeek, a prominent Chinese AI laboratory, unveiled R1-0528, an enhanced version of its R1 reasoning model. The model has posted impressive results on a range of math and coding benchmarks, but the company has remained tight-lipped about the data used to train it, prompting speculation among AI researchers.
Some experts believe that DeepSeek may have trained its model on outputs from Google’s Gemini family of AI models. Sam Paech, a developer focused on AI emotional intelligence, presented evidence suggesting that R1-0528 favors words and expressions similar to those of Gemini 2.5 Pro. Additionally, the model’s “thought traces” during problem-solving resemble those generated by Gemini, further fueling these suspicions.
This isn’t the first time DeepSeek has been accused of using data from rival AI models. In December, observations indicated that DeepSeek’s V3 model often identified itself as ChatGPT, hinting at potential training on OpenAI’s chatbot logs. Earlier this year, OpenAI reported detecting data extraction attempts linked to DeepSeek through their developer accounts, suggesting the use of distillation techniques to train rival models using outputs from larger, more capable models.
While distillation is a common practice in AI development, OpenAI’s terms of service explicitly prohibit using its model outputs to build competing AI systems. Complicating matters, the open web is increasingly saturated with AI-generated content from content farms and bots, which makes it difficult for any lab to filter proprietary model outputs out of its training data.
AI experts such as Nathan Lambert of AI2 consider it plausible that DeepSeek trained its model on Gemini data. In response, AI companies are tightening security measures to prevent such practices. OpenAI, for instance, now requires organizations to complete ID verification before accessing its most advanced models, a process unavailable in certain regions, including China. Google, meanwhile, has begun summarizing its models’ reasoning traces, and Anthropic has announced plans to do the same, making those traces harder to use for training rival models.
As the AI landscape becomes increasingly competitive, ensuring the integrity of training data remains a critical concern. Companies are taking proactive steps to safeguard their models and maintain fair competition in the rapidly evolving field of artificial intelligence.