DeepSeek, a Chinese AI laboratory, is suspected of using unauthorized outputs from Google's (GOOGL, Financial) Gemini model to train its upgraded reasoning model, R1-0528. The recently released model has performed strongly across a range of benchmarks, but DeepSeek has not disclosed the sources of its training data.
AI developer Sam Paech noted on X that R1-0528's vocabulary and sentence structure closely resemble those of Google's latest Gemini 2.5 Pro, suggesting Gemini outputs may have been used without authorization. Similarly, the pseudonymous developer behind SpeechMap observed that the model's reasoning traces read like content produced by Gemini, raising further questions about the data's provenance.
This is not the first such accusation against DeepSeek. In December 2024, its V3 model drew suspicion of having been trained on OpenAI chat logs after it repeatedly identified itself as ChatGPT. OpenAI said it had found indications that DeepSeek may have used distillation, a technique in which outputs from a stronger language model are used to train another model, in breach of OpenAI's terms of service. Microsoft (MSFT), an OpenAI partner, also detected significant data exfiltration linked to DeepSeek in late 2024.
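To make the distillation technique described above concrete, here is a minimal, self-contained sketch. The `teacher_model` function is a hypothetical stand-in for API calls to a stronger model; in a real pipeline, its responses would be collected at scale and used as supervised fine-tuning targets for the smaller "student" model. This is an illustration of the general idea, not a depiction of any lab's actual pipeline.

```python
def teacher_model(prompt: str) -> str:
    """Hypothetical stand-in for queries to a stronger model's API."""
    canned = {
        "What is 2 + 2?": "2 + 2 equals 4.",
        "What is the capital of France?": "The capital of France is Paris.",
    }
    return canned.get(prompt, "I'm not sure.")

def build_distillation_dataset(prompts):
    """Turn each teacher response into a training target for the student.

    The resulting prompt/target pairs would then be fed to a standard
    supervised fine-tuning loop for the smaller model.
    """
    return [{"prompt": p, "target": teacher_model(p)} for p in prompts]

dataset = build_distillation_dataset(
    ["What is 2 + 2?", "What is the capital of France?"]
)
for example in dataset:
    print(example["prompt"], "->", example["target"])
```

Because the student learns from the teacher's outputs rather than raw web text, distilled models can inherit the teacher's phrasing and self-descriptions, which is why stylistic overlap is treated as circumstantial evidence in these disputes.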
In response, OpenAI and Google are tightening safeguards against unauthorized use of their models' outputs. However, the growing volume of AI-generated content online has made filtering training data increasingly difficult.