Just a few months ago, Wall Street’s massive bet on generative AI faced a pivotal moment with the arrival of DeepSeek. Despite its tightly controlled outputs, the open-source DeepSeek showed that a cutting-edge reasoning model can be built without billions in funding, on far more modest resources.
DeepSeek quickly gained traction among industry giants like Huawei, Oppo, and Vivo, while major players such as Microsoft, Alibaba, and Tencent integrated it into their platforms. The company’s latest ambition focuses on developing self-improving AI models that employ a looping judge-reward mechanism to enhance their performance.
In a preprint paper (cited by Bloomberg), researchers from DeepSeek and China’s Tsinghua University outline a new approach for making AI models smarter and more efficient in a self-improving fashion. The technique is called self-principled critique tuning (SPCT), and it is used to train what the paper calls generative reward models (GRM).
In simple terms, the concept works like a real-time feedback loop. Making an AI model smarter usually means scaling it up during training, which demands enormous human effort and computing power. DeepSeek instead proposes a system in which a built-in “judge” generates its own principles and critiques for each response the model produces to a user query.
That set of principles and critiques is then compared against the model’s static rules and the expected outcome. If the two align closely, a reward signal is generated, steering the AI toward better performance in the next cycle, as sketched below.
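To make the loop concrete, here is a minimal, purely illustrative Python sketch of a judge-and-reward cycle. Every function name and scoring rule below is a placeholder invented for this example; it shows only the general shape of the idea, not DeepSeek’s actual SPCT implementation.

```python
# Conceptual sketch of a judge/critique reward loop (not DeepSeek's code).
# All names and values here are illustrative placeholders.

def generate_response(query: str) -> str:
    """Stand-in for the policy model producing an answer to a user query."""
    return f"Draft answer to: {query}"

def judge(query: str, response: str) -> dict:
    """Stand-in for a generative reward model: it writes its own principles
    and a critique of the response, then scores the response against them."""
    principles = ["be factually grounded", "answer the question directly"]
    critique = f"Checked the draft against {len(principles)} principles."
    score = 0.8  # placeholder; a real judge would derive this from its critique
    return {"principles": principles, "critique": critique, "score": score}

def reward_signal(judge_output: dict, threshold: float = 0.7) -> float:
    """Turn the judge's self-generated evaluation into a scalar reward that
    guides the next round of training or sampling."""
    return 1.0 if judge_output["score"] >= threshold else 0.0

if __name__ == "__main__":
    query = "Explain why the sky is blue."
    response = generate_response(query)
    evaluation = judge(query, response)
    print("Critique:", evaluation["critique"])
    print("Reward:", reward_signal(evaluation))
```

In a real system, the judge would itself be a large model whose written critique is converted into a numeric reward, rather than a hard-coded score as in this toy version.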
The researchers call this next generation of self-improving AI models DeepSeek-GRM. According to the benchmarks presented in their paper, these models reportedly outperform Google’s Gemini, Meta’s Llama, and OpenAI’s GPT-4. DeepSeek plans to release the models through open-source channels.
The prospect of AI systems capable of self-improvement has sparked ambitious and contentious discussions. Former Google CEO Eric Schmidt has even suggested the necessity of a kill switch for such technologies, stating, “When the system can self-improve, we need to seriously think about unplugging it,” as reported by Fortune.
The idea of recursively self-improving AI is not new; it dates back to mathematician I.J. Good in 1965, who posited the concept of an ultra-intelligent machine capable of creating even more advanced machines. In 2007, AI specialist Eliezer Yudkowsky envisioned Seed AI, an AI designed for self-understanding, self-modification, and recursive self-improvement.
In 2024, Japan’s Sakana AI introduced the notion of an “AI Scientist,” a system capable of automating the entire research-paper process, from generating ideas to writing up results. In March this year, researchers at Meta unveiled self-rewarding language models, in which the model itself acts as a judge that issues rewards during training (see the rough sketch below).
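For comparison with the DeepSeek sketch above, here is an equally rough and hypothetical sketch of the self-rewarding idea: the same model samples several candidate answers, scores them itself, and keeps the best and worst as a preference pair for further fine-tuning. Again, all function names and the scoring rule are placeholders, not Meta’s released code.

```python
# Illustrative sketch of a self-rewarding training step (placeholder code).
import random

def sample_responses(prompt: str, n: int = 4) -> list[str]:
    """Stand-in for sampling n candidate answers from the model."""
    return [f"Candidate {i} for: {prompt}" for i in range(n)]

def self_score(prompt: str, response: str) -> float:
    """Stand-in for the same model acting as judge and rating a response 0-10."""
    return random.uniform(0, 10)

def build_preference_pair(prompt: str) -> tuple[str, str]:
    """Return (chosen, rejected) answers for preference-based fine-tuning."""
    candidates = sample_responses(prompt)
    ranked = sorted(candidates, key=lambda r: self_score(prompt, r), reverse=True)
    return ranked[0], ranked[-1]

if __name__ == "__main__":
    chosen, rejected = build_preference_pair("Summarize the water cycle.")
    print("Chosen:", chosen)
    print("Rejected:", rejected)
```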
Meta’s internal evaluations showed that a Llama 2 model trained with this self-rewarding technique surpassed rivals including Anthropic’s Claude 2, Google’s Gemini Pro, and OpenAI’s GPT-4. Separately, Amazon-backed Anthropic has described what it calls reward tampering, an unexpected phenomenon in which a model alters its own reward mechanism.
Google has also ventured into this territory. In a recent study published in the journal Nature, researchers at Google DeepMind presented an AI algorithm called Dreamer that can improve itself, using Minecraft as a testbed.
At IBM, researchers are developing a method known as deductive closure training, where an AI model assesses its own outputs against training data to facilitate self-improvement. However, this entire premise is not without its challenges.
Research indicates that when AI models try to train themselves on self-generated synthetic data, they can degrade in a way known as “model collapse.” It will be interesting to see how DeepSeek implements the concept and whether it can do so more efficiently than its Western counterparts.