Beyond OpenAI: How DeepSeek R1 and R1-Zero Are Changing AI
Shift from Supervised Learning to Reinforcement-Learning-Driven Reasoning
- Traditional LLMs (e.g., OpenAI's models): Rely on supervised fine-tuning (SFT) followed by Proximal Policy Optimization (PPO) for alignment.
- DeepSeek-R1-Zero: Pioneers Reinforcement Learning (RL) directly on a base model without SFT, using GRPO (Group Relative Policy Optimization) instead of PPO.
- Breakthrough: Eliminates the dependency on human-labeled reward data by using rule-based rewards that score reasoning accuracy and language consistency (a minimal sketch follows below).
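Because no learned reward model or human preference labels are involved, the reward can be computed with simple rules. Below is a minimal Python sketch of what such a rule-based reward might look like; the tag names and weights are illustrative assumptions, not DeepSeek's actual implementation.

```python
import re

def rule_based_reward(completion: str, reference_answer: str) -> float:
    """Toy rule-based reward: no learned reward model, no human preference labels.
    The tags and weights below are illustrative, not DeepSeek's actual rules."""
    reward = 0.0

    # Format reward: reasoning should sit inside <think>...</think> and the
    # final answer inside <answer>...</answer>.
    if re.search(r"<think>.*?</think>\s*<answer>.*?</answer>", completion, re.DOTALL):
        reward += 0.5

    # Accuracy reward: for verifiable tasks (math, code), compare the extracted
    # answer against the known reference solution.
    match = re.search(r"<answer>(.*?)</answer>", completion, re.DOTALL)
    if match and match.group(1).strip() == reference_answer.strip():
        reward += 1.0

    return reward

# Example usage
print(rule_based_reward("<think>2+2=4</think> <answer>4</answer>", "4"))  # 1.5
```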
You can also watch the full explanation in my YouTube video.
DeepSeek-R1-Zero: Strengths & Challenges
Reinforcement learning is applied directly to the base model, without Supervised Fine-Tuning (SFT), allowing the model to develop reasoning and self-re-evaluation on its own.
Advancements
- Human-Like “Rethinking”: During RL training the model learns to pause, re-evaluate, and revise its own intermediate reasoning, mimicking intuitive self-correction.
- GRPO Optimization: By dropping PPO's separate value (critic) network and scoring each output against a group of sampled completions, GRPO is more scalable and computationally efficient than PPO, enabling faster convergence (see the sketch below).
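The efficiency gain comes from how GRPO estimates advantages: it samples a group of completions per prompt and normalizes each reward against the group's mean and standard deviation, so no critic network is needed. A simplified sketch of that group-relative computation, assuming the rewards are already available:

```python
import numpy as np

def grpo_advantages(group_rewards: np.ndarray) -> np.ndarray:
    """Group-relative advantages: each sampled output is scored against the
    mean/std of its own group, so no separate value (critic) network is needed.
    This is a simplified illustration, not DeepSeek's training code."""
    mean = group_rewards.mean()
    std = group_rewards.std() + 1e-8  # avoid division by zero
    return (group_rewards - mean) / std

# Example: 4 sampled completions for one prompt, scored by a rule-based reward.
rewards = np.array([1.5, 0.5, 0.0, 1.5])
print(grpo_advantages(rewards))  # above-average samples get positive advantages
```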
Key Challenges
- Poor Readability: Outputs lacked formatting (e.g., markdown) and often mixed languages.
- Generalization Limits: Struggled to produce user-friendly CoT (Chain of Thought) and to handle multi-domain tasks.
DeepSeek-R1: Addressing Limitations via Multi-Stage Training
1. Cold-Start CoT Data
- Purpose: Seed the model with high-quality, structured reasoning data to improve readability and reasoning.
- Format: Structured outputs that use a special-token delimiter (e.g., |special_token|<reasoning_process>|special_token|<summary>) to separate the CoT reasoning from a concise summary; a formatting sketch follows this list.
- Human Priors: Non-readable responses were filtered out and summaries were added to enhance clarity.
- Impact: Better performance than R1-Zero, serving as a foundation for iterative RL.
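As a rough illustration, here is a hypothetical helper that renders a single cold-start sample in that special-token template and applies a placeholder readability filter. The function name and the filter logic are assumptions for illustration only; DeepSeek's exact filtering rules are not public.

```python
from typing import Optional

def format_cold_start_sample(question: str, reasoning: str, summary: str) -> Optional[str]:
    """Render one cold-start sample in the special-token template described above.
    The readability check is a placeholder, not DeepSeek's actual filter."""
    # Human-prior filter (placeholder): drop empty or unstructured traces.
    if not reasoning.strip() or not summary.strip():
        return None
    return (
        f"{question}\n"
        f"|special_token|{reasoning.strip()}|special_token|{summary.strip()}"
    )
```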
2. Reason-Oriented Reinforcement Learning with GRPO
- Process:
- Fine-Tuning: DeepSeek-V3-Base trained on Cold-Start CoT data.
- RL Training: Large-scale GRPO applied to refine reasoning, combining task-accuracy and language-consistency rewards (a combined-reward sketch follows this list).
- Outcome: Enhanced reasoning capabilities while maintaining structured, readable outputs.
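To make the combined reward concrete, here is a small sketch that mixes a task-accuracy score with a crude language-consistency bonus. The weight and the ASCII-based language check are illustrative assumptions, not values or methods from the paper.

```python
def language_consistency(cot_text: str) -> float:
    """Crude stand-in for a language-identification check: fraction of
    whitespace-separated tokens that are pure ASCII (proxy for English)."""
    words = cot_text.split()
    if not words:
        return 0.0
    return sum(1 for w in words if w.isascii()) / len(words)

def reasoning_stage_reward(accuracy_reward: float, cot_text: str,
                           lang_weight: float = 0.1) -> float:
    """Mix the rule-based accuracy reward with a language-consistency bonus.
    The 0.1 weight is an illustrative assumption, not a value from the paper."""
    return accuracy_reward + lang_weight * language_consistency(cot_text)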
3. Rejection Sampling & Supervised Fine-Tuning (SFT)
- Rejection Sampling: Sample many outputs from the RL checkpoint, rank them, and keep only the high-quality responses to pass into SFT (see the sketch after this list).
- SFT Phase: Unlike the initial cold-start data, which focuses primarily on reasoning, this stage incorporates data from other domains to enhance the model's capabilities in writing, role-playing, and other general-purpose tasks.
- Goal: Improve versatility and user-friendliness across tasks.
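Conceptually, rejection sampling here is "generate many, keep the best." The sketch below assumes hypothetical `generate` and `score` callables standing in for the RL-tuned model and its reward; it is not DeepSeek's actual pipeline code.

```python
def rejection_sample(prompts, generate, score, n_samples: int = 16,
                     threshold: float = 1.0):
    """Generate n_samples completions per prompt from the RL-tuned model,
    keep only the best-scoring one, and drop prompts with no good answer.
    `generate` and `score` are assumed callables, not real APIs."""
    curated = []
    for prompt in prompts:
        candidates = [generate(prompt) for _ in range(n_samples)]
        best = max(candidates, key=lambda c: score(prompt, c))
        if score(prompt, best) >= threshold:
            curated.append({"prompt": prompt, "response": best})
    return curated  # fed into the multi-domain SFT phase
```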
4. Reinforcement Learning for Generalization
- Final RL Tuning: Applied to the SFT model to enhance helpfulness, harmlessness, and reasoning robustness across all scenarios.
Overall, here is what the multi-stage DeepSeek-R1 pipeline looks like:
- Stage 1: SFT on cold-start data to teach readability.
- Stage 2: RL with GRPO to refine reasoning + language rewards.
- Stage 3: Rejection sampling + SFT on multi-domain data (writing, role-play).
- Stage 4: Final RL tuning for generalization.
The synthetic reasoning data generated through rejection sampling was merged with supervised data from the DeepSeek-V3 pipeline, combining high-quality model outputs with diverse domain-specific knowledge for the enhanced fine-tuning stage.
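Putting the four stages together, the pipeline can be summarized as the stub-level sketch below. Every helper is a placeholder for an entire training phase, not a real API.

```python
# Stub standing in for a supervised fine-tuning phase.
def sft(model, dataset):
    return model

# Stub standing in for a GRPO reinforcement-learning phase with a given reward.
def grpo_rl(model, prompts, reward_fn):
    return model

# Stub standing in for rejection sampling over the RL checkpoint's outputs.
def rejection_sample(model, prompts):
    return []

def build_r1(base_model, cold_start_data, reasoning_prompts, general_prompts,
             v3_sft_data, reasoning_reward, preference_reward):
    m1 = sft(base_model, cold_start_data)                   # Stage 1: readability
    m2 = grpo_rl(m1, reasoning_prompts, reasoning_reward)   # Stage 2: reasoning + language rewards
    curated = rejection_sample(m2, reasoning_prompts)       # Stage 3a: curate best RL outputs
    m3 = sft(base_model, curated + v3_sft_data)             # Stage 3b: multi-domain SFT
    return grpo_rl(m3, general_prompts, preference_reward)  # Stage 4: generalization RL
```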
Why two reinforcement learning stages?
The first stage targets reasoning capability; the second captures human preferences (e.g., helpfulness, harmlessness, coherence).
Knowledge Distillation: Scaling to Smaller Models
- Method: Open-source models (Qwen, Llama) were fine-tuned on roughly 800K curated samples generated by DeepSeek-R1 (a minimal distillation sketch follows below).
- Result: Efficient smaller models inherit R1’s reasoning capabilities, democratizing advanced AI.
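In practice this distillation step is plain supervised fine-tuning of a smaller student on R1-generated responses. The sketch below assumes a Hugging Face-style causal LM and tokenizer plus a list of prompt/response dicts, and it simplifies the loss (prompt tokens are not masked); it is an illustration, not DeepSeek's training code.

```python
import torch
from torch.utils.data import DataLoader

def distill(student_model, tokenizer, r1_samples, epochs: int = 2, lr: float = 1e-5):
    """student_model/tokenizer: assumed Hugging Face-style causal LM and tokenizer
    (with a pad token set). r1_samples: list of {"prompt": ..., "response": ...}
    dicts generated by DeepSeek-R1. Simplified: prompt tokens are not masked
    from the loss, unlike a production SFT setup."""
    optimizer = torch.optim.AdamW(student_model.parameters(), lr=lr)
    loader = DataLoader(r1_samples, batch_size=4, shuffle=True, collate_fn=list)
    student_model.train()
    for _ in range(epochs):
        for batch in loader:
            texts = [ex["prompt"] + ex["response"] for ex in batch]
            enc = tokenizer(texts, return_tensors="pt", padding=True, truncation=True)
            # Standard next-token prediction on the teacher-generated responses.
            loss = student_model(**enc, labels=enc["input_ids"]).loss
            loss.backward()
            optimizer.step()
            optimizer.zero_grad()
```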
Conclusion: The DeepSeek-R1 Paradigm
- From Zero to R1: Transitioned from rule-based RL (R1-Zero) to a hybrid framework (Cold-Start + SFT + RL).
- Balancing Strengths: Combines structured reasoning (CoT), human-like readability, and multi-domain adaptability.
- Future-Proof Design: Iterative training and distillation enable scalable, user-centric AI systems.
Layman Explanation: Imagine two chefs learning to cook:
DeepSeek-R1-Zero
- How they learn: This chef never reads recipes or watches cooking shows. Instead, they experiment endlessly in the kitchen, mixing random ingredients, tasting, and adjusting based on what works.
- Strengths: They might invent wildly creative dishes (like “chocolate-covered pickles”) that no one else would think of. They’re great at solving problems in unconventional ways.
- Weaknesses: Their dishes might sometimes be bizarre or miss the mark for human tastes (like forgetting salt exists).
DeepSeek-R1
- How they learn: This chef studies thousands of cookbooks, watches cooking tutorials, and learns from feedback like “this is too spicy” or “add more cheese.”
- Strengths: Their meals align with what humans generally enjoy (like a perfect pizza). They’re reliable for tasks requiring common sense or cultural knowledge.
Terms:
GRPO: Group Relative Policy Optimization
SFT: Supervised Fine-Tuning
CoT: Chain of Thought
RL: Reinforcement Learning
PPO: Proximal Policy Optimization