Beyond OpenAI: How DeepSeek R1 and R1-Zero Are Changing AI
Shift from Supervised Learning to Reinforcement-Learning-Driven Reasoning
- Traditional LLMs (e.g., OpenAI's models): Rely on supervised fine-tuning (SFT) followed by Proximal Policy Optimization (PPO) for alignment.
- DeepSeek-R1-Zero: Pioneers Reinforcement Learning (RL) directly on a base model without SFT, using GRPO (Group Relative Policy Optimization) instead of PPO.
- Breakthrough: Eliminates the dependency on human-labeled reward data by using rule-based rewards that score reasoning accuracy and language consistency (a minimal sketch follows below).
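Because no learned reward model or human preference labels are involved, the reward can be computed with simple rules. Below is a minimal Python sketch of what such a rule-based reward might look like; the tag names and weights are illustrative assumptions, not DeepSeek's actual implementation.

```python
import re

def rule_based_reward(completion: str, reference_answer: str) -> float:
    """Toy rule-based reward: no learned reward model, no human preference labels.
    The tags and weights below are illustrative, not DeepSeek's actual rules."""
    reward = 0.0

    # Format reward: reasoning should sit inside <think>...</think> and the
    # final answer inside <answer>...</answer>.
    if re.search(r"<think>.*?</think>\s*<answer>.*?</answer>", completion, re.DOTALL):
        reward += 0.5

    # Accuracy reward: for verifiable tasks (math, code), compare the extracted
    # answer against the known reference solution.
    match = re.search(r"<answer>(.*?)</answer>", completion, re.DOTALL)
    if match and match.group(1).strip() == reference_answer.strip():
        reward += 1.0

    return reward

# Example usage
print(rule_based_reward("<think>2+2=4</think> <answer>4</answer>", "4"))  # 1.5
```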
You can also watch the full explanation in my YouTube video.
DeepSeek-R1-Zero: Strengths & Challenges
Reinforcement learning is applied directly to the base model, without Supervised Fine-Tuning (SFT), allowing the model to develop reasoning and self-re-evaluation on its own.
Advancements
- Human-Like “Rethinking”: During RL training the model learns to pause, re-evaluate, and revise its own intermediate reasoning, mimicking intuitive self-correction.
- GRPO Optimization: By dropping PPO's separate value (critic) network and scoring each output against a group of sampled completions, GRPO is more scalable and computationally efficient than PPO, enabling faster convergence (see the sketch below).
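The efficiency gain comes from how GRPO estimates advantages: it samples a group of completions per prompt and normalizes each reward against the group's mean and standard deviation, so no critic network is needed. A simplified sketch of that group-relative computation, assuming the rewards are already available:

```python
import numpy as np

def grpo_advantages(group_rewards: np.ndarray) -> np.ndarray:
    """Group-relative advantages: each sampled output is scored against the
    mean/std of its own group, so no separate value (critic) network is needed.
    This is a simplified illustration, not DeepSeek's training code."""
    mean = group_rewards.mean()
    std = group_rewards.std() + 1e-8  # avoid division by zero
    return (group_rewards - mean) / std

# Example: 4 sampled completions for one prompt, scored by a rule-based reward.
rewards = np.array([1.5, 0.5, 0.0, 1.5])
print(grpo_advantages(rewards))  # above-average samples get positive advantages
```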
Key Challenges
- Poor Readability: Outputs lacked formatting (e.g., markdown) and often mixed languages.
- Generalization Limits: Struggled to produce user-friendly CoT (Chain of Thought) and to handle multi-domain tasks.
DeepSeek-R1: Addressing Limitations via Multi-Stage Training
1. Cold-Start CoT Data
- Purpose: Seed the model with high-quality, structured reasoning data to improve readability and reasoning.
- Format: Structured outputs that use a special-token delimiter (e.g., |special_token|<reasoning_process>|special_token|<summary>) to separate the CoT reasoning from a concise summary; a formatting sketch follows this list.
- Human Priors: Non-readable responses were filtered out and summaries were added to enhance clarity.
- Impact: Better performance than R1-Zero, serving as a foundation for iterative RL.
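As a rough illustration, here is a hypothetical helper that renders a single cold-start sample in that special-token template and applies a placeholder readability filter. The function name and the filter logic are assumptions for illustration only; DeepSeek's exact filtering rules are not public.

```python
from typing import Optional

def format_cold_start_sample(question: str, reasoning: str, summary: str) -> Optional[str]:
    """Render one cold-start sample in the special-token template described above.
    The readability check is a placeholder, not DeepSeek's actual filter."""
    # Human-prior filter (placeholder): drop empty or unstructured traces.
    if not reasoning.strip() or not summary.strip():
        return None
    return (
        f"{question}\n"
        f"|special_token|{reasoning.strip()}|special_token|{summary.strip()}"
    )
```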
2. Reason-Oriented Reinforcement Learning with GRPO
- Process:
- Fine-Tuning: DeepSeek-V3-Base trained on Cold-Start CoT data.
- RL Training: Large-scale GRPO applied to refine reasoning, combining task-accuracy and language-consistency rewards (a combined-reward sketch follows this list).
- Outcome: Enhanced reasoning capabilities while maintaining structured, readable outputs.
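To make the combined reward concrete, here is a small sketch that mixes a task-accuracy score with a crude language-consistency bonus. The weight and the ASCII-based language check are illustrative assumptions, not values or methods from the paper.

```python
def language_consistency(cot_text: str) -> float:
    """Crude stand-in for a language-identification check: fraction of
    whitespace-separated tokens that are pure ASCII (proxy for English)."""
    words = cot_text.split()
    if not words:
        return 0.0
    return sum(1 for w in words if w.isascii()) / len(words)

def reasoning_stage_reward(accuracy_reward: float, cot_text: str,
                           lang_weight: float = 0.1) -> float:
    """Mix the rule-based accuracy reward with a language-consistency bonus.
    The 0.1 weight is an illustrative assumption, not a value from the paper."""
    return accuracy_reward + lang_weight * language_consistency(cot_text)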
3. Rejection Sampling & Supervised Fine-Tuning (SFT)
- Rejection Sampling: Sample many outputs from the RL checkpoint, rank them, and keep only the high-quality responses to pass into SFT (see the sketch after this list).
- SFT Phase: Unlike the initial cold-start data, which focuses primarily on reasoning, this stage incorporates data from other domains to enhance the model's capabilities in writing, role-playing, and other general-purpose tasks.
- Goal: Improve versatility and user-friendliness across tasks.
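Conceptually, rejection sampling here is "generate many, keep the best." The sketch below assumes hypothetical `generate` and `score` callables standing in for the RL-tuned model and its reward; it is not DeepSeek's actual pipeline code.

```python
def rejection_sample(prompts, generate, score, n_samples: int = 16,
                     threshold: float = 1.0):
    """Generate n_samples completions per prompt from the RL-tuned model,
    keep only the best-scoring one, and drop prompts with no good answer.
    `generate` and `score` are assumed callables, not real APIs."""
    curated = []
    for prompt in prompts:
        candidates = [generate(prompt) for _ in range(n_samples)]
        best = max(candidates, key=lambda c: score(prompt, c))
        if score(prompt, best) >= threshold:
            curated.append({"prompt": prompt, "response": best})
    return curated  # fed into the multi-domain SFT phase
```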
4. Reinforcement Learning for Generalization
- Final RL Tuning: Applied to the SFT model to enhance helpfulness, harmlessness, and reasoning robustness across all scenarios.
Overall, here is what the multi-stage DeepSeek-R1 pipeline looks like:
- Stage 1: SFT on cold-start data to teach readability.
- Stage 2: RL with GRPO to refine reasoning + language rewards.
- Stage 3: Rejection sampling + SFT on multi-domain data (writing, role-play).
- Stage 4: Final RL tuning for generalization.
The synthetic reasoning data generated through rejection sampling was merged with supervised data from the DeepSeek-V3 pipeline, combining high-quality model outputs with diverse domain-specific knowledge for the enhanced fine-tuning stage.
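Putting the four stages together, the pipeline can be summarized as the stub-level sketch below. Every helper is a placeholder for an entire training phase, not a real API.

```python
# Stub standing in for a supervised fine-tuning phase.
def sft(model, dataset):
    return model

# Stub standing in for a GRPO reinforcement-learning phase with a given reward.
def grpo_rl(model, prompts, reward_fn):
    return model

# Stub standing in for rejection sampling over the RL checkpoint's outputs.
def rejection_sample(model, prompts):
    return []

def build_r1(base_model, cold_start_data, reasoning_prompts, general_prompts,
             v3_sft_data, reasoning_reward, preference_reward):
    m1 = sft(base_model, cold_start_data)                   # Stage 1: readability
    m2 = grpo_rl(m1, reasoning_prompts, reasoning_reward)   # Stage 2: reasoning + language rewards
    curated = rejection_sample(m2, reasoning_prompts)       # Stage 3a: curate best RL outputs
    m3 = sft(base_model, curated + v3_sft_data)             # Stage 3b: multi-domain SFT
    return grpo_rl(m3, general_prompts, preference_reward)  # Stage 4: generalization RL
```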
Why two reinforcement learning stages?
The first stage targets reasoning capability; the second captures human preferences (e.g., helpfulness, harmlessness, coherence).
Knowledge Distillation: Scaling to Smaller Models
- Method: Open-source models (Qwen, Llama) were fine-tuned on roughly 800K curated samples generated by DeepSeek-R1 (a minimal distillation sketch follows below).
- Result: Efficient smaller models inherit R1’s reasoning capabilities, democratizing advanced AI.
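In practice this distillation step is plain supervised fine-tuning of a smaller student on R1-generated responses. The sketch below assumes a Hugging Face-style causal LM and tokenizer plus a list of prompt/response dicts, and it simplifies the loss (prompt tokens are not masked); it is an illustration, not DeepSeek's training code.

```python
import torch
from torch.utils.data import DataLoader

def distill(student_model, tokenizer, r1_samples, epochs: int = 2, lr: float = 1e-5):
    """student_model/tokenizer: assumed Hugging Face-style causal LM and tokenizer
    (with a pad token set). r1_samples: list of {"prompt": ..., "response": ...}
    dicts generated by DeepSeek-R1. Simplified: prompt tokens are not masked
    from the loss, unlike a production SFT setup."""
    optimizer = torch.optim.AdamW(student_model.parameters(), lr=lr)
    loader = DataLoader(r1_samples, batch_size=4, shuffle=True, collate_fn=list)
    student_model.train()
    for _ in range(epochs):
        for batch in loader:
            texts = [ex["prompt"] + ex["response"] for ex in batch]
            enc = tokenizer(texts, return_tensors="pt", padding=True, truncation=True)
            # Standard next-token prediction on the teacher-generated responses.
            loss = student_model(**enc, labels=enc["input_ids"]).loss
            loss.backward()
            optimizer.step()
            optimizer.zero_grad()
```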
Conclusion: The DeepSeek-R1 Paradigm
- From Zero to R1: Transitioned from rule-based RL (R1-Zero) to a hybrid framework (Cold-Start + SFT + RL).
- Balancing Strengths: Combines structured reasoning (CoT), human-like readability, and multi-domain adaptability.
- Future-Proof Design: Iterative training and distillation enable scalable, user-centric AI systems.
Layman Explanation: Imagine two chefs learning to cook:
DeepSeek-R1-Zero
- How they learn: This chef never reads recipes or watches cooking shows. Instead, they experiment endlessly in the kitchen, mixing random ingredients, tasting, and adjusting based on what works.
- Strengths: They might invent wildly creative dishes (like “chocolate-covered pickles”) that no one else would think of. They’re great at solving problems in unconventional ways.
- Weaknesses: Their dishes might sometimes be bizarre or miss the mark for human tastes (like forgetting salt exists).
DeepSeek-R1
- How they learn: This chef studies thousands of cookbooks, watches cooking tutorials, and learns from feedback like “this is too spicy” or “add more cheese.”
- Strengths: Their meals align with what humans generally enjoy (like a perfect pizza). They’re reliable for tasks requiring common sense or cultural knowledge.
Terms:
GRPO: Group Relative Policy Optimization
SFT: Supervised Fine-Tuning
CoT: Chain of Thought
RL: Reinforcement Learning
PPO: Proximal Policy Optimization