Beyond OpenAI: How DeepSeek R1 and R1-Zero Are Changing AI

Chetna Shahi
4 min read · Feb 8, 2025


Shift from Supervised Learning to Reinforcement Learning Driven Reasoning

Figure: OpenAI Language Model Architecture

  • Traditional LLMs (e.g., OpenAI's models): Rely on supervised fine-tuning (SFT) followed by Proximal Policy Optimization (PPO) for alignment.
  • DeepSeek-R1-Zero: Pioneers reinforcement learning (RL) directly on the base model without SFT, using GRPO (Group Relative Policy Optimization) instead of PPO.
  • Breakthrough: Eliminates the dependency on human-labeled reward data via rule-based rewards focused on reasoning accuracy and language consistency.
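To make "rule-based rewards" concrete, here is a minimal sketch of what such a reward function could look like, assuming an accuracy check against a verifiable reference answer plus a simple format check. The tag names, regexes, and weights are illustrative assumptions, not DeepSeek's published implementation.

```python
import re

def rule_based_reward(response: str, reference_answer: str) -> float:
    """Illustrative rule-based reward: no learned reward model, no human labels.

    Combines an accuracy check against a verifiable reference answer with a
    format check that the reasoning is wrapped in <think>...</think> tags.
    Tag names and weights are assumptions for illustration.
    """
    reward = 0.0

    # Format reward: reasoning must appear inside <think> tags.
    if re.search(r"<think>.*?</think>", response, flags=re.DOTALL):
        reward += 0.5

    # Accuracy reward: the final answer (text after the reasoning block)
    # must match the reference, e.g. a checkable math result.
    final_answer = re.sub(r"<think>.*?</think>", "", response, flags=re.DOTALL).strip()
    if final_answer == reference_answer.strip():
        reward += 1.0

    return reward

print(rule_based_reward("<think>2+2=4</think>4", "4"))  # 1.5
```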

You can also watch the full explanation in my YouTube video.

Figure: OpenAI vs DeepSeek R1

DeepSeek-R1-Zero: Strengths & Challenges

DeepSeek-R1-Zero applies reinforcement learning directly to the base model without supervised fine-tuning (SFT), allowing the model to develop reasoning and self-re-evaluation on its own.

Advancements

  1. Human-Like “Rethinking”: The RL-trained model learns to dynamically revisit and revise its own reasoning steps mid-generation, mimicking intuitive human re-evaluation.
  2. GRPO Optimization: Outperforms PPO (used in traditional LLMs) in scalability and computational efficiency, enabling faster convergence.
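GRPO's efficiency comes from dropping PPO's separate value (critic) network: rewards are normalized within a group of responses sampled for the same prompt, and that normalized score serves as the advantage. Below is a simplified sketch of that step; the full GRPO objective also uses a clipped probability ratio and a KL penalty, omitted here.

```python
import numpy as np

def group_relative_advantages(group_rewards: list[float], eps: float = 1e-8) -> np.ndarray:
    """GRPO-style advantage estimate: normalize each sampled response's reward
    by the mean and std of its own group, so no learned critic is needed
    (unlike PPO)."""
    rewards = np.asarray(group_rewards, dtype=np.float64)
    return (rewards - rewards.mean()) / (rewards.std() + eps)

# Example: four responses sampled for the same prompt, scored by a rule-based reward.
print(group_relative_advantages([1.5, 0.5, 0.0, 1.5]))
```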

Key Challenges

  1. Poor Readability: Outputs lacked formatting (e.g., markdown) and often mixed languages.
  2. Generalization Limits: Struggled with user-friendly CoT (Chain of Thought) and multi-domain tasks.
Figure: DeepSeek-R1-Zero

DeepSeek-R1: Addressing Limitations via Multi-Stage Training

1. Cold-Start CoT Data

  • Purpose: Seed the model with high-quality, structured reasoning data to improve readability and reasoning.
  • Format: XML-like structured outputs using |special_token| delimiters to separate the CoT reasoning from a concise summary (illustrated after this list).
  • Human Priors: Filtered non-readable responses and added summaries to enhance clarity.
  • Impact: Better performance than R1-Zero, serving as a foundation for iterative RL.
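Here is a toy example of what a cold-start training sample could look like in this structured format. The |special_token| delimiter follows the template described in the DeepSeek-R1 report, while the content itself is invented for illustration.

```python
# Toy cold-start example: reasoning delimited by special tokens, followed by a
# concise, human-readable summary. The token string and content are illustrative.
COLD_START_TEMPLATE = "|special_token|{reasoning_process}|special_token|{summary}"

example = COLD_START_TEMPLATE.format(
    reasoning_process=(
        "To find 15% of 240, convert 15% to 0.15 and multiply: 0.15 * 240 = 36."
    ),
    summary="15% of 240 is 36.",
)
print(example)
```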

2. Reason-Oriented Reinforcement Learning with GRPO

  • Process:
  1. Fine-Tuning: DeepSeek-V3-Base trained on Cold-Start CoT data.
  2. RL Training: Large-scale GRPO applied to refine reasoning, combining task-accuracy and language-consistency rewards (sketched after this list).
  • Outcome: Enhanced reasoning capabilities while maintaining structured, readable outputs.
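As a rough illustration, the two reward signals could be combined into a single scalar as below. The weighting and the way language consistency is measured are assumptions, not published values.

```python
def combined_reward(is_correct: bool,
                    target_language_ratio: float,
                    consistency_weight: float = 0.1) -> float:
    """Combine a verifiable accuracy signal with a language-consistency term.

    `is_correct` would come from a rule-based answer check (as sketched
    earlier); `target_language_ratio` is the fraction of chain-of-thought
    tokens written in the target language; `consistency_weight` is an
    illustrative value, not a published one.
    """
    accuracy_reward = 1.0 if is_correct else 0.0
    return accuracy_reward + consistency_weight * target_language_ratio

# Example: a correct answer whose reasoning is 90% in the target language.
print(combined_reward(True, 0.9))  # 1.09
```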

3. Rejection Sampling & Supervised Fine-Tuning (SFT)

  • Rejection Sampling: Sample multiple RL-generated outputs per prompt, rank them, and keep only the high-quality responses as SFT data (see the sketch after this list).
  • SFT Phase: Unlike the initial cold-start data, which focuses primarily on reasoning, this stage incorporates data from other domains to enhance the model's capabilities in writing, role-playing, and other general-purpose tasks.
  • Goal: Improve versatility and user-friendliness across tasks.
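A minimal sketch of the rejection-sampling idea: sample several candidates per prompt from the RL checkpoint, score them, and keep only the top-ranked ones as SFT data. The sample count, scoring interface, and record format here are illustrative assumptions.

```python
from typing import Callable

def rejection_sample(prompt: str,
                     generate: Callable[[str], str],
                     score: Callable[[str, str], float],
                     num_samples: int = 8,
                     keep_top_k: int = 1) -> list[dict]:
    """Sample several candidate responses for one prompt, rank them by score,
    and keep the best as supervised fine-tuning examples. All parameters are
    illustrative, not DeepSeek's actual pipeline."""
    candidates = [generate(prompt) for _ in range(num_samples)]
    ranked = sorted(candidates, key=lambda c: score(prompt, c), reverse=True)
    return [{"prompt": prompt, "response": c} for c in ranked[:keep_top_k]]

# Example with dummy generate/score functions:
print(rejection_sample("Solve 2 + 2", lambda p: "4", lambda p, c: 1.0, num_samples=2))
```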

4. Reinforcement Learning for Generalization

  • Final RL Tuning: Applied to the SFT model to enhance helpfulness, harmlessness, and reasoning robustness across all scenarios.
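The DeepSeek-R1 report describes this stage as mixing rule-based rewards for verifiable reasoning prompts with reward-model scores for general prompts. A hedged sketch of that routing follows; the dispatch logic is an illustrative assumption, not training code.

```python
def final_stage_reward(prompt_type: str,
                       rule_based_score: float,
                       preference_model_score: float) -> float:
    """Route the reward by prompt type in the final RL stage: verifiable
    reasoning prompts keep rule-based scoring, while open-ended prompts use
    a learned preference (helpfulness/harmlessness) score. The routing shown
    here is an illustrative assumption."""
    if prompt_type == "reasoning":
        return rule_based_score
    return preference_model_score

# Example: an open-ended writing prompt falls back to the preference score.
print(final_stage_reward("writing", rule_based_score=0.0, preference_model_score=0.8))
```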

Overall, here is what the multi-stage DeepSeek-R1 pipeline looks like:

  • Stage 1: SFT on cold-start data to teach readability.
  • Stage 2: RL with GRPO to refine reasoning + language rewards.
  • Stage 3: Rejection sampling + SFT on multi-domain data (writing, role-play).
  • Stage 4: Final RL tuning for generalization.
Figure: DeepSeek-R1

The synthetic reasoning data generated through rejection sampling was merged with supervised data from the DeepSeek-V3 pipeline, combining high-quality model outputs with diverse domain-specific knowledge for the second fine-tuning pass.

Why two reinforcement learning stages?

The first stage builds reasoning capability; the second captures human preferences (e.g., helpfulness, harmlessness, coherence).

Knowledge Distillation: Scaling to Smaller Models

  • Method: Fine-tuned open-source models (Qwen, Llama) on 800K curated samples generated by the DeepSeek-R1 model (a minimal sketch follows this list).
  • Result: Efficient smaller models inherit R1’s reasoning capabilities, democratizing advanced AI.
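Conceptually, distillation here is just supervised fine-tuning of a smaller student on teacher-generated responses. A minimal sketch of building such a dataset is below; `teacher_generate` stands in for a call to DeepSeek-R1 and is a placeholder, not a real API.

```python
def build_distillation_dataset(prompts: list[str], teacher_generate) -> list[dict]:
    """Collect (prompt, response) pairs by letting the teacher (DeepSeek-R1)
    answer each prompt; a smaller student (e.g. a Qwen or Llama base model)
    is then fine-tuned on these pairs with ordinary supervised learning.
    `teacher_generate` is a placeholder callable, not a real API."""
    return [{"prompt": p, "response": teacher_generate(p)} for p in prompts]

# Example with a dummy teacher:
dataset = build_distillation_dataset(["What is 2 + 2?"], lambda p: "<think>2+2=4</think>4")
print(dataset)
```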

Conclusion: The DeepSeek-R1 Paradigm

  1. From Zero to R1: Transitioned from rule-based RL (R1-Zero) to a hybrid framework (Cold-Start + SFT + RL).
  2. Balancing Strengths: Combines structured reasoning (CoT), human-like readability, and multi-domain adaptability.
  3. Future-Proof Design: Iterative training and distillation enable scalable, user-centric AI systems.

Layman's Explanation: Imagine two chefs learning to cook:

DeepSeek-R1-Zero

  • How they learn: This chef never reads recipes or watches cooking shows. Instead, they experiment endlessly in the kitchen, mixing random ingredients, tasting, and adjusting based on what works.
  • Strengths: They might invent wildly creative dishes (like “chocolate-covered pickles”) that no one else would think of. They’re great at solving problems in unconventional ways.
  • Weaknesses: Their dishes might sometimes be bizarre or miss the mark for human tastes (like forgetting salt exists).

DeepSeek-R1

  • How they learn: This chef studies thousands of cookbooks, watches cooking tutorials, and learns from feedback like “this is too spicy” or “add more cheese.”
  • Strengths: Their meals align with what humans generally enjoy (like a perfect pizza). They’re reliable for tasks requiring common sense or cultural knowledge.

Terms:

GRPO: Group Relative Policy Optimization

SFT: Supervised Fine-Tuning

CoT: Chain of Thought

RL: Reinforcement Learning

PPO: Proximal Policy Optimization
