The GRPO Family in 5 Minutes

In the past 18 months, the GRPO family has multiplied: DAPO, Dr.GRPO, RLOO, GSPO, GiGPO. Each is a 1–2 line patch to GRPO, and each attacks a different facet of the same underlying problem: the variance of importance-sampling estimates. Here’s the field guide I wish I had.

The substrate everyone shares

Every method in this family inherits PPO’s clipped objective:

$$L^{\text{CLIP}} = \mathbb{E}\!\left[\min\!\left(\rho_t A_t,\ \text{clip}(\rho_t, 1 \pm \epsilon)\, A_t\right)\right]$$

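Two quantities are free to vary across variants:

  1. The importance ratio $\rho_t = \pi_\theta(o_t|s_t) / \pi_{\theta_\text{old}}(o_t|s_t)$: whether it is computed per token or per sequence, and how it is clipped.
  2. The advantage $A_t$: how it is estimated and normalized.

The whole family is engineering trade-offs on these two axes.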
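To make the shared objective concrete, here is a minimal pure-Python sketch of the clipped surrogate for a batch of samples. The function name `clipped_surrogate` and the default `eps=0.2` are illustrative assumptions, not taken from any particular library.

```python
def clipped_surrogate(ratios, advantages, eps=0.2):
    """Mean of min(rho * A, clip(rho, 1 - eps, 1 + eps) * A) over samples.

    ratios:     importance ratios rho_t (new policy prob / old policy prob)
    advantages: advantage estimates A_t, one per ratio
    eps:        clip range (assumed default, a common choice)
    """
    if len(ratios) != len(advantages) or not ratios:
        raise ValueError("ratios and advantages must be non-empty and equal length")
    total = 0.0
    for rho, adv in zip(ratios, advantages):
        clipped = max(1.0 - eps, min(rho, 1.0 + eps))
        # Taking the min makes the objective pessimistic: a sample cannot
        # gain more by pushing rho further outside the clip range.
        total += min(rho * adv, clipped * adv)
    return total / len(ratios)


# rho = 1.5 with A > 0 is capped at 1.2 * A; rho = 0.5 with A < 0 at 0.8 * A.
value = clipped_surrogate([1.5, 0.5], [1.0, -1.0])
```

Every variant below changes either how `ratios` is defined and clipped, or how `advantages` is produced.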

Family tree 家族谱系

PPO  (4 networks: policy / value / ref / RM)
 └─ GRPO    replace V(s) with intra-group mean/std  →  3 networks
     │
     ├─ Dr.GRPO    drop  /|o_i|  and  /std         →  unbiased
     ├─ DAPO       clip-higher + dynamic sampling
     │             + token-level loss + overlong shaping
     ├─ RLOO       drop clip; LOO baseline only    →  plain REINFORCE
     ├─ GSPO       token-level ratio  →  sequence-level
     └─ GiGPO      + step-level anchor groups       (multi-turn agents)
            └─ ESSA  error fingerprint              (high-entropy states)

Each variant in one paragraph

GRPO (DeepSeekMath, 2024)

The "drop the critic" move. Replace the learned value function $V(s)$ with the intra-group reward mean and std: $A_i = (R_i - \bar{R}) / \sigma_R$, where $\bar{R}$ and $\sigma_R$ are computed over the $N$ rollouts sampled for the same prompt. This saves the memory of a critic that is itself a full LLM copy. The whole point of group-based RL is that for sparse, sequence-level rewards (math/code verifiers), a learned critic is hard to train anyway.
Use when: standard RLHF or single-turn reasoning, group size $N \geq 4$.
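The group-normalized advantage can be sketched in a few lines. This is a minimal illustration, assuming the population standard deviation and a small stabilizer `eps` in the denominator; both are choices of this sketch, and real implementations may differ.

```python
from statistics import mean, pstdev


def grpo_advantages(rewards, eps=1e-8):
    """A_i = (R_i - mean(R)) / (std(R) + eps) within one prompt's group.

    rewards: scalar rewards for the N rollouts of a single prompt
    eps:     assumed stabilizer so an all-equal group yields zeros, not NaN
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; an assumption of this sketch
    return [(r - mu) / (sigma + eps) for r in rewards]


# Two correct and two incorrect rollouts under a binary verifier reward.
advs = grpo_advantages([1.0, 0.0, 0.0, 1.0])
```

Correct rollouts end up with an advantage near +1 and incorrect ones near -1, with no critic involved.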

Dr.GRPO (2025)

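Two unbiased fixes you should apply by default:

  1. Drop the per-response length normalization $1/|o_i|$ in the loss. Dividing by length down-weights each token of a long response, which under-penalizes long wrong answers.
  2. Drop the division by $\sigma_R$ in the advantage, so $A_i = R_i - \bar{R}$. Dividing by the std over-weights prompts whose rewards barely vary (very easy or very hard ones).

Net change: ~5 lines of code. Use when: always.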
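As a rough sketch of the two changes named in the family tree (drop `/|o_i|` and drop `/std`), the functions below contrast the two loss aggregations and show the mean-centered advantage. The constant normalizer `max_len` is an assumption of this sketch, standing in for whatever fixed constant an implementation picks.

```python
from statistics import mean


def dr_grpo_advantages(rewards):
    """Mean-centered advantage with no division by the group std."""
    mu = mean(rewards)
    return [r - mu for r in rewards]


def grpo_loss(token_losses):
    """GRPO-style: average tokens within each response, then across responses."""
    per_response = [sum(toks) / len(toks) for toks in token_losses]
    return sum(per_response) / len(per_response)


def dr_grpo_loss(token_losses, max_len):
    """Dr.GRPO-style: sum tokens, divide by a constant (assumed: max_len)."""
    per_response = [sum(toks) / max_len for toks in token_losses]
    return sum(per_response) / len(per_response)


# Same per-token loss everywhere; responses of length 2 and 4.
losses = [[1.0, 1.0], [1.0, 1.0, 1.0, 1.0]]
with_length_norm = grpo_loss(losses)                 # each response counts equally
without_length_norm = dr_grpo_loss(losses, max_len=4)  # each token counts equally
```

With the constant normalizer, every token carries the same weight regardless of which response it belongs to.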

DAPO (ByteDance, 2025)

Four orthogonal tricks bundled together:

  1. Clip-higher: asymmetric clip with $\epsilon_\text{high} > \epsilon_\text{low}$ (e.g. 0.28 vs 0.2). Prevents entropy collapse by letting low-probability exploration tokens get amplified.
  2. Dynamic sampling: skip groups where all rollouts got the same reward ($\sigma=0$). They contribute zero gradient and only waste compute.
  3. Token-level loss: average the per-token loss over every token in the group instead of taking a per-sample mean first. Fixes the same length bias Dr.GRPO targets, from a different angle.
  4. Overlong reward shaping: soft penalty for truncated responses; don’t pretend a hard cap is a real reward signal.

Use when: training is unstable, entropy is collapsing, or $\sigma=0$ groups dominate.
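The first three tricks are small enough to sketch directly. The helper names below are hypothetical, and the defaults 0.2 and 0.28 are the example values quoted above; overlong reward shaping is omitted because its exact penalty schedule is not given here.

```python
def clip_higher_term(rho, adv, eps_low=0.2, eps_high=0.28):
    """Clipped surrogate term with an asymmetric range [1 - eps_low, 1 + eps_high]."""
    clipped = max(1.0 - eps_low, min(rho, 1.0 + eps_high))
    return min(rho * adv, clipped * adv)


def keep_group(rewards):
    """Dynamic sampling filter: keep a group only if its rewards are not all equal."""
    return len(set(rewards)) > 1


def token_level_mean(token_terms):
    """Average over every token in the group, not per-response means."""
    flat = [x for response in token_terms for x in response]
    return sum(flat) / len(flat)


# rho = 1.25 would be cut to 1.2 by a symmetric eps = 0.2 clip,
# but survives unchanged under clip-higher.
term = clip_higher_term(1.25, 1.0)

# All-correct and all-wrong groups are dropped before the update.
groups = [[1, 1, 1, 1], [1, 0, 0, 1], [0, 0, 0, 0]]
kept = [g for g in groups if keep_group(g)]
```

In a full training loop the dropped groups would typically be replaced by freshly sampled prompts so the batch stays full; that resampling step is left out of this sketch.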

RLOO (Cohere, 2024)

Cuts even further: drop the clip itself and use REINFORCE with a leave-one-out baseline, where each sample’s baseline is the mean reward of the other samples in its group. The empirical observation: in RLHF the clip almost never fires ($< 5\%$ of updates), so why pay for the complexity?
Use when: short-response RLHF, flat reward distribution, want maximum simplicity. Less reliable on long-CoT.
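The leave-one-out baseline is easy to state in code. This is a minimal sketch with a hypothetical function name; it only computes the advantages and leaves the REINFORCE gradient step to the training framework.

```python
def rloo_advantages(rewards):
    """A_i = R_i - mean of the other N - 1 rewards in the same group."""
    n = len(rewards)
    if n < 2:
        raise ValueError("leave-one-out needs at least two samples per prompt")
    total = sum(rewards)
    # (total - r) / (n - 1) is the mean reward of every sample except this one,
    # so each sample's baseline never depends on its own reward.
    return [r - (total - r) / (n - 1) for r in rewards]


# One success among four rollouts: the success gets +1.0,
# each failure gets about -0.33.
advs = rloo_advantages([1.0, 0.0, 0.0, 0.0])
```

Because the baseline for each sample excludes that sample, the estimator stays unbiased without a learned critic.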

GSPO (Qwen, 2025)

Move the importance ratio from token-level to sequence-level:

$$s_i(\theta) = \left(\prod_{t=1}^{|o_i|} \frac{\pi_\theta(o_t|s_t)}{\pi_{\theta_\text{old}}(o_t|s_t)}\right)^{\!1/|o_i|}$$

The intuition is small but deep: the optimization unit should match the reward unit. When reward is given at the sequence level, correcting off-policy drift at the token level introduces variance that doesn’t correspond to any real signal. The $1/|o_i|$ exponent makes $s_i$ the geometric mean of the token ratios, so its scale does not blow up with length. GSPO is also stable for MoE models, where per-token routing varies between policy snapshots and inflates the variance of token-level ratios.
Use when: long sequences ($>$ 1K tokens), MoE training, anywhere token-level ratio variance is hurting you.
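The sequence-level ratio is usually computed in log space to avoid underflow on long responses. The sketch below assumes per-token log-probabilities are already available as two equal-length lists; the function name is hypothetical.

```python
import math


def gspo_sequence_ratio(logp_new, logp_old):
    """Geometric mean of token ratios: exp(mean_t(logp_new_t - logp_old_t))."""
    if len(logp_new) != len(logp_old) or not logp_new:
        raise ValueError("need two non-empty log-prob lists of equal length")
    mean_log_ratio = sum(n - o for n, o in zip(logp_new, logp_old)) / len(logp_new)
    return math.exp(mean_log_ratio)


# Token ratios of 2.0 and 0.5 would both hit a token-level clip,
# yet the sequence-level ratio is 1: the two drifts cancel out.
new = [math.log(0.5), math.log(0.5)]
old = [math.log(0.25), math.log(1.0)]
ratio = gspo_sequence_ratio(new, old)
```

One ratio per response then goes through the same clipped objective, so a whole sequence is either kept or clipped as a unit.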

GiGPO (NeurIPS 2025)

For multi-turn agents, trajectory-level advantage is too coarse: it can’t distinguish between "good early action, bad late action" and the reverse. GiGPO adds a second, finer group: for actions taken from the same anchor state across different rollouts, compute an extra step-level advantage. Total advantage is the sum of the trajectory-level and step-level terms.

The trick works when state is revisitable (the same web page, the same room in a navigation task). It degenerates to GRPO when the state space is high-entropy, because anchor states rarely repeat across rollouts and the step-level groups collapse to a single member. That is exactly the gap my Compiler-as-Reward work addresses with error-fingerprint anchor states (ESSA).
Use when: multi-turn agent RL with revisitable states.
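The two-level grouping can be sketched as follows. This is a simplified illustration, not the paper's exact estimator: it assumes states are hashable keys, that a per-step return is already computed, that both levels use mean-centering only, and that the two terms are added with a hypothetical weight `omega` (1.0 reproduces the plain sum described above).

```python
from collections import defaultdict
from statistics import mean


def gigpo_advantages(trajectories, episode_rewards, omega=1.0):
    """Per-step advantage = trajectory-level term + omega * step-level term.

    trajectories[i]:    list of (state_key, step_return) pairs for rollout i
    episode_rewards[i]: final scalar reward of rollout i
    """
    ep_mean = mean(episode_rewards)
    episode_adv = [r - ep_mean for r in episode_rewards]

    # Anchor groups: pool step returns from every rollout by the state they left.
    groups = defaultdict(list)
    for traj in trajectories:
        for state, ret in traj:
            groups[state].append(ret)

    result = []
    for i, traj in enumerate(trajectories):
        row = []
        for state, ret in traj:
            # A state seen only once has step_adv == 0, so the step falls
            # back to the trajectory-level advantage alone.
            step_adv = ret - mean(groups[state])
            row.append(episode_adv[i] + omega * step_adv)
        result.append(row)
    return result


# Both rollouts start in "s0" and then diverge to states nobody else visits.
trajs = [[("s0", 1.0), ("s1", 1.0)], [("s0", 0.0), ("s2", 0.0)]]
print(gigpo_advantages(trajs, [1.0, 0.0]))  # [[1.0, 0.5], [-1.0, -0.5]]
```

Only the shared state `s0` gets an extra step-level signal; the unshared states receive the trajectory-level advantage unchanged, which is the degenerate case described above.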

Picking one (cheat sheet)

| Scenario | Recommended |
| --- | --- |
| Math / code, single-turn | GRPO + Dr.GRPO + DAPO |
| Short-response RLHF | RLOO + Dr.GRPO |
| Long-CoT reasoning | GRPO + GSPO + Dr.GRPO |
| MoE training | GSPO |
| Multi-turn agent (revisitable states) | GiGPO + Dr.GRPO |
| Multi-turn agent (high-entropy states) | GiGPO + ESSA-style fingerprint |
| Current SOTA stack | GSPO-token + GiGPO + Dr.GRPO |

The bigger insight

Why this proliferation isn’t random
All of these patches are pulling on a single lever: reducing the variance of importance sampling estimation. PPO’s clip is one variance control. Dr.GRPO removes systematic biases that masquerade as variance. DAPO controls variance via asymmetric clip plus sample filtering. GSPO reduces variance at the source by switching the optimization unit. GiGPO doesn’t touch IS at all; instead it adds a second, finer advantage signal that lets each unit of IS correction do more work.

Once you see this, you can almost predict the next three variants. Someone will redefine the optimization unit again (action-level? thought-level?). Someone will tune the asymmetric clip further (per-token entropy-aware?). Someone will subdivide credit assignment one more level (sub-action? sub-thought?). It’s the same equation each time, with one more knob attached.

What to read next