In the past 18 months, the GRPO family has multiplied: DAPO, Dr.GRPO, RLOO, GSPO, GiGPO. Each is a 1–2 line patch to GRPO, and each attacks a different facet of the same underlying problem: importance sampling variance. Here’s the field guide I wish I had.
The substrate everyone shares
Every method in this family inherits PPO’s clipped objective:
$$L^{\text{CLIP}} = \mathbb{E}\!\left[\min\!\left(\rho_t A_t,\ \text{clip}(\rho_t, 1 \pm \epsilon)\, A_t\right)\right]$$
Two quantities are free to vary across variants:
- The advantage $A_t$: how do you estimate "how much better than baseline"?
- The importance ratio $\rho_t$: at what unit (token or sequence) do you correct for off-policy drift?
The whole family is a set of engineering trade-offs along these two axes.
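To make the shared objective concrete, here is a minimal per-token sketch of the clipped surrogate in plain Python. The function name `clipped_surrogate` and the default `eps=0.2` are my own illustrative choices, not taken from any particular codebase; a real trainer would compute this over batched tensors.

```python
import math


def clipped_surrogate(logp_new, logp_old, advantage, eps=0.2):
    """Per-token PPO surrogate: min(rho * A, clip(rho, 1 - eps, 1 + eps) * A).

    Illustrative sketch only: scalar inputs, hypothetical function name.
    """
    rho = math.exp(logp_new - logp_old)                # importance ratio
    rho_clipped = max(1.0 - eps, min(1.0 + eps, rho))  # clip to [1 - eps, 1 + eps]
    return min(rho * advantage, rho_clipped * advantage)


# A ratio of about 1.5 with a positive advantage is capped at 1 + eps.
capped = clipped_surrogate(math.log(0.3), math.log(0.2), advantage=1.0)

# With a negative advantage the same ratio is not capped (the min picks rho * A).
uncapped = clipped_surrogate(math.log(0.3), math.log(0.2), advantage=-1.0)

print(capped, uncapped)
```

Every variant below changes either how `advantage` is produced or how `rho` is defined; the `min(...)` skeleton stays the same (except in RLOO, which drops it).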
Family tree
PPO (4 networks: policy / value / ref / RM)
└─ GRPO replace V(s) with intra-group mean/std → 3 networks
│
├─ Dr.GRPO drop /|o_i| and /std → unbiased
├─ DAPO clip-higher + dynamic sampling
│ + token-level loss + overlong shaping
├─ RLOO drop clip; LOO baseline only → REINFORCE++
├─ GSPO token-level ratio → sequence-level
└─ GiGPO + step-level anchor groups (multi-turn agents)
└─ ESSA error fingerprint (high-entropy states)
Each variant in one paragraph
GRPO (DeepSeekMath, 2024)
The "drop the critic" move. Replace the learned value function $V(s)$ with the
intra-group reward mean and std: $A_i = (R_i - \bar{R}) / \sigma_R$. This saves
the memory of a critic roughly the size of the policy LLM. The whole point of
group-based RL is that for sparse, sequence-level rewards (math/code verifiers),
a learned critic is hard to train anyway.
Use when: standard RLHF or single-turn reasoning, group size $N \geq 4$.
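The advantage formula above can be sketched in a few lines. This is a minimal illustration, not a reference implementation: the function name, the small `eps` added to the denominator, and the use of the population std (rather than the sample std) are all assumptions on my part.

```python
from statistics import mean, pstdev


def grpo_advantages(rewards, eps=1e-6):
    """Group-relative advantages: A_i = (R_i - mean(R)) / (std(R) + eps).

    Assumptions: population std, and a small eps to avoid dividing by zero
    when every rollout in the group gets the same reward.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# One prompt, four rollouts scored by a binary verifier.
advantages = grpo_advantages([1.0, 0.0, 0.0, 1.0])
print(advantages)

# A group where every rollout agrees produces all-zero advantages.
print(grpo_advantages([1.0, 1.0, 1.0]))
```

Every token of response $i$ receives the same $A_i$, which is why no value network is needed.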
Dr.GRPO (2025)
Two bias fixes you should apply by default:
- Drop the `1/|o_i|` token weighting. Dividing each response’s loss by its own length shrinks the per-token penalty on long wrong answers, which silently biases the policy toward longer responses.
- Drop the `/std` normalization. Dividing by the group std up-weights prompts whose rewards barely vary, that is, prompts that are nearly always failed or nearly always solved.
Net change: ~5 lines of code. Use when: always.
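A rough sketch of both fixes, side by side with the GRPO behavior they replace. The function names are hypothetical, and dividing by a fixed `max_len` constant instead of the response length is my assumption about how the length normalization is removed; check the actual implementation you are using.

```python
from statistics import mean


def dr_grpo_advantages(rewards):
    """Mean-centered advantages with no division by the group std."""
    mu = mean(rewards)
    return [r - mu for r in rewards]


def aggregate_loss(token_losses, max_len, length_normalize=False):
    """Aggregate per-token losses into one scalar for the group.

    token_losses: one list of per-token losses per response.
    length_normalize=True  -> GRPO style, divide by each response's own length.
    length_normalize=False -> Dr.GRPO style, divide by a constant (assumed max_len).
    """
    per_response = []
    for losses in token_losses:
        denom = len(losses) if length_normalize else max_len
        per_response.append(sum(losses) / denom)
    return mean(per_response)


print(dr_grpo_advantages([1.0, 0.0, 0.0, 1.0]))

# Two responses with the same per-token loss but different lengths.
losses = [[1.0, 1.0], [1.0, 1.0, 1.0, 1.0]]
print(aggregate_loss(losses, max_len=4, length_normalize=True))   # lengths cancel out
print(aggregate_loss(losses, max_len=4, length_normalize=False))  # longer response counts more
```

With the constant divisor, a response’s contribution scales with the number of tokens it actually contains, so length no longer dilutes the gradient.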
DAPO (ByteDance, 2025)
Four orthogonal tricks bundled together:
- Clip-higher: asymmetric clip with $\epsilon_\text{high} > \epsilon_\text{low}$ (e.g. 0.28 vs 0.2). Prevents entropy collapse by letting low-probability exploration tokens get amplified.
- Dynamic sampling: skip groups where all rollouts got the same reward ($\sigma=0$) and keep sampling until the batch is full. Such groups contribute zero gradient and only waste compute.
- Token-level loss: average the per-token loss over every token in the group instead of taking a per-sample mean first. Fixes the same length bias Dr.GRPO targets, from a different angle.
- Overlong reward shaping: apply a soft penalty to truncated responses instead of treating a hard length cap as a real reward signal.
Use when: training is unstable, entropy is collapsing, or $\sigma=0$ groups dominate.
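Three of the four tricks fit in a few lines each. The sketch below is illustrative only: the function names are mine, the defaults mirror the 0.2 / 0.28 example above, and overlong shaping is left out because its exact penalty schedule is not described here.

```python
import math


def clip_higher_surrogate(logp_new, logp_old, advantage,
                          eps_low=0.2, eps_high=0.28):
    """Clipped surrogate with an asymmetric range [1 - eps_low, 1 + eps_high]."""
    rho = math.exp(logp_new - logp_old)
    rho_clipped = max(1.0 - eps_low, min(1.0 + eps_high, rho))
    return min(rho * advantage, rho_clipped * advantage)


def keep_group(rewards):
    """Dynamic-sampling filter: keep a group only if its rewards differ."""
    return max(rewards) > min(rewards)


def token_level_mean(token_losses):
    """Average over every token in the group, not over per-response means."""
    total_tokens = sum(len(losses) for losses in token_losses)
    return sum(sum(losses) for losses in token_losses) / total_tokens


# The upper bound is now 1.28 instead of 1.2.
print(clip_higher_surrogate(math.log(0.3), math.log(0.2), advantage=1.0))

# All-correct groups are dropped; mixed groups are kept.
print(keep_group([1.0, 1.0, 1.0]), keep_group([1.0, 0.0, 1.0]))

# The 4-token response dominates the 2-token one in the average.
print(token_level_mean([[1.0, 1.0], [4.0, 4.0, 4.0, 4.0]]))
```

In a training loop, `keep_group` would sit between rollout collection and the loss computation, with more prompts sampled until enough groups pass the filter.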
RLOO (Cohere, 2024)
Cut even further: drop the clip itself and use REINFORCE with a leave-one-out
baseline, where each sample’s baseline is the mean reward of the other samples
in its group. The empirical observation: in RLHF the clip almost never fires
($< 5\%$ of updates), so why pay for the complexity?
Use when: short-response RLHF, flat reward distribution, and you want maximum simplicity. Less reliable on long-CoT.
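The leave-one-out baseline is easy to state in code. A minimal sketch, with a hypothetical function name:

```python
def rloo_advantages(rewards):
    """Leave-one-out advantages: each reward minus the mean of the others."""
    n = len(rewards)
    if n < 2:
        raise ValueError("leave-one-out needs at least two samples per prompt")
    total = sum(rewards)
    return [r - (total - r) / (n - 1) for r in rewards]


# Four rollouts for one prompt; the baseline for each excludes its own reward.
print(rloo_advantages([1.0, 0.0, 0.0, 1.0]))
```

Because a sample’s own reward never enters its baseline, the baseline is independent of that sample, which is what keeps the plain REINFORCE gradient unbiased.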
GSPO (Qwen, 2025)
Move the importance ratio from token-level to sequence-level:
$$s_i(\theta) = \left(\prod_{t=1}^{|o_i|} \frac{\pi_\theta(o_{i,t} \mid s_{i,t})}{\pi_{\theta_\text{old}}(o_{i,t} \mid s_{i,t})}\right)^{\!1/|o_i|}$$
Here $o_{i,t}$ is the $t$-th token of response $o_i$ and $s_{i,t}$ is the context it was sampled from.
The intuition is small but deep: the optimization unit should match the reward unit.
When reward is given at the sequence level, correcting off-policy drift at the
token level introduces variance that doesn’t correspond to any real signal.
GSPO is also more stable for MoE models, where per-token expert routing shifts
between policy snapshots and inflates token-level ratios.
Use when: long sequences ($>$ 1K tokens), MoE training, anywhere token-level ratio variance is hurting you.
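The sequence-level ratio is a length-normalized geometric mean of the token ratios, which is most safely computed in log space. A minimal sketch with a hypothetical function name, taking per-token log-probabilities as plain lists:

```python
import math


def sequence_ratio(logps_new, logps_old):
    """Length-normalized sequence ratio: exp(mean_t(logp_new_t - logp_old_t))."""
    if not logps_new or len(logps_new) != len(logps_old):
        raise ValueError("need two non-empty log-prob lists of equal length")
    mean_log_ratio = sum(n - o for n, o in zip(logps_new, logps_old)) / len(logps_new)
    return math.exp(mean_log_ratio)


# Token-level ratios of 2.0 and 0.5 would both be clipped on their own,
# but the sequence-level ratio is their geometric mean.
new = [math.log(0.5), math.log(0.2)]
old = [math.log(0.25), math.log(0.4)]
print(sequence_ratio(new, old))
```

The example shows the variance argument in miniature: two large, opposite token-level swings cancel out once the ratio is defined on the unit that actually received the reward.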
GiGPO (NeurIPS 2025)
For multi-turn agents, trajectory-level advantage is too coarse: it can’t distinguish between "good early action, bad late action" and the reverse. GiGPO adds a second, finer group: for actions taken from the same anchor state across different rollouts, compute an extra step-level advantage. The total advantage is the sum of the trajectory-level and step-level terms.
The trick works when states are revisitable (the same web page, the same room
in a navigation task). It degenerates to GRPO when the state space is
high-entropy, because anchor groups then rarely contain more than one action.
That is exactly the gap my Compiler-as-Reward work addresses with
error-fingerprint anchor states (ESSA).
Use when: multi-turn agent RL with revisitable states.
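Here is a rough sketch of the anchor-state grouping idea. Everything about the data layout is an assumption for illustration: steps are `(rollout_id, step_idx, state_key, step_return)` tuples, `state_key` is any hashable identifier for a state, the step-level advantage is mean-centered only, and `omega` is a hypothetical weight on the step-level term (1.0 gives the plain sum).

```python
from collections import defaultdict


def step_advantages(steps):
    """Step-level advantages within groups of steps that share a state key.

    steps: iterable of (rollout_id, step_idx, state_key, step_return).
    Returns {(rollout_id, step_idx): step_return - group_mean}.
    """
    steps = list(steps)
    groups = defaultdict(list)
    for _, _, state_key, ret in steps:
        groups[state_key].append(ret)

    advantages = {}
    for rollout_id, step_idx, state_key, ret in steps:
        group = groups[state_key]
        advantages[(rollout_id, step_idx)] = ret - sum(group) / len(group)
    return advantages


def combined_advantage(episode_adv, step_adv, omega=1.0):
    """Trajectory-level term plus (assumed) weighted step-level term."""
    return episode_adv + omega * step_adv


steps = [
    (0, 0, "home", 1.0),  # two rollouts pass through the same "home" state
    (1, 0, "home", 0.0),
    (0, 1, "cart", 1.0),  # only one rollout ever reaches "cart"
]
adv = step_advantages(steps)
print(adv)
print(combined_advantage(episode_adv=0.5, step_adv=adv[(0, 0)]))
```

The singleton `"cart"` group gets a step-level advantage of zero, so only the trajectory-level term survives. That is the degenerate case described above: when states are rarely revisited, most groups look like `"cart"`.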
Picking one (cheat sheet)
| Scenario | Recommended |
|---|---|
| Math / code, single-turn | GRPO + Dr.GRPO + DAPO |
| Short-response RLHF | RLOO + Dr.GRPO |
| Long-CoT reasoning | GRPO + GSPO + Dr.GRPO |
| MoE training | GSPO |
| Multi-turn agent (revisitable states) | GiGPO + Dr.GRPO |
| Multi-turn agent (high-entropy states) | GiGPO + ESSA-style fingerprint |
| Current SOTA stack | GSPO-token + GiGPO + Dr.GRPO |
The bigger insight
Once you see the family as one equation with two knobs, you can almost predict the next three variants. Someone will redefine the optimization unit again (action-level? thought-level?). Someone will tune the asymmetric clip further (per-token entropy-aware?). Someone will subdivide credit assignment one more level (sub-action? sub-thought?). It’s the same equation each time, with one more knob attached.
What to read next
- For source code: my open-source mlx-agent-rl implements four critic-free variants (GRPO / Dr.GRPO / DAPO / GiGPO) natively on Apple Silicon.
- For the long-horizon agent case: my work on Compiler-as-Reward extends GiGPO with process-level reward signals from compilers.