Writing — Junhao Fu

The GRPO Family in 5 Minutes: GRPO, Dr.GRPO, DAPO, RLOO, GSPO, GiGPO

May 25, 2026 · 7 min · Agent RL · algorithm

A field guide to GRPO and its variants. Each is a one-line patch to the same equation, targeting a different facet of importance-sampling variance.

Compiler-as-Reward: Process Feedback for Coding Agent RL

Planned · Research · paper

The intuition behind Compiler-OPD and Error-Branch: why outcome reward collapses on long-horizon coding tasks, and what process feedback fixes.

Building mlx-agent-rl: Multi-turn Agent RL on Apple Silicon

Planned · Open Source · infra

Why I built mlx-agent-rl, the design decisions behind rollout, memory, and advantage computation, and lessons from porting GRPO-family algorithms to MLX.

Posts are being written. Follow on GitHub for now.