Technical notes on Agent RL, LLM post-training, and agent systems.
A field guide to GRPO and its variants. Each variant is a one-line patch to the same equation, targeting a different facet of importance-sampling variance.
The intuition behind Compiler-OPD and Error-Branch: why outcome reward collapses on long-horizon coding tasks, and what process feedback fixes.
Why I built mlx-agent-rl, the design decisions behind rollout / memory / advantage, and lessons from porting GRPO-family algorithms to MLX.
Posts are in progress. For now, follow on GitHub.