Agent RL Researcher · LLM Post-training · RLVR.
Building long-horizon agents that train themselves.
I research algorithms for AI agents end-to-end — from RLVR and Agentic-RL closed-loop training, long-horizon credit assignment, and reward design, to the agent harness engineering that makes those algorithms ship in production: tool use, memory & reflection, RAG / GraphRAG, multi-agent orchestration.
Currently leading three Agent product lines: a Shopify coding agent that cut store delivery from 3-5 weeks to under 1 day, a shopping guide agent built on Graph + Vector hybrid RAG, and an MCP-based smart customer service agent. Earlier: 3 years on multi-modal product governance and multi-agent inspection at e-commerce scale.
Most recent work: Compiler-as-Reward — using process feedback from compilers as RL training signal for coding agents.
Compiler-as-Reward: Process Feedback for Coding Agent RL Training
Proposes Compiler-OPD and Error-Branch, two process reward mechanisms that turn compiler diagnostics into dense RL training signals. It is the only method that maintains a non-zero task success rate at convergence on a 22-task Shopify Horizon coding agent benchmark.
Sparse Black-Box Multimodal Attack for Vision-Language Adversary Generation
A sparse black-box attack method against vision-language models, generating adversarial samples with limited model access while preserving semantic plausibility.
Knowledge-Guided Adversarial Mutation Learning for Psoriasis-Like Listing Detection
Multimodal product recognition robust to adversarial evasion patterns on e-commerce platforms. Deployed in production for Taobao/Tmall risk control.
mlx-agent-rl
Author · Maintainer
Native multi-turn Agent RL training framework for Apple Silicon, built on the MLX
ecosystem. Implements the GRPO family of critic-free algorithms: GRPO, Dr.GRPO, DAPO,
and GiGPO (with its NeurIPS 2025 two-level advantage estimation). Multi-turn rollouts
integrate Policy + Environment + SlidingMemory, and the policy is updated with PPO-clip.
Ships with four built-in tool environments: calculator, search, sql, and web shopping.
Supports 4/8-bit quantization and LoRA.
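The critic-free algorithms listed above share one idea: the baseline comes from a group of rollouts sampled for the same prompt rather than from a learned value network. Below is a minimal sketch of that group-relative advantage, assuming one scalar reward per rollout. The function name, the `eps` constant, and the use of the population standard deviation are illustrative assumptions and are not taken from mlx-agent-rl. The `normalize_std=False` path only mirrors Dr.GRPO's choice to drop the standard-deviation division.

```python
from statistics import mean, pstdev


def group_advantages(rewards, normalize_std=True, eps=1e-6):
    """Group-relative advantages for rollouts sampled from the same prompt.

    Hypothetical helper for illustration only.
    """
    baseline = mean(rewards)                 # group mean replaces a learned critic
    centered = [r - baseline for r in rewards]
    if not normalize_std:                    # Dr.GRPO-style: no std normalization
        return centered
    scale = pstdev(rewards) + eps            # eps avoids division by zero
    return [c / scale for c in centered]


# Four rollouts for one prompt, two of which succeeded.
advantages = group_advantages([1.0, 0.0, 0.0, 1.0])
```

When every rollout in a group receives the same reward, all advantages are zero and the group contributes no gradient. This is one reason dense, verifiable reward signals matter for this family of methods.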
Shopify Store Coding Agent
Architect & Lead · 2024.07–now
End-to-end coding agent that autonomously generates production-grade Shopify themes (Liquid/HTML/CSS/JS). Three-layer Agent Harness: a deterministic layer (Aider-style RepoMap + Tree-Sitter AST + PageRank symbol ranking, plus the Auto Harness self-evolving rule base), an LLM layer (multi-expert agents for clarification, code generation, and theme QA), and runtime control (file-system Plan & Memory persistence, 4-breakpoint Prompt Caching, JSONPatch incremental edits, 50-round iteration with error recovery). On top of this sits an RLVR closed loop in which GRPO trains the policy against four binary rewards (compilation / schema / image_config / lint), with signals coming from a compiler API, the Auto Harness rule engine, and a live theme dev server.
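As a rough illustration of the four binary rewards named above, the sketch below turns the outcome of each check into a 0/1 signal and averages them into one scalar per rollout. The `CheckResults` type and the equal weighting are assumptions made for this example. The description above does not state how the four signals are combined in the actual training loop.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class CheckResults:
    """Hypothetical container for the pass/fail outcome of each verifier."""
    compilation: bool
    schema: bool
    image_config: bool
    lint: bool


def binary_rewards(checks):
    """Map each check to a 0.0 / 1.0 reward, keyed by check name."""
    return {f.name: float(getattr(checks, f.name)) for f in fields(checks)}


def scalar_reward(checks):
    """Equal-weight average of the four binary rewards (an assumption)."""
    rewards = binary_rewards(checks)
    return sum(rewards.values()) / len(rewards)


# A theme that compiles, validates, and lints, but has a bad image config.
reward = scalar_reward(CheckResults(True, True, False, True))  # 0.75
```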
Shopping Guide Agent
Architect & Lead · 2024.07–2025.03
ReAct-driven product recommendation agent over a hybrid Graph + Vector RAG stack. Auto-built knowledge graph: an LLM extracts 30+ relation types from product descriptions, reviews, and Q&A, stored across NebulaGraph (scale) + Neo4j (Cypher querying) + NetworkX (in-memory graph algorithms). Three-way parallel retrieval: Graph RAG (LLM-generated Cypher for multi-hop traversal) + Vector RAG (Milvus, 768-d rerank embeddings) + keyword search, with results deduplicated and then re-ranked by an LLM. A seven-intent ReAct controller dynamically orchestrates ProductRecall, CategoryTool, ComplianceRAG, SpecificationTool, and ProductRecommender. Keeps a 20-turn dialog context and outputs an explainable match_score.
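The three-way retrieval step can be sketched as a parallel fan-out followed by deduplication. The three retriever functions below are stubs returning made-up `(product_id, score)` pairs, standing in for the Graph RAG, Vector RAG, and keyword paths. The final sort by best score is only a placeholder for the LLM re-ranking stage, since scores from different retrievers are not directly comparable in practice.

```python
from concurrent.futures import ThreadPoolExecutor


# Stub retrievers with hypothetical results; real ones would query
# the graph store, the vector store, and a keyword index.
def graph_retrieve(query):
    return [("sku-1", 0.9), ("sku-2", 0.6)]


def vector_retrieve(query):
    return [("sku-2", 0.8), ("sku-3", 0.7)]


def keyword_retrieve(query):
    return [("sku-1", 0.5)]


def hybrid_retrieve(query, retrievers):
    """Run all retrievers concurrently, then keep one entry per product."""
    with ThreadPoolExecutor(max_workers=len(retrievers)) as pool:
        result_lists = list(pool.map(lambda retrieve: retrieve(query), retrievers))

    best = {}
    for results in result_lists:
        for product_id, score in results:
            # Deduplicate by product id, keeping the highest score seen.
            if score > best.get(product_id, float("-inf")):
                best[product_id] = score

    # Placeholder ordering; an LLM re-ranker would replace this sort.
    return sorted(best.items(), key=lambda item: item[1], reverse=True)


candidates = hybrid_retrieve(
    "waterproof hiking boots",
    [graph_retrieve, vector_retrieve, keyword_retrieve],
)
```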
MCP Smart Customer Service Agent
Architect · 2025.09–now
Root-cause analysis of 100+ real conversations identified the bottleneck as information silos, not model capability. Built 7 MCP-protocol data connectors (subscriptions, orders, logistics, accounts, products, Shopify sync, publish state) to feed context to the Fin AI Agent. Two-layer routing: a keyword fast path for high-frequency intents (orders/logistics/subscriptions) plus an LLM Agent fallback for complex intents (complaints/refunds). Strict isolation (order ownership validation) and PII redaction (no internal IDs, wholesale prices, or domestic-segment logistics exposed).
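A minimal sketch of the two-layer routing idea, assuming plain substring matching on the user message. The keyword lists, the `route` function, and the choice to check escalation terms before the fast path are illustrative assumptions, not the production rules.

```python
# Hypothetical keyword tables for illustration only.
ESCALATION_KEYWORDS = ("refund", "complain", "complaint")
FAST_PATH_KEYWORDS = {
    "orders": ("order", "purchase"),
    "logistics": ("shipping", "tracking", "delivery"),
    "subscriptions": ("subscription", "renewal"),
}


def route(message):
    """Return (layer, intent): keyword fast path first, LLM agent otherwise."""
    text = message.lower()

    # Complex intents go straight to the LLM agent, even if the message
    # also mentions an order or a shipment.
    if any(keyword in text for keyword in ESCALATION_KEYWORDS):
        return ("llm_agent", None)

    for intent, keywords in FAST_PATH_KEYWORDS.items():
        if any(keyword in text for keyword in keywords):
            return ("fast_path", intent)

    # Nothing matched: fall back to the LLM agent.
    return ("llm_agent", None)


decision = route("Where is my order?")  # ("fast_path", "orders")
```

Checking escalation terms first keeps a message such as "I want a refund for my order" away from the fast path, at the cost of sending a few simple queries to the slower layer.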
Multi-Agent Product Inspection
Senior Algorithm Engineer · 2021.03–2024.03
Multimodal product governance for tens of millions of SKUs. Built a compliance knowledge base covering 50+ categories by parsing 200+ national standards documents via PP-Structure (layout analysis, with multi-modal LLMs extracting tables and charts). Fine-tuned Qwen 7B for review-based intent recognition and attribute-level sentiment analysis. AutoGen orchestrated a four-agent workflow (product parse → risk assess → compliance retrieve → inspection recommend). BGE-based RAG matched standards documents to product attributes. A bad-case feedback loop drove automatic rule updates.
Technical notes on Agent RL, LLM post-training, and agent systems. Read the blog →