My RL-trained multi-agent coding model Orca-Agent-v0.1-14B reached a 167% higher relative score than its base model on Stanford's TerminalBench. I've open-sourced everything.
*What I did:*
- I trained a 14B orchestrator model to better coordinate explorer & coder subagents (the subagents are exposed to the orchestrator as tool calls; a rough sketch of that interface follows this list)
- Scaled to 32x H100s that were pushed to their limits across 4 bare-metal nodes
- Scaled to 256 Docker environments rolling out simultaneously, automatically distributed across the cluster
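To make "subagents are tool calls" concrete, here is a minimal sketch of how the two subagent types might appear in the orchestrator's tool list. The names and fields are illustrative assumptions, not the exact schema used in the repo:
```python
# Illustrative only: a hypothetical tool list exposing the two subagent types to
# the orchestrator model. Names and fields are mine, not the repo's exact schema.
SUBAGENT_TOOLS = [
    {
        "name": "launch_explorer",  # hypothetical tool name
        "description": "Spawn a read/run-only agent that investigates the repo and reports back.",
        "parameters": {
            "type": "object",
            "properties": {
                "instructions": {
                    "type": "string",
                    "description": "What to investigate and what findings to report back.",
                },
            },
            "required": ["instructions"],
        },
    },
    {
        "name": "launch_coder",  # hypothetical tool name
        "description": "Spawn an agent that edits files and runs commands to implement a change.",
        "parameters": {
            "type": "object",
            "properties": {
                "instructions": {
                    "type": "string",
                    "description": "What to implement and how to verify it.",
                },
            },
            "required": ["instructions"],
        },
    },
]
```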
*Key results:*
- Qwen3-14B jumped from *7% → 18.25%* on TerminalBench after training
- Model now within striking distance of Qwen3-Coder-480B (19.7%)
- Training was stable with smooth entropy decrease and healthy gradient norms
*Training approach:*
Reward design (and my biggest learning): I kept it simple - *just unit tests*. Every "smart" reward signal I tried to craft led to policy collapse.
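To make that concrete, here is a minimal sketch of the kind of reward I mean, assuming each finished rollout reports the pass/fail results of the task's unit tests (names are illustrative, not the repo's exact code):
```python
# Minimal sketch of the "just unit tests" reward (illustrative, not the repo's
# exact code). Assumes each finished rollout reports the pass/fail result of
# every unit test defined for its task.

def unit_test_reward(test_results: list[bool]) -> float:
    """Return 1.0 only if all of the task's unit tests pass, else 0.0."""
    if not test_results:
        return 0.0
    return 1.0 if all(test_results) else 0.0
```
No partial credit and no hand-crafted shaping terms - anything "smarter" than this collapsed the policy in my runs.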
Curriculum learning:
- Stage-1: Tasks where the base model succeeded on 1-2 out of 3 attempts (41 tasks)
- Stage-2: Tasks where the Stage-1 model succeeded on 1-4 out of 5 attempts
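A rough sketch of that filtering, assuming you have per-task success counts from re-running the relevant model a few times on each candidate task (the names are illustrative, not the repo's):
```python
# Illustrative curriculum filter: keep tasks the current model solves sometimes
# but not always, so every batch of rollouts still carries learnable signal.

def select_curriculum(success_counts: dict[str, int], attempts: int,
                      min_successes: int, max_successes: int) -> list[str]:
    """Return the task ids whose success count over `attempts` runs is in range."""
    assert all(0 <= wins <= attempts for wins in success_counts.values())
    return [
        task_id
        for task_id, wins in success_counts.items()
        if min_successes <= wins <= max_successes
    ]

# Stage-1: base model evaluated 3x per task, keep tasks solved 1-2 times.
# stage1_tasks = select_curriculum(base_counts, attempts=3, min_successes=1, max_successes=2)
# Stage-2: Stage-1 checkpoint evaluated 5x per task, keep tasks solved 1-4 times.
# stage2_tasks = select_curriculum(stage1_counts, attempts=5, min_successes=1, max_successes=4)
```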
Dataset: Used synthetically generated RL environments and unit tests
*More details:*
I have added lots more details in the repo linked to this submission, including training code, model weights, and datasets.
Huge thanks to:
- Tara for providing the compute
- Prime Intellect team for building prime-rl and dealing with my endless questions
- Alex Dimakis for the conversation that sparked training the orchestrator model
Thanks for reading!
Dan
(Evaluated on the excellent TerminalBench benchmark by Stanford & Laude Institute)
Hitting a million brick walls with multi-turn RL training isn't fun, so I thought I would try something new to climb Stanford's leaderboard for now! So this weekend I was just tinkering with multi-agent systems and... somehow ended up beating Claude Code on Stanford's TerminalBench leaderboard (#12)! Genuinely didn't expect this - started as a fun experiment and ended up with something that works surprisingly well.
*What I did:*
- Built a multi-agent AI system with three specialised agents:
- Orchestrator: The brain - never touches code, just delegates and coordinates
- Explorer agents: Read- and run-only investigators that gather intel
- Coder agents: The ones who actually implement stuff
- Created a "Context Store", which can be thought of as persistent memory that lets agents share their discoveries (a rough sketch follows this list).
- Tested on TerminalBench with both Claude Sonnet-4 and Qwen3-Coder-480B.
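To give a feel for the Context Store, here is a minimal sketch under the assumption that it is essentially a persistent key-value store of knowledge artifacts; the class and method names are mine, not the repo's:
```python
import json
from pathlib import Path

# Minimal sketch of a persistent "Context Store" (illustrative names, not the
# repo's actual implementation): subagents write knowledge artifacts, and the
# orchestrator hands selected artifacts to later subagents.

class ContextStore:
    def __init__(self, path: str = "context_store.json"):
        self.path = Path(path)
        self.artifacts: dict[str, str] = (
            json.loads(self.path.read_text()) if self.path.exists() else {}
        )

    def put(self, artifact_id: str, content: str) -> None:
        """Store a knowledge artifact returned by a subagent."""
        self.artifacts[artifact_id] = content
        self.path.write_text(json.dumps(self.artifacts, indent=2))

    def get(self, artifact_ids: list[str]) -> str:
        """Concatenate selected artifacts into a context block for a new subagent."""
        return "\n\n".join(self.artifacts[a] for a in artifact_ids if a in self.artifacts)
```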
*Key results:*
- Orchestrator + Sonnet-4: 36.0% success rate (#12 on leaderboard, ahead of Claude Code!)
- Sonnet-4 consumed 93.2M tokens vs Qwen's 14.7M tokens to complete all tasks!
- The orchestrator's explicit task delegation + intelligent context sharing between subagents seems to be the secret sauce
*(Kind of) Technical details:*
- The orchestrator can't read/write code directly - this forces proper delegation patterns and strategic planning
- Each agent gets precise instructions about what "knowledge artifacts" to return; these artifacts are then stored and can be provided to future subagents when they are launched.
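Put together, the flow looks roughly like this heavily simplified sketch (stubbed subagent calls and hypothetical names; in reality the orchestrator model decides what to launch via its tool calls):
```python
# Heavily simplified sketch of the delegation flow (stubbed subagents and my own
# function names, not the repo's). The orchestrator never edits files itself: it
# only launches subagents, stores the artifacts they return, and forwards
# selected artifacts to later subagents.

def run_explorer(instructions: str, context: str) -> str:
    # Stub: in reality this spawns a read/run-only agent inside the task container.
    return f"EXPLORER FINDINGS for: {instructions} (context given: {len(context)} chars)"

def run_coder(instructions: str, context: str) -> str:
    # Stub: in reality this spawns an agent that edits files and runs commands.
    return f"CODER REPORT for: {instructions} (context given: {len(context)} chars)"

def solve_task() -> dict[str, str]:
    artifacts: dict[str, str] = {}  # stand-in for the Context Store

    # 1. Investigate first; the explorer is told exactly what artifact to return.
    artifacts["explore-1"] = run_explorer(
        "Locate the failing tests and summarise the modules they exercise.",
        context="",
    )

    # 2. Delegate the implementation, handing over the stored findings.
    artifacts["code-1"] = run_coder(
        "Fix the failing tests described in explore-1.",
        context=artifacts["explore-1"],
    )
    return artifacts
```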
Exactly my first thought when I realised the cost!
Currently LoRA is not supported by rLLM (the team told me they aim to support it in the next release), but it is certainly possible to port to verl directly or to another RL framework. I just did not have the time to port again (I have already done so twice, as other RL frameworks had issues).