Contributions
- Benchmark protocol: a reproducible evaluation suite for cooperative LLM play in Hanabi (17 open weight and proprietary models, 2β5 players, fixed-seed settings, self-play + cross-play).
- Scaffolded diagnosis: Our Holmesian scaffolds (Watson/Sherlock/Mycroft) distinguish between LLMs that reason with provided deductions and those that can maintain implicit beliefs over long horizons.
- ποΈ New training data release: HanabiLogs (LLM game playing trajectories for SFT) and HanabiRewards (move-level utility annotations as rewards for RL).
- Actionable post-training baseline: Qwen3-4B improvements quantify what small open models gain from cooperative trajectory and reward supervision.
- Transfer signal: cooperative-reasoning post-training improves multiple out-of-domain tasks, supporting Hanabi as a practical post-training substrate.
Why Hanabi?
Popular single-agent benchmarks currently do not evaluate a specific but important type of intelligence: multiple LLM agents cooperating to solve a single task with partial or incomplete information about their environment and other agents. Hanabi is extremely well suited to this task: between 2 and 5 players hold cards facing outward, visible to everyone but themselves, and must build five color-ordered βfireworksβ using only color or rank hints from a finite pool of information tokens. Success requires tracking hidden information, inferring teammate intent, and coordinating through sparse signals.
Specialized RL agents reach ~24/25 in 2-player self-play1 but degrade sharply with more players or unfamiliar partners. In this work, we focus on the question: how good are general-purpose LLMs as cooperative agents, and what limits them?
Holmesian Scaffolds
We progressively scale the context an agent receives, from minimal state to engine-provided deductions to fully implicit multi-turn state tracking. Each scaffold isolates a different capability.
Watson
Minimal context: game state, visible hands, and explicit knowledge from clues. Nothing else. This establishes a lower bound on what LLMs can do without scaffolding.
Sherlock
Adds engine-computed deductions (per-card "could be" possibilities), Hanabi strategies, and a Bayesian step-by-step prompt. This establishes an upper bound with rich prefill.
Mycroft
No engine deductions. The agent must implicitly track its own and teammates' beliefs across turns via a structured "scratch pad," closer to how humans actually play Hanabi.
Watson & Sherlock
Watson and Sherlock differ in one key way, i.e., whether the agent receives a programmatic belief state. Sherlock is provided, for every card in every hand, the colors and ranks still consistent with the clue history.2 The agent is then prompted to do Bayesian-style probabilistic reasoning over these candidates before acting.
Mycroft
Mycroft removes Sherlockβs dependency on deductions from an external game engine2. Each turn, the agent receives the previous turnβs game state, its own deductions for every player, move ratings, its chosen action, and the reasoning for its choice. It must then produce updated deductions, ratings, and an action for that turn. This forces the LLM to be its own deductive game engine, tracking belief shifts and card position changes (cards slide left after a play or discard) across 60+ turns.
Benchmark results
We evaluate 17 LLMs (open-weights and proprietary, 4B to 600B+, both reasoning and non-reasoning) across 2 to 5 player self-play, with 10 fixed seeds per configuration. Reasoning models clear ~13/25 in Watson; non-reasoning models generally stall below 10/25. Performance tends to drop as the number of players increase (tracking information is harder!), though there are exceptions (e.g. Grok 3 Mini).
| Model | 2-Player | 3-Player | 4-Player | 5-Player |
|---|---|---|---|---|
| Mistral Medium 3 | 2.2 | 1.9 | 1.7 | 1.2 |
| Gemini 2.0 Flash | 4.5 | 3.7 | 3.3 | 3.6 |
| Llama-4 Maverick | 3.8 | 4.4 | 5.9 | 4.8 |
| GPT-4o | 5.3 | 4.6 | 5.3 | 4.9 |
| DeepSeek-V3 | 5.9 | 6.3 | 4.3 | 5.0 |
| GPT-4.1 mini | 10.8 | 8.3 | 8.2 | 7.2 |
| Claude Sonnet 3.7 | 10.7 | 9.2 | 8.5 | 6.9 |
| Qwen-32B | 9.9 | 9.0 | 8.8 | 9.2 |
| Grok 3 | 9.9 | 10.6 | 9.3 | 8.0 |
| GPT-4.1 | 12.1 | 11.8 | 10.0 | 8.2 |
| Gemini 2.5 Flash | 12.8 | 13.8 | 13.0 | 12.7 |
| Gemini 2.5 Pro | 13.2 | 13.9 | 12.9 | 12.9 |
| Qwen-235B-A22B | 15.0 | 14.6 | 13.0 | 12.9 |
| Grok 3 Mini | 14.2 | 13.9 | 14.5 | 14.8 |
| DeepSeek-R1 | 14.2 | 15.3 | 14.1 | 13.4 |
| o4-mini | 15.0 | 15.5 | 14.5 | 13.9 |
| o3 | 15.9 | 15.3 | 16.4 | 13.9 |
| Model | 2-Player | 3-Player | 4-Player | 5-Player |
|---|---|---|---|---|
| Mistral Medium 3 | 4.1 | 4.8 | 5.3 | 5.4 |
| Gemini 2.0 Flash | 4.2 | 3.3 | 4.0 | 4.3 |
| Llama-4 Maverick | 4.9 | 5.2 | 5.4 | 5.6 |
| GPT-4o | 4.4 | 4.1 | 4.5 | 4.6 |
| DeepSeek-V3 | 3.9 | 4.2 | 5.4 | 5.8 |
| GPT-4.1 mini | 6.5 | 6.1 | 5.1 | 5.8 |
| Claude Sonnet 3.7 | 5.4 | 5.4 | 5.4 | 5.6 |
| Qwen-32B | 5.6 | 13.1 | 5.4 | 12.1 |
| Grok 3 | 12.8 | 8.0 | 13.3 | 5.6 |
| GPT-4.1 | 14.8 | 16.4 | 15.5 | 14.4 |
| Gemini 2.5 Flash | 8.4 | 6.6 | 7.7 | 5.6 |
| Gemini 2.5 Pro | 12.8 | 16.2 | 16.9 | 14.4 |
| Qwen-235B-A22B | 14.6 | 16.6 | 16.7 | 13.3 |
| Grok 3 Mini | 14.4 | 16.6 | 17.4 | 15.5 |
| DeepSeek-R1 | 17.5 | 16.6 | 15.6 | 15.1 |
| o4-mini | 14.6 | 18.0 | 14.1 | 13.0 |
| o3 | 17.6 | 17.6 | 16.8 | 15.7 |
| Model | 2-Player | 3-Player | 4-Player | 5-Player |
|---|---|---|---|---|
| o4-mini | 10.8 | 12.4 | 11.3 | 10.9 |
| Grok 3 Mini | 14.2 | 16.5 | 14.5 | 14.4 |
| Gemini 2.5 Pro | 10.2 | 13.4 | 14.1 | 11.6 |
| Gemini 2.5 Flash | 11.8 | 13.2 | 12.3 | 9.8 |
| o3 | 16.3 | 16.4 | 15.5 | 14.7 |
Ablations
Cross-play
The self-play assumption that all players are essentially identical (the same LLM) is strong and does not hold in real-world ad hoc cooperative settings (no humans are identical!). We thus extend our evaluation to βcross-playβ, i.e., teams with LLMs of disparate Hanabi playing competence. Specifically, we compose 2-5 player teams of one strong LLM (Grok 3 Mini) and the rest, a weaker LLM (o4-mini). Across all player counts, adding one stronger agent improves a teamβs score by 1.7 points on average (see Fig. 8 below). Performance smoothly interpolates between the weak and strong self-play baselines (o4-mini and Grok 3 Mini respectively), unlike specialized RL agents which collapse with unfamiliar partners.
Best-of-K
Can we get better performance by majority voting over K move candidates by sampling the agent k times? We provide these K chosen moves and reasoning to the agent and ask it to pick the optimal move with the best strategic thinking. With Watson, performance climbs through K=5 (+1.5 on average) and then plateaus. With Sherlock, gains are negligible (+0.1) because a well-designed scaffold with verifiable deductive reasoning tends to converge to the same chosen action regardless of how many times we sample the LLM. Better context beats best-of-k sampling!
Mixture of Agents
Inspired by Mixture of Agents, we assign five specialized roles to sub-agents that execute in parallel (Baseline, Rank-Focused, Analyst, Discard Strategist, History Analyst) and aggregate their proposals via a sixth "Aggregator" agent.
MoA modestly improves the 5-player setting (+1.1 with Watson, +0.8 with Sherlock over Best-of-5) but introduces high variance: speculative high-risk sub-agents (especially the History Analyst) occasionally mislead the aggregator and tank a run.
Encouraging agent move selection diversity can sometimes help, but there is a fine line between diversity and unreliability.
Post-training on a 4B LLM closes the gap to Frontier models
To validate our datasets, we post-train Qwen3-4B-Instruct-2507, a small non-reasoning model, on data we collect from o3 and Grok 3 Mini:
- HanabiLogs (1,520+ game trajectories): used for supervised finetuning (SFT).
- HanabiRewards (560+ games with dense move-level utility annotations): used for Reinforcement Learning with Verifiable Rewards via GRPO.
The Mycroft base model scores a very low 1.7/25, indicating low base Hanabi competence. After RL on HanabiRewards it reaches 8.3/25, a +388% jump that lands within ~3 points of o4-mini (11.3) and surpasses GPT-4.1 (the best non-reasoning baseline) by +88%. In Sherlock, the same model jumps from 4.8 to 12.3 (+156%), comparable to Grok 3 and beating GPT-4o.
Generalizing Beyond Hanabi
Now for the big (and fun) question: what else does getting really good at Hanabi teach the LLM? As it turns out, training on our new HanabiRewards data improves scores on four out-of-domain benchmarks:
| Model | Group Guess (1st / 2nd run) |
EventQA (64K / 128K / 800K) |
IFBench (Avg / Pass@10) |
AIME 2025 (Avg / Pass@10) |
|---|---|---|---|---|
| Base | 61.0 / 60.5 | 84.0 / 62.6 / 37.2 | 30.9 / 42.9 | 48.7 / 73.3 |
| Ours-RL | 73.0 / 71.5 | 85.6 / 66.8 / 43.6 | 31.5 / 44.6 | 50.0 / 73.3 |
| Ξ | +12.0 / +11.0 | +1.6 / +4.2 / +6.4 | +0.6 / +1.7 | +1.3 / +0.0 |
Our post-trained modelβs temporal-reasoning ability (EventQA) grows with context length (+1.6 β +4.2 β +6.4 from 64K β 128K β 800K), providing evidence that encouraging the LLM to implicitly track Hanabi state over long games (60+ turns) generalizes to long-horizon belief tracking in other tasks. Our post-trained model also shows strong gains on a held-out cooperative task (Group Guessing game) and general instruction-following capabilities (IFBench), with small mathematical reasoning improvements (AIME 2025).
Takeaways
- Modern reasoning LLMs show sparks of cooperative reasoning, but reliable multi-agent coordination remains unsolved. The best LLMs score between 15 and 18 out of 25 in self-play, comfortably below specialized RL agents (>23) and the median human Hanabi player (~18 to 21).
- Scaffold design matters more than model scale. Moving from Watson to Sherlock improves reasoning models by +2.0 on average; the same scaffold hurts most non-reasoning models. Different families respond differently to identical context.
- Implicit state and belief tracking is an open and important problem, especially over many turns. Even a strong reasoning model like o3 drops 1.2 points when moving from engine-provided deductions to self-tracking and Gemini 2.5 Pro drops 3.7 points. Multi-turn belief maintenance is where current models break.
- Cross-play interpolates gracefully. Unlike specialized RL agents, LLMs interpolate smoothly between weak and strong teammates, showing a small but real βsparkβ of cooperative generalization.
- A 4B model can carry surprising weight. Post-training on our new datasets closes most of the gap to frontier reasoning models on Hanabi and transfers to general-purpose temporal reasoning, instruction following and mathematical reasoning, as well as out-of-domain cooperative tasks.
Citation
@misc{ramesh2026sparkscooperativereasoningllms,
title={Sparks of Cooperative Reasoning: LLMs as Strategic Hanabi Agents},
author={Mahesh Ramesh and Kaousheik Jayakumar and Aswinkumar Ramkumar and Pavan Thodima and Aniket Rege and Emmanouil-Vasileios Vlatakis-Gkaragkounis},
year={2026},
eprint={2601.18077},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2601.18077},
}
-
Self-Play refers to a Hanabi game where all players/agents have the same LLM backbone (e.g. GPT-4o).Β ↩
-
Sherlockβs programmatic candidate sets are computed with Google DeepMindβs Hanabi Learning Environment.Β ↩Β ↩2
Sparks of Cooperative Reasoning: LLMs as Strategic Hanabi Agents