Contributions

  1. Benchmark protocol: a reproducible evaluation suite for cooperative LLM play in Hanabi (17 open-weight and proprietary models, 2–5 players, fixed-seed settings, self-play + cross-play).
  2. Scaffolded diagnosis: Our Holmesian scaffolds (Watson/Sherlock/Mycroft) distinguish between LLMs that reason with provided deductions and those that can maintain implicit beliefs over long horizons.
  3. New training data release: HanabiLogs (LLM game-playing trajectories for SFT) and HanabiRewards (move-level utility annotations as rewards for RL).
  4. Actionable post-training baseline: Qwen3-4B improvements quantify what small open models gain from cooperative trajectory and reward supervision.
  5. Transfer signal: cooperative-reasoning post-training improves multiple out-of-domain tasks, supporting Hanabi as a practical post-training substrate.

Why Hanabi?

Popular single-agent benchmarks currently do not evaluate a specific but important type of intelligence: multiple LLM agents cooperating to solve a single task with partial or incomplete information about their environment and other agents. Hanabi is extremely well suited to this task: between 2 and 5 players hold cards facing outward, visible to everyone but themselves, and must build five color-ordered “fireworks” using only color or rank hints from a finite pool of information tokens. Success requires tracking hidden information, inferring teammate intent, and coordinating through sparse signals.

Specialized RL agents reach ~24/25 in 2-player self-play [1] but degrade sharply with more players or unfamiliar partners. In this work, we focus on the question: how good are general-purpose LLMs as cooperative agents, and what limits them?

Holmesian Scaffolds

We progressively scale the context an agent receives, from minimal state to engine-provided deductions to fully implicit multi-turn state tracking. Each scaffold isolates a different capability.

01 · BASELINE

Watson

Minimal context: game state, visible hands, and explicit knowledge from clues. Nothing else. This establishes a lower bound on what LLMs can do without scaffolding.

02 · SCAFFOLDED

Sherlock

Adds engine-computed deductions (per-card "could be" possibilities), Hanabi strategies, and a Bayesian step-by-step prompt. This establishes an upper bound with rich prefill.

03 · IMPLICIT

Mycroft

No engine deductions. The agent must implicitly track its own and teammates' beliefs across turns via a structured "scratch pad," closer to how humans actually play Hanabi.

Watson & Sherlock

Watson and Sherlock differ in one key way: whether the agent receives a programmatic belief state. For every card in every hand, Sherlock provides the colors and ranks still consistent with the clue history [2]. The agent is then prompted to do Bayesian-style probabilistic reasoning over these candidates before acting.

Watson vs Sherlock prompt comparison
Figure 1. Watson provides only explicit knowledge (clues received). Sherlock additionally provides a Deductive Context block (the per-card belief state) and enforces Bayesian-style step-by-step reasoning.
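
To make the per-card belief state concrete, here is a minimal sketch of how candidate sets can be narrowed from a clue history. It is an illustration only, not the Hanabi Learning Environment code: the function name `candidates_after_clues`, the clue tuple layout, and the five-color deck are assumptions, and it ignores card counting from visible hands and discards.

```python
from itertools import product

COLORS = ["R", "Y", "G", "W", "B"]  # assumed standard five-color deck
RANKS = [1, 2, 3, 4, 5]


def candidates_after_clues(clues):
    """Return the (color, rank) pairs still consistent with one card's clue history.

    `clues` is a list of (kind, value, touched) tuples (hypothetical layout):
    kind is "color" or "rank", value is the clued color or rank, and touched
    says whether the clue pointed at this card. A clue that skips the card
    is negative evidence and rules that value out.
    """
    possible = set(product(COLORS, RANKS))
    for kind, value, touched in clues:
        index = 0 if kind == "color" else 1
        possible = {card for card in possible if (card[index] == value) == touched}
    return possible


# One card's history: touched by a "red" clue, skipped by a "rank 1" clue.
history = [("color", "R", True), ("rank", 1, False)]
print(sorted(candidates_after_clues(history)))
# [('R', 2), ('R', 3), ('R', 4), ('R', 5)]
```

A Watson-style prompt would carry only the explicit part of this information (the card was clued red), while a Sherlock-style prompt would list the full remaining candidate set for each card.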

Mycroft

Mycroft removes Sherlock’s dependency on deductions from an external game engine [2]. Each turn, the agent receives the previous turn’s game state, its own deductions for every player, move ratings, its chosen action, and the reasoning for its choice. It must then produce updated deductions, ratings, and an action for that turn. This forces the LLM to be its own deductive game engine, tracking belief shifts and card position changes (cards slide left after a play or discard) across 60+ turns.

Mycroft scratch pad example
Figure 2. A Mycroft turn from Player 1's perspective. The agent maintains an independent deduction block for every other player and must update card positions implicitly after plays and discards.
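
The position bookkeeping that Mycroft asks the LLM to do implicitly can be sketched in a few lines. This is a hypothetical helper, not the paper's code: the name `remove_and_shift`, the string encoding of beliefs, and the assumption that a newly drawn card enters the rightmost slot are all illustrative choices.

```python
def remove_and_shift(beliefs, removed_index, new_card_belief=None):
    """Update a hand's per-slot beliefs after a play or discard.

    The belief at `removed_index` is dropped, later slots slide left by one,
    and (if the deck is not empty) the drawn card's belief is appended on
    the right. The input list is not modified.
    """
    updated = beliefs[:removed_index] + beliefs[removed_index + 1:]
    if new_card_belief is not None:
        updated.append(new_card_belief)
    return updated


# Beliefs for a 4-card hand, one entry per slot ("??" means nothing known).
hand = ["R?", "?1", "??", "B3"]
print(remove_and_shift(hand, removed_index=1, new_card_belief="??"))
# ['R?', '??', 'B3', '??']
```

An agent that forgets this shift keeps attributing old clues to the wrong slot, which is one way multi-turn tracking errors can accumulate.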

Benchmark results

We evaluate 17 LLMs (open-weight and proprietary, 4B to 600B+, both reasoning and non-reasoning) in self-play with 2 to 5 players, using 10 fixed seeds per configuration. Reasoning models clear ~13/25 in Watson; non-reasoning models generally stall below 10/25. Performance tends to drop as the number of players increases (tracking information is harder!), though there are exceptions (e.g., Grok 3 Mini).
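
The evaluation grid described above (player counts 2 to 5, 10 fixed seeds each, scores averaged per configuration) can be sketched as a small loop. Everything here is a stand-in: `play_game` is a hypothetical callable that runs one full game and returns its final score, the seed values 0 to 9 are assumed, and `dummy_play_game` only fakes a score so the sketch runs.

```python
from statistics import mean

PLAYER_COUNTS = (2, 3, 4, 5)
SEEDS = range(10)  # assumed seed values; the text only states 10 fixed seeds


def evaluate(play_game, model):
    """Return the average final score per player count for one model in self-play."""
    return {
        n: mean(play_game(model=model, num_players=n, seed=s) for s in SEEDS)
        for n in PLAYER_COUNTS
    }


def dummy_play_game(model, num_players, seed):
    """Stand-in for a real Hanabi rollout: returns a fake score in [0, 25]."""
    return (seed * 7 + num_players) % 26


scores = evaluate(dummy_play_game, model="example-model")
for n, score in scores.items():
    print(f"{n}-player: {score:.1f}")
```

Fixing the seeds means every model sees the same deck orderings, so differences in the averages reflect the agents rather than the shuffle.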

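Watson

| Model | 2-Player | 3-Player | 4-Player | 5-Player |
|---|---|---|---|---|
| Mistral Medium 3 | 2.2 | 1.9 | 1.7 | 1.2 |
| Gemini 2.0 Flash | 4.5 | 3.7 | 3.3 | 3.6 |
| Llama-4 Maverick | 3.8 | 4.4 | 5.9 | 4.8 |
| GPT-4o | 5.3 | 4.6 | 5.3 | 4.9 |
| DeepSeek-V3 | 5.9 | 6.3 | 4.3 | 5.0 |
| GPT-4.1 mini | 10.8 | 8.3 | 8.2 | 7.2 |
| Claude Sonnet 3.7 | 10.7 | 9.2 | 8.5 | 6.9 |
| Qwen-32B | 9.9 | 9.0 | 8.8 | 9.2 |
| Grok 3 | 9.9 | 10.6 | 9.3 | 8.0 |
| GPT-4.1 | 12.1 | 11.8 | 10.0 | 8.2 |
| Gemini 2.5 Flash | 12.8 | 13.8 | 13.0 | 12.7 |
| Gemini 2.5 Pro | 13.2 | 13.9 | 12.9 | 12.9 |
| Qwen-235B-A22B | 15.0 | 14.6 | 13.0 | 12.9 |
| Grok 3 Mini | 14.2 | 13.9 | 14.5 | 14.8 |
| DeepSeek-R1 | 14.2 | 15.3 | 14.1 | 13.4 |
| o4-mini | 15.0 | 15.5 | 14.5 | 13.9 |
| o3 | 15.9 | 15.3 | 16.4 | 13.9 |

Average scores over 10 seeds per configuration.
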
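Sherlock

| Model | 2-Player | 3-Player | 4-Player | 5-Player |
|---|---|---|---|---|
| Mistral Medium 3 | 4.1 | 4.8 | 5.3 | 5.4 |
| Gemini 2.0 Flash | 4.2 | 3.3 | 4.0 | 4.3 |
| Llama-4 Maverick | 4.9 | 5.2 | 5.4 | 5.6 |
| GPT-4o | 4.4 | 4.1 | 4.5 | 4.6 |
| DeepSeek-V3 | 3.9 | 4.2 | 5.4 | 5.8 |
| GPT-4.1 mini | 6.5 | 6.1 | 5.1 | 5.8 |
| Claude Sonnet 3.7 | 5.4 | 5.4 | 5.4 | 5.6 |
| Qwen-32B | 5.6 | 13.1 | 5.4 | 12.1 |
| Grok 3 | 12.8 | 8.0 | 13.3 | 5.6 |
| GPT-4.1 | 14.8 | 16.4 | 15.5 | 14.4 |
| Gemini 2.5 Flash | 8.4 | 6.6 | 7.7 | 5.6 |
| Gemini 2.5 Pro | 12.8 | 16.2 | 16.9 | 14.4 |
| Qwen-235B-A22B | 14.6 | 16.6 | 16.7 | 13.3 |
| Grok 3 Mini | 14.4 | 16.6 | 17.4 | 15.5 |
| DeepSeek-R1 | 17.5 | 16.6 | 15.6 | 15.1 |
| o4-mini | 14.6 | 18.0 | 14.1 | 13.0 |
| o3 | 17.6 | 17.6 | 16.8 | 15.7 |

Average scores over 10 seeds per configuration.
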
Mycroft (reasoning models only)

| Model | 2-Player | 3-Player | 4-Player | 5-Player |
|---|---|---|---|---|
| o4-mini | 10.8 | 12.4 | 11.3 | 10.9 |
| Grok 3 Mini | 14.2 | 16.5 | 14.5 | 14.4 |
| Gemini 2.5 Pro | 10.2 | 13.4 | 14.1 | 11.6 |
| Gemini 2.5 Flash | 11.8 | 13.2 | 12.3 | 9.8 |
| o3 | 16.3 | 16.4 | 15.5 | 14.7 |

Mycroft evaluated on the top 5 reasoning models only. Average scores over 10 seeds.

Table 2. Average scores (out of 25) across all three scaffolds. Watson provides minimal context; Sherlock adds deductive beliefs; Mycroft requires fully implicit state tracking. Best in each column is highlighted.

Ablations

Cross-play

The self-play assumption that all players are essentially identical (the same LLM) is strong and does not hold in real-world ad hoc cooperative settings (no two humans are identical!). We therefore extend our evaluation to “cross-play”, i.e., teams of LLMs with disparate Hanabi-playing competence. Specifically, we compose 2-5 player teams of one strong LLM (Grok 3 Mini), with the remaining seats filled by a weaker LLM (o4-mini). Across all player counts, adding one stronger agent improves a team’s score by 1.7 points on average (see Fig. 8 below). Performance smoothly interpolates between the weak and strong self-play baselines (o4-mini and Grok 3 Mini respectively), unlike specialized RL agents, which collapse with unfamiliar partners.

Cross-play interpolation
Figure 8. Mixed teams score between weak (all o4-mini) and strong (all Grok 3 Mini) self-play, demonstrating that LLM agents cooperate gracefully with unfamiliar partners, in meaningful contrast with traditional self-play RL.

Best-of-K

Can we get better performance by sampling the agent K times and choosing among the K candidate moves? We provide these K candidate moves and their reasoning to the agent and ask it to pick the move backed by the best strategic thinking. With Watson, performance climbs through K=5 (+1.5 on average) and then plateaus. With Sherlock, gains are negligible (+0.1) because a well-designed scaffold with verifiable deductive reasoning tends to converge to the same chosen action regardless of how many times we sample the LLM. Better context beats best-of-K sampling!
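
The sample-then-select loop can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's implementation: `propose` and `select` stand in for LLM calls, the state dictionary layout is hypothetical, and the toy selector simply takes the most frequent move, whereas the procedure above asks the agent itself to judge the candidates and their reasoning.

```python
import random
from collections import Counter


def best_of_k(propose, select, state, k=5):
    """Sample k candidate (move, reasoning) pairs, then let a selector pick one move."""
    candidates = [propose(state) for _ in range(k)]
    return select(state, candidates)


def toy_propose(state):
    """Hypothetical stand-in for one sampled LLM response."""
    move = random.choice(state["legal_moves"])
    return move, f"I think '{move}' is the safest option."


def toy_select(state, candidates):
    """Stand-in selector: return the most frequently proposed move."""
    moves = [move for move, _reasoning in candidates]
    return Counter(moves).most_common(1)[0][0]


state = {"legal_moves": ["play 0", "discard 1", "hint red"]}
print(best_of_k(toy_propose, toy_select, state, k=5))
```

If every sample already lands on the same move, as the text reports for Sherlock, the selector has nothing to choose between and extra samples add cost without changing the outcome.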

Mixture of Agents

Inspired by Mixture of Agents, we assign five specialized roles to sub-agents that execute in parallel (Baseline, Rank-Focused, Analyst, Discard Strategist, History Analyst) and aggregate their proposals via a sixth "Aggregator" agent.

MoA modestly improves the 5-player setting (+1.1 with Watson, +0.8 with Sherlock over Best-of-5) but introduces high variance: speculative high-risk sub-agents (especially the History Analyst) occasionally mislead the aggregator and tank a run.
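
The fan-out and aggregate pattern can be sketched with a thread pool. The five role names come from the text; everything else is assumed for illustration: `call_agent` and `aggregate` stand in for LLM calls, and the toy aggregator takes the most common proposal instead of reasoning over the sub-agents' arguments.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

ROLES = ["Baseline", "Rank-Focused", "Analyst", "Discard Strategist", "History Analyst"]


def mixture_of_agents(call_agent, aggregate, state):
    """Run one sub-agent per role in parallel, then pass all proposals to the aggregator."""
    with ThreadPoolExecutor(max_workers=len(ROLES)) as pool:
        moves = list(pool.map(lambda role: call_agent(role, state), ROLES))
    proposals = dict(zip(ROLES, moves))
    return aggregate(state, proposals)


def toy_agent(role, state):
    """Hypothetical stand-in for a role-prompted LLM call."""
    return "discard 0" if role == "Discard Strategist" else "play 1"


def toy_aggregator(state, proposals):
    """Stand-in aggregator: pick the most common proposal."""
    return Counter(proposals.values()).most_common(1)[0][0]


chosen = mixture_of_agents(toy_agent, toy_aggregator, state={"turn": 12})
print(chosen)  # play 1
```

A vote-counting aggregator like this toy one would dampen a single outlier; an LLM aggregator that weighs arguments can instead be persuaded by one confident but wrong sub-agent, which matches the variance described above.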

Mixture of Agents architecture

Encouraging agent move selection diversity can sometimes help, but there is a fine line between diversity and unreliability.

Post-training a 4B LLM closes the gap to frontier models

To validate our datasets, we post-train Qwen3-4B-Instruct-2507, a small non-reasoning model, on data we collect from o3 and Grok 3 Mini:

  • HanabiLogs (1,520+ game trajectories): used for supervised finetuning (SFT).
  • HanabiRewards (560+ games with dense move-level utility annotations): used for Reinforcement Learning with Verifiable Rewards (RLVR) via GRPO.
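
To show how move-level utilities could feed GRPO, here is a sketch of the group-relative advantage step that GRPO is generally described as using: several moves are sampled for the same game state, each is scored, and rewards are normalized within that group. The reward values below are made up, the function name is hypothetical, and the training configuration actually used with HanabiRewards is not specified here; using the population standard deviation is also an assumption, since implementations differ.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize each sampled move's reward against its own group (GRPO-style).

    `rewards` holds one utility score per move sampled for the same state.
    The eps term avoids division by zero when all rewards are equal.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical utilities for four moves sampled from one game state.
rewards = [1.0, 0.0, 0.5, 0.5]
advantages = group_relative_advantages(rewards)
print([round(a, 2) for a in advantages])
```

Moves that score above their group's mean get a positive advantage and are reinforced, while below-average moves are pushed down, so no separate learned value model is required.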

Under Mycroft, the base model scores just 1.7/25, indicating little Hanabi competence out of the box. After RL on HanabiRewards it reaches 8.3/25, a +388% jump that lands within ~3 points of o4-mini (11.3) and surpasses GPT-4.1 (the best non-reasoning baseline) by +88%. Under Sherlock, the same model jumps from 4.8 to 12.3 (+156%), comparable to Grok 3 and beating GPT-4o.
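
The percentage figures above are relative gains over the base score. A short check of the arithmetic, using only the numbers quoted in the paragraph (the helper name is illustrative):

```python
def relative_gain_percent(before, after):
    """Relative improvement of `after` over `before`, in percent."""
    return 100.0 * (after - before) / before


print(round(relative_gain_percent(1.7, 8.3)))   # Mycroft: 388
print(round(relative_gain_percent(4.8, 12.3)))  # Sherlock: 156
```

Because the Mycroft base score is so low, a 6.6-point absolute gain becomes a very large relative one; the Sherlock gain of 7.5 points is larger in absolute terms but smaller in percent.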

Sherlock post-training results Mycroft post-training results
Figure 9. Qwen3-4B before and after instruction tuning (Ours-SFT) and RLVR (Ours-RL), versus larger proprietary models. Evaluated on held-out seeds to avoid leakage.

Generalizing Beyond Hanabi

Now for the big (and fun) question: what else does getting really good at Hanabi teach the LLM? As it turns out, training on our new HanabiRewards data improves scores on four out-of-domain benchmarks:

Table 1. Qwen3-4B base vs. our RL-finetuned model. Group Guessing is wins/200 games (cooperative); EventQA is 6-way MCQ accuracy at increasing context lengths (temporal reasoning); IFBench is strict instruction-following; AIME 2025 measures math reasoning.
| Model | Group Guess (1st / 2nd run) | EventQA (64K / 128K / 800K) | IFBench (Avg / Pass@10) | AIME 2025 (Avg / Pass@10) |
|---|---|---|---|---|
| Base | 61.0 / 60.5 | 84.0 / 62.6 / 37.2 | 30.9 / 42.9 | 48.7 / 73.3 |
| Ours-RL | 73.0 / 71.5 | 85.6 / 66.8 / 43.6 | 31.5 / 44.6 | 50.0 / 73.3 |
| Δ | +12.0 / +11.0 | +1.6 / +4.2 / +6.4 | +0.6 / +1.7 | +1.3 / +0.0 |

Our post-trained model’s gain in temporal reasoning (EventQA) grows with context length (+1.6 → +4.2 → +6.4 at 64K → 128K → 800K), providing evidence that encouraging the LLM to implicitly track Hanabi state over long games (60+ turns) generalizes to long-horizon belief tracking in other tasks. The model also shows strong gains on a held-out cooperative task (+12.0 / +11.0 on the Group Guessing game), with smaller improvements in general instruction following (IFBench) and mathematical reasoning (AIME 2025).

Takeaways

  1. Modern reasoning LLMs show sparks of cooperative reasoning, but reliable multi-agent coordination remains unsolved. The best LLMs score between 15 and 18 out of 25 in self-play, comfortably below specialized RL agents (>23) and the median human Hanabi player (~18 to 21).
  2. Scaffold design matters more than model scale. Moving from Watson to Sherlock improves reasoning models by +2.0 on average; the same scaffold hurts most non-reasoning models. Different families respond differently to identical context.
  3. Implicit state and belief tracking is an open and important problem, especially over many turns. Even a strong reasoning model like o3 drops 1.2 points when moving from engine-provided deductions to self-tracking, and Gemini 2.5 Pro drops 3.7 points. Multi-turn belief maintenance is where current models break.
  4. Cross-play interpolates gracefully. Unlike specialized RL agents, LLMs interpolate smoothly between weak and strong teammates, showing a small but real “spark” of cooperative generalization.
  5. A 4B model can carry surprising weight. Post-training on our new datasets closes most of the gap to frontier reasoning models on Hanabi and transfers to general-purpose temporal reasoning, instruction following and mathematical reasoning, as well as out-of-domain cooperative tasks.

Citation

BibTeX
@misc{ramesh2026sparkscooperativereasoningllms,
      title={Sparks of Cooperative Reasoning: LLMs as Strategic Hanabi Agents}, 
      author={Mahesh Ramesh and Kaousheik Jayakumar and Aswinkumar Ramkumar and Pavan Thodima and Aniket Rege and Emmanouil-Vasileios Vlatakis-Gkaragkounis},
      year={2026},
      eprint={2601.18077},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2601.18077}, 
}

Footnotes

  1. Self-play refers to a Hanabi game where all players/agents have the same LLM backbone (e.g., GPT-4o).

  2. Sherlock’s programmatic candidate sets are computed with Google DeepMind’s Hanabi Learning Environment.