<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom">
  <channel>
    <title>Kaousheik Jayakumar</title>
    <description></description>
    <link>https://kaousheik-26.github.io//</link>
    <atom:link href="https://kaousheik-26.github.io//feed.xml" rel="self" type="application/rss+xml"/>
    <pubDate>Wed, 13 May 2026 01:09:06 +0000</pubDate>
    <lastBuildDate>Wed, 13 May 2026 01:09:06 +0000</lastBuildDate>
    <generator>Jekyll v3.10.0</generator>
    
      <item>
        <title>Sparks of Cooperative Reasoning: LLMs as Strategic Hanabi Agents</title>
        <description>&lt;div class=&quot;contributions-box&quot; style=&quot;background: var(--bg-elevated); border-left: 3px solid var(--accent); padding: 1.25rem 1.5rem; border-radius: 6px; margin-bottom: 2rem; box-shadow: var(--shadow-sm); border-top: 1px solid var(--border); border-right: 1px solid var(--border); border-bottom: 1px solid var(--border);&quot;&gt;
  &lt;h4 style=&quot;margin-top: 0; color: var(--accent); font-family: &apos;JetBrains Mono&apos;, monospace; font-size: 0.75rem; text-transform: uppercase; letter-spacing: 0.1em; margin-bottom: 0.75rem;&quot;&gt;Contributions&lt;/h4&gt;
  &lt;ol style=&quot;margin-bottom: 0; padding-left: 1.25rem;&quot;&gt;
    &lt;li style=&quot;margin-bottom: 0.4rem;&quot;&gt;Benchmark 17 LLMs as Hanabi agents across 2–5 player settings.&lt;/li&gt;
    &lt;li style=&quot;margin-bottom: 0.4rem;&quot;&gt;Introduce Mycroft, a scaffold for implicit multi-turn state tracking.&lt;/li&gt;
    &lt;li style=&quot;margin-bottom: 0.4rem;&quot;&gt;Study self-play, cross-play, best-of-K, and mixture-of-agent settings.&lt;/li&gt;
    &lt;li style=&quot;margin-bottom: 0.4rem;&quot;&gt;Release trajectories and move-rated data for SFT/RL training.&lt;/li&gt;
    &lt;li&gt;Post-train Qwen3-4B and show gains in Hanabi and transfer tasks.&lt;/li&gt;
  &lt;/ol&gt;
&lt;/div&gt;

&lt;div class=&quot;dataset-card&quot; style=&quot;background: var(--bg-elevated); border: 1px solid var(--border); padding: 1.25rem 1.5rem; border-radius: 6px; margin-bottom: 2.5rem; box-shadow: var(--shadow-sm);&quot;&gt;
  &lt;h4 style=&quot;margin-top: 0; font-family: &apos;JetBrains Mono&apos;, monospace; font-size: 0.75rem; text-transform: uppercase; letter-spacing: 0.1em; margin-bottom: 0.75rem;&quot;&gt;&lt;span style=&quot;margin-right: 6px;&quot;&gt;🗂️&lt;/span&gt; Released data&lt;/h4&gt;
  &lt;ul style=&quot;margin-bottom: 0; padding-left: 1.25rem; list-style-type: none;&quot;&gt;
    &lt;li style=&quot;margin-bottom: 0.4rem; position: relative;&quot;&gt;&lt;span style=&quot;position: absolute; left: -1.25rem; color: var(--accent);&quot;&gt;•&lt;/span&gt; &lt;strong&gt;HanabiLogs:&lt;/strong&gt; LLM gameplay trajectories for SFT&lt;/li&gt;
    &lt;li style=&quot;margin-bottom: 0.4rem; position: relative;&quot;&gt;&lt;span style=&quot;position: absolute; left: -1.25rem; color: var(--accent);&quot;&gt;•&lt;/span&gt; &lt;strong&gt;HanabiRewards:&lt;/strong&gt; move-level ratings / judge scores for RL-style training&lt;/li&gt;
    &lt;li style=&quot;position: relative;&quot;&gt;&lt;span style=&quot;position: absolute; left: -1.25rem; color: var(--accent);&quot;&gt;•&lt;/span&gt; Models include o3, Gemini 2.5 Pro, o4-mini, Grok, DeepSeek, Qwen, and others.&lt;/li&gt;
  &lt;/ul&gt;
&lt;/div&gt;

&lt;h2 id=&quot;why-hanabi&quot;&gt;Why Hanabi&lt;/h2&gt;

&lt;p&gt;Cooperative coordination under partial information is the part of intelligence that single-agent benchmarks miss. &lt;strong&gt;Hanabi&lt;/strong&gt; is the canonical testbed: 2 to 5 players hold their cards facing outward, visible to everyone but themselves, and must build five “fireworks” (one stack per color, played in ascending rank order) using only color or rank hints, each of which spends a token from a finite pool. Success requires tracking hidden information, inferring teammate intent, and coordinating through sparse signals.&lt;/p&gt;

&lt;p&gt;Specialized RL agents reach ~24/25 in 2-player self-play but degrade sharply with more players or unfamiliar partners. We ask a different question: &lt;strong&gt;how good are general-purpose LLMs as cooperative agents, and what limits them?&lt;/strong&gt;&lt;/p&gt;

&lt;h2 id=&quot;three-scaffolds&quot;&gt;Three scaffolds&lt;/h2&gt;

&lt;p&gt;We progressively scale the context an agent receives, from minimal state to engine-provided deductions to fully implicit multi-turn state tracking. Each scaffold isolates a different capability.&lt;/p&gt;

&lt;div class=&quot;settings-grid&quot;&gt;
  &lt;div class=&quot;setting-card&quot;&gt;
    &lt;div class=&quot;ord&quot;&gt;01 · BASELINE&lt;/div&gt;
    &lt;h4&gt;Watson&lt;/h4&gt;
    &lt;p&gt;Minimal context: game state, visible hands, and explicit knowledge from clues. Nothing else. This establishes a lower bound on what LLMs can do without scaffolding.&lt;/p&gt;
  &lt;/div&gt;
  &lt;div class=&quot;setting-card&quot;&gt;
    &lt;div class=&quot;ord&quot;&gt;02 · SCAFFOLDED&lt;/div&gt;
    &lt;h4&gt;Sherlock&lt;/h4&gt;
    &lt;p&gt;Adds engine-computed deductive context (per-card &quot;could be&quot; possibilities), Hanabi strategy notes, and a Bayesian step-by-step prompt. Establishes an upper bound with rich prefill.&lt;/p&gt;
  &lt;/div&gt;
  &lt;div class=&quot;setting-card&quot;&gt;
    &lt;div class=&quot;ord&quot;&gt;03 · IMPLICIT&lt;/div&gt;
    &lt;h4&gt;Mycroft&lt;/h4&gt;
    &lt;p&gt;No engine deductions. The agent must implicitly track its own and teammates&apos; beliefs across turns via a structured &quot;scratch pad,&quot; closer to how humans actually play.&lt;/p&gt;
  &lt;/div&gt;
&lt;/div&gt;

&lt;h3 id=&quot;watson-sherlock&quot;&gt;Watson &amp;amp; Sherlock&lt;/h3&gt;

&lt;p&gt;Watson and Sherlock differ in exactly one respect: whether the agent receives a programmatic belief state. Sherlock’s deductive context lists, for every card in every hand, the colors and ranks still consistent with the clue history. The agent is then prompted to reason in a Bayesian, step-by-step fashion over those candidates before acting.&lt;/p&gt;
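
&lt;p&gt;To make the deductive context concrete, here is a minimal sketch of the kind of per-card candidate set the engine supplies. The clue format below (a list of tuples per card) is our own illustration, not the engine’s actual interface.&lt;/p&gt;

&lt;pre&gt;COLORS = [&quot;R&quot;, &quot;Y&quot;, &quot;G&quot;, &quot;W&quot;, &quot;B&quot;]
RANKS = [1, 2, 3, 4, 5]

# Prune the full (color, rank) space for one card using its clue history.
# Each clue is (kind, value, touched): touched=True means the clue pointed at
# this card, touched=False means the clue skipped it.
def could_be(clues):
    candidates = {(c, r) for c in COLORS for r in RANKS}
    for kind, value, touched in clues:
        if kind == &quot;color&quot;:
            candidates = {cr for cr in candidates if (cr[0] == value) == touched}
        else:  # rank clue
            candidates = {cr for cr in candidates if (cr[1] == value) == touched}
    return sorted(candidates)

# A card touched by a red clue but skipped by a 2 clue: red, any rank except 2.
print(could_be([(&quot;color&quot;, &quot;R&quot;, True), (&quot;rank&quot;, 2, False)]))&lt;/pre&gt;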

&lt;figure&gt;
  &lt;img src=&quot;https://kaousheik-26.github.io//assets/icml/sherlock_watson_teaser.png&quot; alt=&quot;Watson vs Sherlock prompt comparison&quot; /&gt;
  &lt;figcaption&gt;&lt;strong&gt;Figure 1.&lt;/strong&gt; Watson provides only explicit knowledge (clues received). Sherlock additionally provides a Deductive Context block (the per-card belief state) and enforces Bayesian-style step-by-step reasoning.&lt;/figcaption&gt;
&lt;/figure&gt;

&lt;h3 id=&quot;mycroft&quot;&gt;Mycroft&lt;/h3&gt;

&lt;p&gt;Mycroft removes the engine crutch. Each turn, the agent receives the previous turn’s game state, its own deductions for every player, move ratings, chosen action, and reasoning. It must then produce updated deductions, ratings, and an action. This forces the model to be its own Hanabi Learning Environment, tracking belief shifts and card position changes (cards slide left after a play or discard) across 60+ turns.&lt;/p&gt;

&lt;figure&gt;
  &lt;img src=&quot;https://kaousheik-26.github.io//assets/icml/mycroft_teaser.png&quot; alt=&quot;Mycroft scratch pad example&quot; /&gt;
  &lt;figcaption&gt;&lt;strong&gt;Figure 2.&lt;/strong&gt; A Mycroft turn from Player 1&apos;s perspective. The agent maintains an independent deduction block for every other player and must update card positions implicitly after plays and discards.&lt;/figcaption&gt;
&lt;/figure&gt;

&lt;h2 id=&quot;benchmark-results&quot;&gt;Benchmark results&lt;/h2&gt;

&lt;p&gt;We evaluate &lt;strong&gt;17 LLMs&lt;/strong&gt; (4B to 600B+, both reasoning and non-reasoning) across 2 to 5 player self-play, with 10 fixed seeds per configuration. Reasoning models clear ~13/25 in Watson; non-reasoning models mostly stall below 10/25.&lt;/p&gt;

&lt;div class=&quot;results-tabbed&quot; id=&quot;results-table&quot;&gt;

  &lt;!-- Left-side tabs --&gt;
  &lt;div class=&quot;results-tabs&quot;&gt;
    &lt;button class=&quot;results-tab active&quot; data-panel=&quot;panel-watson&quot;&gt;
      &lt;span class=&quot;tab-ord&quot;&gt;01 · Baseline&lt;/span&gt; Watson
    &lt;/button&gt;
    &lt;button class=&quot;results-tab&quot; data-panel=&quot;panel-sherlock&quot;&gt;
      &lt;span class=&quot;tab-ord&quot;&gt;02 · Scaffolded&lt;/span&gt; Sherlock
    &lt;/button&gt;
    &lt;button class=&quot;results-tab&quot; data-panel=&quot;panel-mycroft&quot;&gt;
      &lt;span class=&quot;tab-ord&quot;&gt;03 · Implicit&lt;/span&gt; Mycroft
    &lt;/button&gt;
  &lt;/div&gt;

  &lt;!-- Panels --&gt;
  &lt;div class=&quot;results-panels&quot;&gt;

    &lt;!-- ─── Watson ─── --&gt;
    &lt;div class=&quot;results-panel active&quot; id=&quot;panel-watson&quot;&gt;
      &lt;div class=&quot;results-legend&quot;&gt;
        &lt;div class=&quot;legend-item&quot;&gt;&lt;span class=&quot;swatch non-reasoning&quot;&gt;&lt;/span&gt; Non-reasoning&lt;/div&gt;
        &lt;div class=&quot;legend-item&quot;&gt;&lt;span class=&quot;swatch reasoning&quot;&gt;&lt;/span&gt; Reasoning&lt;/div&gt;
      &lt;/div&gt;
      &lt;table&gt;
        &lt;thead&gt;
          &lt;tr&gt;&lt;th&gt;Model&lt;/th&gt;&lt;th&gt;2-Player&lt;/th&gt;&lt;th&gt;3-Player&lt;/th&gt;&lt;th&gt;4-Player&lt;/th&gt;&lt;th&gt;5-Player&lt;/th&gt;&lt;/tr&gt;
        &lt;/thead&gt;
        &lt;tbody&gt;
          &lt;tr class=&quot;non-reasoning&quot;&gt;&lt;td&gt;Mistral Medium 3&lt;/td&gt;&lt;td&gt;2.2&lt;/td&gt;&lt;td&gt;1.9&lt;/td&gt;&lt;td&gt;1.7&lt;/td&gt;&lt;td&gt;1.2&lt;/td&gt;&lt;/tr&gt;
          &lt;tr class=&quot;non-reasoning&quot;&gt;&lt;td&gt;Gemini 2.0 Flash&lt;/td&gt;&lt;td&gt;4.5&lt;/td&gt;&lt;td&gt;3.7&lt;/td&gt;&lt;td&gt;3.3&lt;/td&gt;&lt;td&gt;3.6&lt;/td&gt;&lt;/tr&gt;
          &lt;tr class=&quot;non-reasoning&quot;&gt;&lt;td&gt;Llama-4 Maverick&lt;/td&gt;&lt;td&gt;3.8&lt;/td&gt;&lt;td&gt;4.4&lt;/td&gt;&lt;td&gt;5.9&lt;/td&gt;&lt;td&gt;4.8&lt;/td&gt;&lt;/tr&gt;
          &lt;tr class=&quot;non-reasoning&quot;&gt;&lt;td&gt;GPT-4o&lt;/td&gt;&lt;td&gt;5.3&lt;/td&gt;&lt;td&gt;4.6&lt;/td&gt;&lt;td&gt;5.3&lt;/td&gt;&lt;td&gt;4.9&lt;/td&gt;&lt;/tr&gt;
          &lt;tr class=&quot;non-reasoning&quot;&gt;&lt;td&gt;DeepSeek-V3&lt;/td&gt;&lt;td&gt;5.9&lt;/td&gt;&lt;td&gt;6.3&lt;/td&gt;&lt;td&gt;4.3&lt;/td&gt;&lt;td&gt;5.0&lt;/td&gt;&lt;/tr&gt;
          &lt;tr class=&quot;non-reasoning&quot;&gt;&lt;td&gt;GPT-4.1 mini&lt;/td&gt;&lt;td&gt;10.8&lt;/td&gt;&lt;td&gt;8.3&lt;/td&gt;&lt;td&gt;8.2&lt;/td&gt;&lt;td&gt;7.2&lt;/td&gt;&lt;/tr&gt;
          &lt;tr class=&quot;non-reasoning&quot;&gt;&lt;td&gt;Claude Sonnet 3.7&lt;/td&gt;&lt;td&gt;10.7&lt;/td&gt;&lt;td&gt;9.2&lt;/td&gt;&lt;td&gt;8.5&lt;/td&gt;&lt;td&gt;6.9&lt;/td&gt;&lt;/tr&gt;
          &lt;tr class=&quot;non-reasoning&quot;&gt;&lt;td&gt;Qwen-32B&lt;/td&gt;&lt;td&gt;9.9&lt;/td&gt;&lt;td&gt;9.0&lt;/td&gt;&lt;td&gt;8.8&lt;/td&gt;&lt;td&gt;9.2&lt;/td&gt;&lt;/tr&gt;
          &lt;tr class=&quot;non-reasoning&quot;&gt;&lt;td&gt;Grok-3&lt;/td&gt;&lt;td&gt;9.9&lt;/td&gt;&lt;td&gt;10.6&lt;/td&gt;&lt;td&gt;9.3&lt;/td&gt;&lt;td&gt;8.0&lt;/td&gt;&lt;/tr&gt;
          &lt;tr class=&quot;non-reasoning&quot;&gt;&lt;td&gt;GPT-4.1&lt;/td&gt;&lt;td&gt;12.1&lt;/td&gt;&lt;td&gt;11.8&lt;/td&gt;&lt;td&gt;10.0&lt;/td&gt;&lt;td&gt;8.2&lt;/td&gt;&lt;/tr&gt;
          &lt;tr class=&quot;reasoning&quot;&gt;&lt;td&gt;Gemini 2.5 Flash&lt;/td&gt;&lt;td&gt;12.8&lt;/td&gt;&lt;td&gt;13.8&lt;/td&gt;&lt;td&gt;13.0&lt;/td&gt;&lt;td&gt;12.7&lt;/td&gt;&lt;/tr&gt;
          &lt;tr class=&quot;reasoning&quot;&gt;&lt;td&gt;Gemini 2.5 Pro&lt;/td&gt;&lt;td&gt;13.2&lt;/td&gt;&lt;td&gt;13.9&lt;/td&gt;&lt;td&gt;12.9&lt;/td&gt;&lt;td&gt;12.9&lt;/td&gt;&lt;/tr&gt;
          &lt;tr class=&quot;reasoning&quot;&gt;&lt;td&gt;Qwen-235B-A22B&lt;/td&gt;&lt;td&gt;15.0&lt;/td&gt;&lt;td&gt;14.6&lt;/td&gt;&lt;td&gt;13.0&lt;/td&gt;&lt;td&gt;12.9&lt;/td&gt;&lt;/tr&gt;
          &lt;tr class=&quot;reasoning&quot;&gt;&lt;td&gt;Grok-3 Mini&lt;/td&gt;&lt;td&gt;14.2&lt;/td&gt;&lt;td&gt;13.9&lt;/td&gt;&lt;td&gt;14.5&lt;/td&gt;&lt;td&gt;14.8&lt;/td&gt;&lt;/tr&gt;
          &lt;tr class=&quot;reasoning&quot;&gt;&lt;td&gt;DeepSeek-R1&lt;/td&gt;&lt;td&gt;14.2&lt;/td&gt;&lt;td&gt;15.3&lt;/td&gt;&lt;td&gt;14.1&lt;/td&gt;&lt;td&gt;13.4&lt;/td&gt;&lt;/tr&gt;
          &lt;tr class=&quot;reasoning&quot;&gt;&lt;td&gt;o4-mini&lt;/td&gt;&lt;td&gt;15.0&lt;/td&gt;&lt;td&gt;15.5&lt;/td&gt;&lt;td&gt;14.5&lt;/td&gt;&lt;td&gt;13.9&lt;/td&gt;&lt;/tr&gt;
          &lt;tr class=&quot;reasoning&quot;&gt;&lt;td&gt;o3&lt;/td&gt;&lt;td&gt;15.9&lt;/td&gt;&lt;td&gt;15.3&lt;/td&gt;&lt;td&gt;16.4&lt;/td&gt;&lt;td&gt;13.9&lt;/td&gt;&lt;/tr&gt;
        &lt;/tbody&gt;
      &lt;/table&gt;
      &lt;div class=&quot;panel-note&quot;&gt;Average scores over 10 seeds per configuration.&lt;/div&gt;
    &lt;/div&gt;

    &lt;!-- ─── Sherlock ─── --&gt;
    &lt;div class=&quot;results-panel&quot; id=&quot;panel-sherlock&quot;&gt;
      &lt;div class=&quot;results-legend&quot;&gt;
        &lt;div class=&quot;legend-item&quot;&gt;&lt;span class=&quot;swatch non-reasoning&quot;&gt;&lt;/span&gt; Non-reasoning&lt;/div&gt;
        &lt;div class=&quot;legend-item&quot;&gt;&lt;span class=&quot;swatch reasoning&quot;&gt;&lt;/span&gt; Reasoning&lt;/div&gt;
      &lt;/div&gt;
      &lt;table&gt;
        &lt;thead&gt;
          &lt;tr&gt;&lt;th&gt;Model&lt;/th&gt;&lt;th&gt;2-Player&lt;/th&gt;&lt;th&gt;3-Player&lt;/th&gt;&lt;th&gt;4-Player&lt;/th&gt;&lt;th&gt;5-Player&lt;/th&gt;&lt;/tr&gt;
        &lt;/thead&gt;
        &lt;tbody&gt;
          &lt;tr class=&quot;non-reasoning&quot;&gt;&lt;td&gt;Mistral Medium 3&lt;/td&gt;&lt;td&gt;4.1&lt;/td&gt;&lt;td&gt;4.8&lt;/td&gt;&lt;td&gt;5.3&lt;/td&gt;&lt;td&gt;5.4&lt;/td&gt;&lt;/tr&gt;
          &lt;tr class=&quot;non-reasoning&quot;&gt;&lt;td&gt;Gemini 2.0 Flash&lt;/td&gt;&lt;td&gt;4.2&lt;/td&gt;&lt;td&gt;3.3&lt;/td&gt;&lt;td&gt;4.0&lt;/td&gt;&lt;td&gt;4.3&lt;/td&gt;&lt;/tr&gt;
          &lt;tr class=&quot;non-reasoning&quot;&gt;&lt;td&gt;Llama-4 Maverick&lt;/td&gt;&lt;td&gt;4.9&lt;/td&gt;&lt;td&gt;5.2&lt;/td&gt;&lt;td&gt;5.4&lt;/td&gt;&lt;td&gt;5.6&lt;/td&gt;&lt;/tr&gt;
          &lt;tr class=&quot;non-reasoning&quot;&gt;&lt;td&gt;GPT-4o&lt;/td&gt;&lt;td&gt;4.4&lt;/td&gt;&lt;td&gt;4.1&lt;/td&gt;&lt;td&gt;4.5&lt;/td&gt;&lt;td&gt;4.6&lt;/td&gt;&lt;/tr&gt;
          &lt;tr class=&quot;non-reasoning&quot;&gt;&lt;td&gt;DeepSeek-V3&lt;/td&gt;&lt;td&gt;3.9&lt;/td&gt;&lt;td&gt;4.2&lt;/td&gt;&lt;td&gt;5.4&lt;/td&gt;&lt;td&gt;5.8&lt;/td&gt;&lt;/tr&gt;
          &lt;tr class=&quot;non-reasoning&quot;&gt;&lt;td&gt;GPT-4.1 mini&lt;/td&gt;&lt;td&gt;6.5&lt;/td&gt;&lt;td&gt;6.1&lt;/td&gt;&lt;td&gt;5.1&lt;/td&gt;&lt;td&gt;5.8&lt;/td&gt;&lt;/tr&gt;
          &lt;tr class=&quot;non-reasoning&quot;&gt;&lt;td&gt;Claude Sonnet 3.7&lt;/td&gt;&lt;td&gt;5.4&lt;/td&gt;&lt;td&gt;5.4&lt;/td&gt;&lt;td&gt;5.4&lt;/td&gt;&lt;td&gt;5.6&lt;/td&gt;&lt;/tr&gt;
          &lt;tr class=&quot;non-reasoning&quot;&gt;&lt;td&gt;Qwen-32B&lt;/td&gt;&lt;td&gt;5.6&lt;/td&gt;&lt;td&gt;13.1&lt;/td&gt;&lt;td&gt;5.4&lt;/td&gt;&lt;td&gt;12.1&lt;/td&gt;&lt;/tr&gt;
          &lt;tr class=&quot;non-reasoning&quot;&gt;&lt;td&gt;Grok-3&lt;/td&gt;&lt;td&gt;12.8&lt;/td&gt;&lt;td&gt;8.0&lt;/td&gt;&lt;td&gt;13.3&lt;/td&gt;&lt;td&gt;5.6&lt;/td&gt;&lt;/tr&gt;
          &lt;tr class=&quot;non-reasoning&quot;&gt;&lt;td&gt;GPT-4.1&lt;/td&gt;&lt;td&gt;14.8&lt;/td&gt;&lt;td&gt;16.4&lt;/td&gt;&lt;td&gt;15.5&lt;/td&gt;&lt;td&gt;14.4&lt;/td&gt;&lt;/tr&gt;
          &lt;tr class=&quot;reasoning&quot;&gt;&lt;td&gt;Gemini 2.5 Flash&lt;/td&gt;&lt;td&gt;8.4&lt;/td&gt;&lt;td&gt;6.6&lt;/td&gt;&lt;td&gt;7.7&lt;/td&gt;&lt;td&gt;5.6&lt;/td&gt;&lt;/tr&gt;
          &lt;tr class=&quot;reasoning&quot;&gt;&lt;td&gt;Gemini 2.5 Pro&lt;/td&gt;&lt;td&gt;12.8&lt;/td&gt;&lt;td&gt;16.2&lt;/td&gt;&lt;td&gt;16.9&lt;/td&gt;&lt;td&gt;14.4&lt;/td&gt;&lt;/tr&gt;
          &lt;tr class=&quot;reasoning&quot;&gt;&lt;td&gt;Qwen-235B-A22B&lt;/td&gt;&lt;td&gt;14.6&lt;/td&gt;&lt;td&gt;16.6&lt;/td&gt;&lt;td&gt;16.7&lt;/td&gt;&lt;td&gt;13.3&lt;/td&gt;&lt;/tr&gt;
          &lt;tr class=&quot;reasoning&quot;&gt;&lt;td&gt;Grok-3 Mini&lt;/td&gt;&lt;td&gt;14.4&lt;/td&gt;&lt;td&gt;16.6&lt;/td&gt;&lt;td&gt;17.4&lt;/td&gt;&lt;td&gt;15.5&lt;/td&gt;&lt;/tr&gt;
          &lt;tr class=&quot;reasoning&quot;&gt;&lt;td&gt;DeepSeek-R1&lt;/td&gt;&lt;td&gt;17.5&lt;/td&gt;&lt;td&gt;16.6&lt;/td&gt;&lt;td&gt;15.6&lt;/td&gt;&lt;td&gt;15.1&lt;/td&gt;&lt;/tr&gt;
          &lt;tr class=&quot;reasoning&quot;&gt;&lt;td&gt;o4-mini&lt;/td&gt;&lt;td&gt;14.6&lt;/td&gt;&lt;td&gt;18.0&lt;/td&gt;&lt;td&gt;14.1&lt;/td&gt;&lt;td&gt;13.0&lt;/td&gt;&lt;/tr&gt;
          &lt;tr class=&quot;reasoning&quot;&gt;&lt;td&gt;o3&lt;/td&gt;&lt;td&gt;17.6&lt;/td&gt;&lt;td&gt;17.6&lt;/td&gt;&lt;td&gt;16.8&lt;/td&gt;&lt;td&gt;15.7&lt;/td&gt;&lt;/tr&gt;
        &lt;/tbody&gt;
      &lt;/table&gt;
      &lt;div class=&quot;panel-note&quot;&gt;Average scores over 10 seeds per configuration.&lt;/div&gt;
    &lt;/div&gt;

    &lt;!-- ─── Mycroft ─── --&gt;
    &lt;div class=&quot;results-panel&quot; id=&quot;panel-mycroft&quot;&gt;
      &lt;div class=&quot;results-legend&quot;&gt;
        &lt;div class=&quot;legend-item&quot;&gt;&lt;span class=&quot;swatch reasoning&quot;&gt;&lt;/span&gt; Reasoning (only)&lt;/div&gt;
      &lt;/div&gt;
      &lt;table&gt;
        &lt;thead&gt;
          &lt;tr&gt;&lt;th&gt;Model&lt;/th&gt;&lt;th&gt;2-Player&lt;/th&gt;&lt;th&gt;3-Player&lt;/th&gt;&lt;th&gt;4-Player&lt;/th&gt;&lt;th&gt;5-Player&lt;/th&gt;&lt;/tr&gt;
        &lt;/thead&gt;
        &lt;tbody&gt;
          &lt;tr class=&quot;reasoning&quot;&gt;&lt;td&gt;o4-mini&lt;/td&gt;&lt;td&gt;10.8&lt;/td&gt;&lt;td&gt;12.4&lt;/td&gt;&lt;td&gt;11.3&lt;/td&gt;&lt;td&gt;10.9&lt;/td&gt;&lt;/tr&gt;
          &lt;tr class=&quot;reasoning&quot;&gt;&lt;td&gt;Grok-3 Mini&lt;/td&gt;&lt;td&gt;14.2&lt;/td&gt;&lt;td&gt;16.5&lt;/td&gt;&lt;td&gt;14.5&lt;/td&gt;&lt;td&gt;14.4&lt;/td&gt;&lt;/tr&gt;
          &lt;tr class=&quot;reasoning&quot;&gt;&lt;td&gt;Gemini 2.5 Pro&lt;/td&gt;&lt;td&gt;10.2&lt;/td&gt;&lt;td&gt;13.4&lt;/td&gt;&lt;td&gt;14.1&lt;/td&gt;&lt;td&gt;11.6&lt;/td&gt;&lt;/tr&gt;
          &lt;tr class=&quot;reasoning&quot;&gt;&lt;td&gt;Gemini 2.5 Flash&lt;/td&gt;&lt;td&gt;11.8&lt;/td&gt;&lt;td&gt;13.2&lt;/td&gt;&lt;td&gt;12.3&lt;/td&gt;&lt;td&gt;9.8&lt;/td&gt;&lt;/tr&gt;
          &lt;tr class=&quot;reasoning&quot;&gt;&lt;td&gt;o3&lt;/td&gt;&lt;td&gt;16.3&lt;/td&gt;&lt;td&gt;16.4&lt;/td&gt;&lt;td&gt;15.5&lt;/td&gt;&lt;td&gt;14.7&lt;/td&gt;&lt;/tr&gt;
        &lt;/tbody&gt;
      &lt;/table&gt;
      &lt;div class=&quot;panel-note&quot;&gt;Mycroft evaluated on the top 5 reasoning models only. Average scores over 10 seeds.&lt;/div&gt;
    &lt;/div&gt;

  &lt;/div&gt;
&lt;/div&gt;

&lt;figcaption style=&quot;margin-top: -0.5rem; font-size: 0.8rem; color: var(--text3); line-height: 1.5;&quot;&gt;&lt;strong style=&quot;color: var(--text2);&quot;&gt;Table 2.&lt;/strong&gt; Average scores (out of 25) across all three scaffolds. Watson provides minimal context; Sherlock adds deductive beliefs; Mycroft requires fully implicit state tracking.&lt;/figcaption&gt;

&lt;h2 id=&quot;ablations&quot;&gt;Ablations&lt;/h2&gt;

&lt;h3 id=&quot;cross-play&quot;&gt;Cross-play&lt;/h3&gt;

&lt;p&gt;Self-play is generous; real cooperation is ad hoc. We compose teams with one Grok-3-mini agent and the rest o4-mini, the weaker of the two in Mycroft (11.3 vs 14.9 average). Across all 2 to 5 player settings, &lt;strong&gt;adding one stronger agent lifts team scores by ~1.7 points&lt;/strong&gt;. Performance interpolates smoothly between the weak and strong self-play baselines, unlike specialized RL agents, which collapse with unfamiliar partners.&lt;/p&gt;

&lt;figure&gt;
  &lt;img src=&quot;https://kaousheik-26.github.io//assets/icml/cross_play.png&quot; alt=&quot;Cross-play interpolation&quot; /&gt;
  &lt;figcaption&gt;&lt;strong&gt;Figure 8.&lt;/strong&gt; Mixed teams score between weak (all o4-mini) and strong (all Grok-3-mini) self-play, demonstrating that LLM agents cooperate gracefully with unfamiliar partners, in meaningful contrast with traditional self-play RL.&lt;/figcaption&gt;
&lt;/figure&gt;

&lt;h3 id=&quot;best-of-k&quot;&gt;Best-of-K&lt;/h3&gt;

&lt;p&gt;Sample the agent K times per turn and ask it to pick its best candidate. With Watson, performance climbs through K=5 (+1.5 on average) then plateaus. With Sherlock, gains are negligible (+0.1) because a well-engineered prompt mostly converges to the same action across samples, so naive scaling does not help. &lt;strong&gt;Better context beats more samples.&lt;/strong&gt;&lt;/p&gt;
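
&lt;p&gt;A minimal sketch of the Best-of-K loop we run each turn; the &lt;code&gt;llm&lt;/code&gt; callable and the selection-prompt wording below are placeholders, not the exact prompts from the paper.&lt;/p&gt;

&lt;pre&gt;# Sample K candidate moves, then let the same model pick among them.
# `llm` is a hypothetical callable mapping a prompt string to a completion.
def best_of_k(llm, turn_prompt, k=5):
    candidates = [llm(turn_prompt) for _ in range(k)]
    menu = &quot;\n&quot;.join(f&quot;{i + 1}. {c}&quot; for i, c in enumerate(candidates))
    pick = llm(turn_prompt + &quot;\n\nCandidate moves:\n&quot; + menu
               + &quot;\nReply with the number of the best move.&quot;)
    try:
        return candidates[int(pick.strip()) - 1]
    except (ValueError, IndexError):
        return candidates[0]  # fall back to the first sample on a malformed reply&lt;/pre&gt;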

&lt;h3 id=&quot;moa&quot;&gt;Mixture of agents&lt;/h3&gt;

&lt;div class=&quot;text-img-row&quot;&gt;
  &lt;div class=&quot;text-side&quot;&gt;
    &lt;p&gt;To break sample homogeneity, we run five role-specialized agents in parallel (Baseline, Rank-Focused, Analyst, Discard Strategist, History Analyst) and aggregate their proposals via a sixth &quot;finalizer&quot; agent.&lt;/p&gt;
    &lt;p&gt;MoA modestly improves the 5-player setting (+1.1 with Watson, +0.8 with Sherlock over Best-of-5) but introduces high variance: speculative agents (especially the History Analyst) occasionally mislead the aggregator and tank a run. Diversity helps when it lands; reliability remains the open problem.&lt;/p&gt;
  &lt;/div&gt;
  &lt;div class=&quot;img-side&quot;&gt;
    &lt;img src=&quot;https://kaousheik-26.github.io//assets/icml/moa.png&quot; alt=&quot;Mixture of Agents architecture&quot; /&gt;
  &lt;/div&gt;
&lt;/div&gt;

&lt;h2 id=&quot;post-training&quot;&gt;Post-training: closing the gap with a 4B model&lt;/h2&gt;

&lt;p&gt;To validate our datasets, we post-train &lt;strong&gt;Qwen3-4B-Instruct-2507&lt;/strong&gt;, a small non-reasoning model, on data collected from o3 and Grok-3-mini.&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;&lt;strong&gt;HanabiLogs&lt;/strong&gt; (1,520+ trajectories): used for supervised finetuning.&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;HanabiRewards&lt;/strong&gt; (560+ games with dense move-level utility annotations): used for RLVR via GRPO (sketched after this list).&lt;/li&gt;
&lt;/ul&gt;
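
&lt;p&gt;The heart of GRPO is a group-relative advantage: sample several candidate responses for the same prompt, score each with the move-level reward from HanabiRewards, and normalize within the group. A minimal sketch of just that step (the sampling loop and the clipped policy-gradient update are omitted, and the reward values are placeholders):&lt;/p&gt;

&lt;pre&gt;import statistics

# Normalize rewards within a group of sampled responses for one prompt.
# A zero-variance group gets a unit denominator to avoid dividing by zero.
def group_relative_advantages(rewards):
    mean = statistics.mean(rewards)
    std = statistics.pstdev(rewards) or 1.0
    return [(r - mean) / std for r in rewards]

print(group_relative_advantages([0.2, 0.9, 0.4, 0.7]))&lt;/pre&gt;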

&lt;p&gt;The base model scores 1.7 in Mycroft. After RL on HanabiRewards it reaches &lt;strong&gt;8.3&lt;/strong&gt;, a +388% jump that lands within ~3 points of o4-mini (11.3) and surpasses GPT-4.1 (the best non-reasoning baseline) by +88%. In Sherlock, the same model jumps from 4.8 to 12.3 (+156%), comparable to Grok-3 and beating GPT-4o.&lt;/p&gt;

&lt;figure&gt;
  &lt;div class=&quot;img-row&quot; style=&quot;margin: 0;&quot;&gt;
    &lt;img src=&quot;https://kaousheik-26.github.io//assets/icml/Sherlock_finetune.png&quot; alt=&quot;Sherlock post-training results&quot; /&gt;
    &lt;img src=&quot;https://kaousheik-26.github.io//assets/icml/mycroft_finetuned.png&quot; alt=&quot;Mycroft post-training results&quot; /&gt;
  &lt;/div&gt;
  &lt;figcaption&gt;&lt;strong&gt;Figure 9.&lt;/strong&gt; Qwen3-4B before and after supervised finetuning (Ours-SFT) and RLVR (Ours-RL), versus larger proprietary models. Evaluated on held-out seeds to avoid leakage.&lt;/figcaption&gt;
&lt;/figure&gt;

&lt;h3 id=&quot;generalization&quot;&gt;Generalization beyond Hanabi&lt;/h3&gt;

&lt;p&gt;The interesting result isn’t just “we got better at Hanabi.” Training on HanabiRewards transfers to out-of-domain benchmarks: cooperative group guessing, long-context temporal reasoning, and instruction following all improve, with no degradation on math.&lt;/p&gt;

&lt;div class=&quot;table-wrap&quot;&gt;
  &lt;table&gt;
    &lt;caption&gt;Table 1. Qwen3-4B base vs. our RL-finetuned model. Group Guessing is wins/200 games (cooperative); EventQA is 6-way MCQ accuracy at increasing context lengths (temporal reasoning); IFBench is strict instruction-following; AIME 2025 measures math reasoning.&lt;/caption&gt;
    &lt;thead&gt;
      &lt;tr&gt;
        &lt;th&gt;Model&lt;/th&gt;
        &lt;th&gt;Group Guess&lt;br /&gt;&lt;span style=&quot;font-weight:400;font-size:0.7rem;&quot;&gt;(1st / 2nd run)&lt;/span&gt;&lt;/th&gt;
        &lt;th&gt;EventQA&lt;br /&gt;&lt;span style=&quot;font-weight:400;font-size:0.7rem;&quot;&gt;(64K / 128K / 800K)&lt;/span&gt;&lt;/th&gt;
        &lt;th&gt;IFBench&lt;br /&gt;&lt;span style=&quot;font-weight:400;font-size:0.7rem;&quot;&gt;(Avg / Pass@10)&lt;/span&gt;&lt;/th&gt;
        &lt;th&gt;AIME 2025&lt;br /&gt;&lt;span style=&quot;font-weight:400;font-size:0.7rem;&quot;&gt;(Avg / Pass@10)&lt;/span&gt;&lt;/th&gt;
      &lt;/tr&gt;
    &lt;/thead&gt;
    &lt;tbody&gt;
      &lt;tr&gt;
        &lt;td&gt;Base&lt;/td&gt;
        &lt;td class=&quot;num&quot;&gt;61.0 / 60.5&lt;/td&gt;
        &lt;td class=&quot;num&quot;&gt;84.0 / 62.6 / 37.2&lt;/td&gt;
        &lt;td class=&quot;num&quot;&gt;30.9 / 42.9&lt;/td&gt;
        &lt;td class=&quot;num&quot;&gt;48.7 / 73.3&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
        &lt;td&gt;&lt;strong&gt;Ours-RL&lt;/strong&gt;&lt;/td&gt;
        &lt;td class=&quot;num delta-pos&quot;&gt;73.0 / 71.5&lt;/td&gt;
        &lt;td class=&quot;num delta-pos&quot;&gt;85.6 / 66.8 / 43.6&lt;/td&gt;
        &lt;td class=&quot;num delta-pos&quot;&gt;31.5 / 44.6&lt;/td&gt;
        &lt;td class=&quot;num&quot;&gt;50.0 / 73.3&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
        &lt;td&gt;Δ&lt;/td&gt;
        &lt;td class=&quot;num delta-pos&quot;&gt;+12.0 / +11.0&lt;/td&gt;
        &lt;td class=&quot;num delta-pos&quot;&gt;+1.6 / +4.2 / +6.4&lt;/td&gt;
        &lt;td class=&quot;num delta-pos&quot;&gt;+0.6 / +1.7&lt;/td&gt;
        &lt;td class=&quot;num delta-neutral&quot;&gt;+1.3 / +0.0&lt;/td&gt;
      &lt;/tr&gt;
    &lt;/tbody&gt;
  &lt;/table&gt;
&lt;/div&gt;

&lt;p&gt;The temporal-reasoning lift on EventQA grows with context length (+1.6, +4.2, +6.4 from 64K to 800K), which we read as evidence that learning to implicitly track Hanabi state generalizes to long-horizon belief tracking elsewhere. AIME stays flat, with no catastrophic forgetting on math.&lt;/p&gt;

&lt;h2 id=&quot;takeaways&quot;&gt;Takeaways&lt;/h2&gt;

&lt;ol&gt;
  &lt;li&gt;&lt;strong&gt;Modern reasoning LLMs show sparks of cooperative reasoning, but reliable multi-agent coordination remains unsolved.&lt;/strong&gt; The best score ~15 to 18/25 in self-play, comfortably below specialized agents (&amp;gt;23) and the median human Hanabi player (~18 to 21).&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Scaffold design matters more than model scale.&lt;/strong&gt; Moving from Watson to Sherlock improves reasoning models by +2.0 on average; the same scaffold &lt;em&gt;hurts&lt;/em&gt; most non-reasoning models. Different families respond differently to identical context.&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Implicit state tracking is the open problem.&lt;/strong&gt; Even o3 drops 1.2 points moving from engine-provided deductions to self-tracking; Gemini 2.5 Pro drops 3.7. Multi-turn belief maintenance is where current models break.&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Cross-play is graceful.&lt;/strong&gt; Unlike specialized RL agents, LLMs interpolate smoothly between weak and strong teammates, showing a small but real “spark” of cooperative generalization.&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;A 4B model can carry surprising weight.&lt;/strong&gt; Post-training on our datasets closes most of the gap to frontier reasoning models on Hanabi &lt;em&gt;and&lt;/em&gt; transfers to temporal reasoning, instruction following, and out-of-domain cooperation.&lt;/li&gt;
&lt;/ol&gt;

&lt;h2 id=&quot;citation&quot;&gt;Citation&lt;/h2&gt;

&lt;div class=&quot;citation-block&quot;&gt;
  &lt;div class=&quot;citation-header&quot;&gt;
    &lt;span class=&quot;lbl&quot;&gt;BibTeX&lt;/span&gt;
    &lt;button&gt;&lt;span&gt;Copy&lt;/span&gt;&lt;/button&gt;
  &lt;/div&gt;
&lt;pre&gt;@misc{ramesh2026sparkscooperativereasoningllms,
      title={Sparks of Cooperative Reasoning: LLMs as Strategic Hanabi Agents}, 
      author={Mahesh Ramesh and Kaousheik Jayakumar and Aswinkumar Ramkumar and Pavan Thodima and Aniket Rege and Emmanouil-Vasileios Vlatakis-Gkaragkounis},
      year={2026},
      eprint={2601.18077},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2601.18077}, 
}&lt;/pre&gt;
&lt;/div&gt;
</description>
        <pubDate>Sat, 21 Mar 2026 00:00:00 +0000</pubDate>
        <link>https://kaousheik-26.github.io//llms-hanabi-cooperative-reasoning/</link>
        <guid isPermaLink="true">https://kaousheik-26.github.io//llms-hanabi-cooperative-reasoning/</guid>
        
        
      </item>
    
      <item>
        <title>Where Does the Sound Go? Probing Audio-Visual Language Models for Modality Bias</title>
        <description>&lt;h2 id=&quot;motivation&quot;&gt;Motivation&lt;/h2&gt;

&lt;p&gt;A new generation of audio-visual large language models — Gemini, GPT-4o’s audio mode, Qwen2-Audio, and several open-weights successors — markets itself as &lt;em&gt;truly multimodal&lt;/em&gt;: pass in a video clip and they will answer questions about both what’s on the screen and what’s on the soundtrack. The marketing demos are compelling. A car horn off-screen, a dog barking behind the camera, a violinist tuning before the visual cut — these are exactly the kinds of cases where audio carries information vision can’t.&lt;/p&gt;

&lt;p&gt;But how much of the answer actually comes from the audio? When the model says “the woman is playing a violin,” is that because it heard the bowing or because it saw the instrument? When you swap the soundtrack for white noise, does the answer change at all?&lt;/p&gt;

&lt;p&gt;This work probes audio-visual LLMs for &lt;strong&gt;modality bias&lt;/strong&gt; — specifically, how much weight the model genuinely places on the audio stream when both modalities are available. We find a sharp and consistent pattern: across four open-weights AV-LLMs and three closed APIs, &lt;strong&gt;vision dominates the prediction in roughly 87% of cases where audio and vision disagree&lt;/strong&gt;. We then trace where in the network the audio signal gets attenuated, and find the bottleneck is concentrated in the cross-modal projection layers, not in the audio encoder itself.&lt;/p&gt;

&lt;h2 id=&quot;qualitative-examples&quot;&gt;Qualitative examples&lt;/h2&gt;

&lt;p&gt;Before the numbers, a flavor of what we mean. Consider a clip of someone slicing a cucumber on a wooden cutting board. The visual is unambiguous — knife, cucumber, board. The audio is the percussive &lt;em&gt;thock-thock&lt;/em&gt; of blade on wood. Now we replace the audio with the sound of a violin tuning, and ask the model: &lt;em&gt;“What is happening in this video?”&lt;/em&gt; Every model we tested answered some variant of “a person is slicing a cucumber” — the violin sound was completely ignored.&lt;/p&gt;

&lt;p&gt;The mirror experiment is equally telling. We take a clip of a violinist mid-performance, mute the violin, and dub in cucumber-chopping audio. The models still describe a violinist. Vision wins both times. The audio stream might as well not exist for these examples.&lt;/p&gt;

&lt;p&gt;The full paper has a gallery of around 60 such pairs, organized by the type of audio-visual conflict (object identity, action, environment, speaker characteristics). The pattern is remarkably consistent across model families.&lt;/p&gt;

&lt;h2 id=&quot;how-we-study-this&quot;&gt;How we study this&lt;/h2&gt;

&lt;h3 id=&quot;counterfactual-probes&quot;&gt;Counterfactual probes&lt;/h3&gt;

&lt;p&gt;To measure modality reliance directly, we built a dataset of &lt;strong&gt;2,400 counterfactual video pairs&lt;/strong&gt;. Each pair shares one modality and swaps the other: same video, two soundtracks; or same soundtrack, two videos. We then ask each model the same open-ended question about the clip and measure how often its answer flips when we swap the audio versus when we swap the video.&lt;/p&gt;

&lt;p&gt;A model that genuinely fuses both modalities should produce different answers for the two audio conditions in cases where the audio is informative. A model that ignores audio will produce identical answers regardless of what’s on the soundtrack. The ratio of these two flip rates is what we call the &lt;strong&gt;modality reliance ratio&lt;/strong&gt;, and across all seven models we tested, it’s heavily skewed toward vision: typical values land between 0.08 and 0.15, meaning audio swaps change the answer roughly an order of magnitude less often than video swaps do.&lt;/p&gt;
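
&lt;p&gt;As a sketch of the metric, the ratio is just the audio-swap flip rate divided by the video-swap flip rate. The &lt;code&gt;answer&lt;/code&gt; helper, the pair format, and the exact-string flip check below are simplifying assumptions (the paper judges open-ended answers for a semantic flip rather than comparing strings).&lt;/p&gt;

&lt;pre&gt;# Flip rate over counterfactual pairs: each pair holds two (video, audio)
# clips that differ in exactly one modality. `answer` is a hypothetical helper
# that queries the model with a fixed open-ended question.
def flip_rate(model, pairs, question=&quot;What is happening in this video?&quot;):
    flips = sum(answer(model, v1, a1, question) != answer(model, v2, a2, question)
                for (v1, a1), (v2, a2) in pairs)
    return flips / len(pairs)

# Usage with a loaded model and the two pair sets (assumed names):
# audio_pairs: same video, two soundtracks; video_pairs: same soundtrack, two videos.
# reliance_ratio = flip_rate(model, audio_pairs) / flip_rate(model, video_pairs)&lt;/pre&gt;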

&lt;h3 id=&quot;layer-wise-attribution&quot;&gt;Layer-wise attribution&lt;/h3&gt;

&lt;p&gt;The flip-rate experiments tell us &lt;em&gt;that&lt;/em&gt; vision dominates. To understand &lt;em&gt;where&lt;/em&gt; the audio signal gets dropped, we run gradient-based attribution at every layer of the model, following the audio token contributions from the audio encoder all the way through the cross-modal projector and into the language model’s residual stream.&lt;/p&gt;

&lt;p&gt;The picture that emerges is striking. Inside the audio encoder, audio tokens carry meaningful, distinguishable representations — different sounds produce different embeddings, and a linear probe can recover the underlying class with high accuracy. The information is there. But once those tokens pass through the cross-modal projection layer that maps them into the language model’s embedding space, their gradient contribution to the final answer drops by roughly 70%. By the time the signal reaches the LLM’s middle layers, audio tokens are contributing less than 5% of the residual stream norm at the answer position.&lt;/p&gt;

&lt;p&gt;In other words: the audio encoder is doing its job. The language model is mostly ignoring its output.&lt;/p&gt;
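
&lt;p&gt;For readers who want the mechanical recipe, below is a simplified gradient-times-input version of the attribution. The HuggingFace-style &lt;code&gt;inputs_embeds&lt;/code&gt; interface and the modality index lists are assumptions about the model wrapper, and the paper’s layer-wise analysis is richer than this single-pass sketch.&lt;/p&gt;

&lt;pre&gt;import torch

# Gradient-x-input attribution of one answer token to audio vs. vision tokens.
# Assumes a causal LM that accepts inputs_embeds of shape (1, T, d) and returns
# logits of shape (1, T, vocab); audio_idx / vision_idx list token positions.
def modality_attribution(model, inputs_embeds, answer_pos, answer_token_id,
                         audio_idx, vision_idx):
    inputs_embeds = inputs_embeds.clone().detach().requires_grad_(True)
    logits = model(inputs_embeds=inputs_embeds).logits
    logits[0, answer_pos, answer_token_id].backward()
    per_token = (inputs_embeds.grad * inputs_embeds).sum(dim=-1).abs()[0]  # (T,)
    return per_token[audio_idx].sum().item(), per_token[vision_idx].sum().item()&lt;/pre&gt;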

&lt;h2 id=&quot;findings&quot;&gt;Findings&lt;/h2&gt;

&lt;h3 id=&quot;does-the-model-pay-attention-to-audio&quot;&gt;Does the model pay attention to audio&lt;/h3&gt;

&lt;p&gt;Attention rollout from the answer token back to the input shows that audio tokens receive between 2% and 8% of total attention mass across the seven models we tested, while vision tokens receive between 60% and 80%. This is roughly proportional to token count (vision contributes more tokens), but per-token attention is still 2–3× higher for vision tokens. The model is not weighting the modalities equally.&lt;/p&gt;
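
&lt;p&gt;A simplified version of this measurement, using mean attention rather than full attention rollout; the tensor shapes follow the HuggingFace &lt;code&gt;output_attentions=True&lt;/code&gt; convention, which is an assumption about the wrapper.&lt;/p&gt;

&lt;pre&gt;import torch

# attentions: list of (1, heads, T, T) tensors from one forward pass.
# audio_idx / vision_idx: lists of input token positions for each modality.
def modality_attention_share(attentions, answer_pos, audio_idx, vision_idx):
    att = torch.stack(attentions).mean(dim=(0, 2))[0, answer_pos]  # (T,) row
    audio_mass, vision_mass = att[audio_idx].sum(), att[vision_idx].sum()
    return {
        &quot;audio_total&quot;: audio_mass.item(),
        &quot;vision_total&quot;: vision_mass.item(),
        &quot;audio_per_token&quot;: (audio_mass / len(audio_idx)).item(),
        &quot;vision_per_token&quot;: (vision_mass / len(vision_idx)).item(),
    }&lt;/pre&gt;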

&lt;h3 id=&quot;are-audio-representations-meaningful&quot;&gt;Are audio representations meaningful&lt;/h3&gt;

&lt;p&gt;A natural worry: maybe the audio encoder is just bad. We rule this out with linear probes. A simple linear classifier trained on the audio encoder’s output can distinguish 80+ environmental sound classes with above-90% accuracy, and can identify speaker gender with above-95%. The representations are rich and well-separated. The bottleneck is downstream.&lt;/p&gt;
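
&lt;p&gt;The probe itself is nothing exotic. A sketch with scikit-learn, assuming the pooled audio-encoder embeddings and class labels have already been extracted into arrays:&lt;/p&gt;

&lt;pre&gt;from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# embeddings: (N, d) pooled audio-encoder outputs; labels: (N,) sound-class ids.
def probe_accuracy(embeddings, labels):
    X_train, X_test, y_train, y_test = train_test_split(
        embeddings, labels, test_size=0.2, random_state=0, stratify=labels)
    probe = LogisticRegression(max_iter=2000).fit(X_train, y_train)
    return probe.score(X_test, y_test)&lt;/pre&gt;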

&lt;h3 id=&quot;how-does-cross-modal-information-flow&quot;&gt;How does cross-modal information flow&lt;/h3&gt;

&lt;p&gt;Following audio tokens through the cross-modal projector, we observe a dramatic compression. The cosine similarity between input audio embeddings and the projected versions used by the LLM drops to around 0.3 — meaning the projector is largely overwriting the audio encoder’s structure with whatever the LLM expects to receive. We hypothesize this is because the projector was trained on vision-heavy data and learned a mapping that’s effectively a noise channel for audio.&lt;/p&gt;
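
&lt;p&gt;The measurement itself is a per-token cosine similarity averaged over a clip. The sketch below assumes the encoder outputs and the projected tokens share a dimension; when the projector changes dimensionality, an alignment step would be needed before comparing.&lt;/p&gt;

&lt;pre&gt;import torch.nn.functional as F

# audio_tokens: (num_audio_tokens, d) outputs of the audio encoder for a clip.
# projected_tokens: the same tokens after the cross-modal projector, same shape.
def mean_projection_similarity(audio_tokens, projected_tokens):
    return F.cosine_similarity(audio_tokens, projected_tokens, dim=-1).mean().item()&lt;/pre&gt;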

&lt;h3 id=&quot;where-does-the-vision-bias-originate&quot;&gt;Where does the vision bias originate&lt;/h3&gt;

&lt;p&gt;To pin down whether the bias is learned or architectural, we re-trained the cross-modal projector on a balanced audio-visual dataset where audio is the only informative signal in 50% of examples. The bias drops substantially: modality reliance ratio rises from ~0.12 to ~0.41 after just a few thousand fine-tuning steps. The vision bias is not architectural — it’s a training-data artifact that can be partially undone with targeted data, but only by deliberately oversampling audio-critical examples.&lt;/p&gt;
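
&lt;p&gt;Mechanically, the intervention only touches the projector weights. A minimal sketch of that setup, where the module name is a guess that depends on the specific checkpoint:&lt;/p&gt;

&lt;pre&gt;import torch

# Freeze everything except the cross-modal projector before the rebalancing
# fine-tune. &quot;audio_projector&quot; is a hypothetical module name; the real
# attribute differs per model family.
def projector_only_optimizer(model, lr=1e-4):
    for name, param in model.named_parameters():
        param.requires_grad = &quot;audio_projector&quot; in name
    trainable = [p for p in model.parameters() if p.requires_grad]
    return torch.optim.AdamW(trainable, lr=lr)&lt;/pre&gt;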

&lt;h2 id=&quot;takeaways&quot;&gt;Takeaways&lt;/h2&gt;

&lt;p&gt;Two things to walk away with. First, “multimodal” is doing a lot of unverified work in current AV-LLM marketing. These models &lt;em&gt;can&lt;/em&gt; process audio, but they mostly &lt;em&gt;don’t&lt;/em&gt;. Anyone deploying them in settings where audio is safety-critical — accessibility tools for blind users, audio-based anomaly detection, anything where the soundtrack carries information the visuals don’t — should test for this bias before trusting the output.&lt;/p&gt;

&lt;p&gt;Second, the bias is fixable. The audio encoder is competent. The projector is the bottleneck. Targeted fine-tuning on audio-critical examples meaningfully shifts the modality reliance ratio. There’s no architectural reason these models have to be effectively deaf to the soundtrack; we just trained them that way.&lt;/p&gt;

&lt;h2 id=&quot;citation&quot;&gt;Citation&lt;/h2&gt;

&lt;div class=&quot;citation-block&quot;&gt;
&lt;span class=&quot;cite-label&quot;&gt;BibTeX&lt;/span&gt;
&lt;pre&gt;@article{jayakumar2026where,
  title={Where Does the Sound Go? Probing Audio-Visual Language Models for Modality Bias},
  author={Jayakumar, Kaousheik and Rege, Aniket and Ramesh, Mahesh},
  journal={arXiv preprint},
  year={2026}
}&lt;/pre&gt;
&lt;/div&gt;
</description>
        <pubDate>Sat, 21 Mar 2026 00:00:00 +0000</pubDate>
        <link>https://kaousheik-26.github.io//2026/03/21/audio-visual-interpretability.html</link>
        <guid isPermaLink="true">https://kaousheik-26.github.io//2026/03/21/audio-visual-interpretability.html</guid>
        
        
      </item>
    
  </channel>
</rss>
