<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom">
  <channel>
    <title>Kaousheik Jayakumar</title>
    <description></description>
    <link>https://kaousheik-26.github.io//</link>
    <atom:link href="https://kaousheik-26.github.io//feed.xml" rel="self" type="application/rss+xml"/>
    <pubDate>Wed, 13 May 2026 01:09:06 +0000</pubDate>
    <lastBuildDate>Wed, 13 May 2026 01:09:06 +0000</lastBuildDate>
    <generator>Jekyll v3.10.0</generator>
    
      <item>
        <title>Sparks of Cooperative Reasoning: LLMs as Strategic Hanabi Agents</title>
        <description>&lt;div class=&quot;contributions-box&quot; style=&quot;background: var(--bg-elevated); border-left: 3px solid var(--accent); padding: 1.25rem 1.5rem; border-radius: 6px; margin-bottom: 2rem; box-shadow: var(--shadow-sm); border-top: 1px solid var(--border); border-right: 1px solid var(--border); border-bottom: 1px solid var(--border);&quot;&gt;
  &lt;h4 style=&quot;margin-top: 0; color: var(--accent); font-family: &apos;JetBrains Mono&apos;, monospace; font-size: 0.75rem; text-transform: uppercase; letter-spacing: 0.1em; margin-bottom: 0.75rem;&quot;&gt;Contributions&lt;/h4&gt;
  &lt;ol style=&quot;margin-bottom: 0; padding-left: 1.25rem;&quot;&gt;
    &lt;li style=&quot;margin-bottom: 0.4rem;&quot;&gt;Benchmark 17 LLMs as Hanabi agents across 2–5 player settings.&lt;/li&gt;
    &lt;li style=&quot;margin-bottom: 0.4rem;&quot;&gt;Introduce Mycroft, a scaffold for implicit multi-turn state tracking.&lt;/li&gt;
    &lt;li style=&quot;margin-bottom: 0.4rem;&quot;&gt;Study self-play, cross-play, best-of-K, and mixture-of-agent settings.&lt;/li&gt;
    &lt;li style=&quot;margin-bottom: 0.4rem;&quot;&gt;Release trajectories and move-rated data for SFT/RL training.&lt;/li&gt;
    &lt;li&gt;Post-train Qwen3-4B and show gains in Hanabi and transfer tasks.&lt;/li&gt;
  &lt;/ol&gt;
&lt;/div&gt;

&lt;div class=&quot;dataset-card&quot; style=&quot;background: var(--bg-elevated); border: 1px solid var(--border); padding: 1.25rem 1.5rem; border-radius: 6px; margin-bottom: 2.5rem; box-shadow: var(--shadow-sm);&quot;&gt;
  &lt;h4 style=&quot;margin-top: 0; font-family: &apos;JetBrains Mono&apos;, monospace; font-size: 0.75rem; text-transform: uppercase; letter-spacing: 0.1em; margin-bottom: 0.75rem;&quot;&gt;&lt;span style=&quot;margin-right: 6px;&quot;&gt;🗂️&lt;/span&gt; Released data&lt;/h4&gt;
  &lt;ul style=&quot;margin-bottom: 0; padding-left: 1.25rem; list-style-type: none;&quot;&gt;
    &lt;li style=&quot;margin-bottom: 0.4rem; position: relative;&quot;&gt;&lt;span style=&quot;position: absolute; left: -1.25rem; color: var(--accent);&quot;&gt;•&lt;/span&gt; &lt;strong&gt;HanabiLogs:&lt;/strong&gt; LLM gameplay trajectories for SFT&lt;/li&gt;
    &lt;li style=&quot;margin-bottom: 0.4rem; position: relative;&quot;&gt;&lt;span style=&quot;position: absolute; left: -1.25rem; color: var(--accent);&quot;&gt;•&lt;/span&gt; &lt;strong&gt;HanabiRewards:&lt;/strong&gt; move-level ratings / judge scores for RL-style training&lt;/li&gt;
    &lt;li style=&quot;position: relative;&quot;&gt;&lt;span style=&quot;position: absolute; left: -1.25rem; color: var(--accent);&quot;&gt;•&lt;/span&gt; Models include o3, Gemini 2.5 Pro, o4-mini, Grok, DeepSeek, Qwen, and others.&lt;/li&gt;
  &lt;/ul&gt;
&lt;/div&gt;

&lt;h2 id=&quot;why-hanabi&quot;&gt;Why Hanabi&lt;/h2&gt;

&lt;p&gt;Cooperative coordination under partial information is the part of intelligence that single-agent benchmarks miss. &lt;strong&gt;Hanabi&lt;/strong&gt; is the canonical testbed: 2 to 5 players hold their cards facing outward, visible to everyone but themselves, and must build five “fireworks” (one stack per color, played in ascending rank order) using only color or rank hints, each of which spends a token from a finite pool. Success requires tracking hidden information, inferring teammate intent, and coordinating through sparse signals.&lt;/p&gt;

&lt;p&gt;Specialized RL agents reach ~24/25 in 2-player self-play but degrade sharply with more players or unfamiliar partners. We ask a different question: &lt;strong&gt;how good are general-purpose LLMs as cooperative agents, and what limits them?&lt;/strong&gt;&lt;/p&gt;

&lt;h2 id=&quot;three-scaffolds&quot;&gt;Three scaffolds&lt;/h2&gt;

&lt;p&gt;We progressively scale the context an agent receives, from minimal state to engine-provided deductions to fully implicit multi-turn state tracking. Each scaffold isolates a different capability.&lt;/p&gt;

&lt;div class=&quot;settings-grid&quot;&gt;
  &lt;div class=&quot;setting-card&quot;&gt;
    &lt;div class=&quot;ord&quot;&gt;01 · BASELINE&lt;/div&gt;
    &lt;h4&gt;Watson&lt;/h4&gt;
    &lt;p&gt;Minimal context: game state, visible hands, and explicit knowledge from clues. Nothing else. This establishes a lower bound on what LLMs can do without scaffolding.&lt;/p&gt;
  &lt;/div&gt;
  &lt;div class=&quot;setting-card&quot;&gt;
    &lt;div class=&quot;ord&quot;&gt;02 · SCAFFOLDED&lt;/div&gt;
    &lt;h4&gt;Sherlock&lt;/h4&gt;
    &lt;p&gt;Adds engine-computed deductive context (per-card &quot;could be&quot; possibilities), Hanabi strategy notes, and a Bayesian step-by-step prompt. Establishes an upper bound with rich prefill.&lt;/p&gt;
  &lt;/div&gt;
  &lt;div class=&quot;setting-card&quot;&gt;
    &lt;div class=&quot;ord&quot;&gt;03 · IMPLICIT&lt;/div&gt;
    &lt;h4&gt;Mycroft&lt;/h4&gt;
    &lt;p&gt;No engine deductions. The agent must implicitly track its own and teammates&apos; beliefs across turns via a structured &quot;scratch pad,&quot; closer to how humans actually play.&lt;/p&gt;
  &lt;/div&gt;
&lt;/div&gt;

&lt;h3 id=&quot;watson-sherlock&quot;&gt;Watson &amp;amp; Sherlock&lt;/h3&gt;

&lt;p&gt;Watson and Sherlock differ in exactly one respect: whether the agent receives a programmatic belief state. Sherlock’s deductive context lists, for every card in every hand, the colors and ranks still consistent with the clue history. The agent is then prompted to reason in a Bayesian, step-by-step fashion over those candidates before acting.&lt;/p&gt;
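
&lt;p&gt;To make the deductive context concrete, here is a minimal sketch of the kind of per-card candidate set the engine supplies. The clue format below (a list of tuples per card) is our own illustration, not the engine’s actual interface.&lt;/p&gt;

&lt;pre&gt;COLORS = [&quot;R&quot;, &quot;Y&quot;, &quot;G&quot;, &quot;W&quot;, &quot;B&quot;]
RANKS = [1, 2, 3, 4, 5]

# Prune the full (color, rank) space for one card using its clue history.
# Each clue is (kind, value, touched): touched=True means the clue pointed at
# this card, touched=False means the clue skipped it.
def could_be(clues):
    candidates = {(c, r) for c in COLORS for r in RANKS}
    for kind, value, touched in clues:
        if kind == &quot;color&quot;:
            candidates = {cr for cr in candidates if (cr[0] == value) == touched}
        else:  # rank clue
            candidates = {cr for cr in candidates if (cr[1] == value) == touched}
    return sorted(candidates)

# A card touched by a red clue but skipped by a 2 clue: red, any rank except 2.
print(could_be([(&quot;color&quot;, &quot;R&quot;, True), (&quot;rank&quot;, 2, False)]))&lt;/pre&gt;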

&lt;figure&gt;
  &lt;img src=&quot;https://kaousheik-26.github.io//assets/icml/sherlock_watson_teaser.png&quot; alt=&quot;Watson vs Sherlock prompt comparison&quot; /&gt;
  &lt;figcaption&gt;&lt;strong&gt;Figure 1.&lt;/strong&gt; Watson provides only explicit knowledge (clues received). Sherlock additionally provides a Deductive Context block (the per-card belief state) and enforces Bayesian-style step-by-step reasoning.&lt;/figcaption&gt;
&lt;/figure&gt;

&lt;h3 id=&quot;mycroft&quot;&gt;Mycroft&lt;/h3&gt;

&lt;p&gt;Mycroft removes the engine crutch. Each turn, the agent receives the previous turn’s game state, its own deductions for every player, move ratings, chosen action, and reasoning. It must then produce updated deductions, ratings, and an action. This forces the model to be its own Hanabi Learning Environment, tracking belief shifts and card position changes (cards slide left after a play or discard) across 60+ turns.&lt;/p&gt;

&lt;figure&gt;
  &lt;img src=&quot;https://kaousheik-26.github.io//assets/icml/mycroft_teaser.png&quot; alt=&quot;Mycroft scratch pad example&quot; /&gt;
  &lt;figcaption&gt;&lt;strong&gt;Figure 2.&lt;/strong&gt; A Mycroft turn from Player 1&apos;s perspective. The agent maintains an independent deduction block for every other player and must update card positions implicitly after plays and discards.&lt;/figcaption&gt;
&lt;/figure&gt;

&lt;h2 id=&quot;benchmark-results&quot;&gt;Benchmark results&lt;/h2&gt;

&lt;p&gt;We evaluate &lt;strong&gt;17 LLMs&lt;/strong&gt; (4B to 600B+, both reasoning and non-reasoning) across 2 to 5 player self-play, with 10 fixed seeds per configuration. Reasoning models clear ~13/25 in Watson; non-reasoning models mostly stall below 10/25.&lt;/p&gt;

&lt;div class=&quot;results-tabbed&quot; id=&quot;results-table&quot;&gt;

  &lt;!-- Left-side tabs --&gt;
  &lt;div class=&quot;results-tabs&quot;&gt;
    &lt;button class=&quot;results-tab active&quot; data-panel=&quot;panel-watson&quot;&gt;
      &lt;span class=&quot;tab-ord&quot;&gt;01 · Baseline&lt;/span&gt; Watson
    &lt;/button&gt;
    &lt;button class=&quot;results-tab&quot; data-panel=&quot;panel-sherlock&quot;&gt;
      &lt;span class=&quot;tab-ord&quot;&gt;02 · Scaffolded&lt;/span&gt; Sherlock
    &lt;/button&gt;
    &lt;button class=&quot;results-tab&quot; data-panel=&quot;panel-mycroft&quot;&gt;
      &lt;span class=&quot;tab-ord&quot;&gt;03 · Implicit&lt;/span&gt; Mycroft
    &lt;/button&gt;
  &lt;/div&gt;

  &lt;!-- Panels --&gt;
  &lt;div class=&quot;results-panels&quot;&gt;

    &lt;!-- ─── Watson ─── --&gt;
    &lt;div class=&quot;results-panel active&quot; id=&quot;panel-watson&quot;&gt;
      &lt;div class=&quot;results-legend&quot;&gt;
        &lt;div class=&quot;legend-item&quot;&gt;&lt;span class=&quot;swatch non-reasoning&quot;&gt;&lt;/span&gt; Non-reasoning&lt;/div&gt;
        &lt;div class=&quot;legend-item&quot;&gt;&lt;span class=&quot;swatch reasoning&quot;&gt;&lt;/span&gt; Reasoning&lt;/div&gt;
      &lt;/div&gt;
      &lt;table&gt;
        &lt;thead&gt;
          &lt;tr&gt;&lt;th&gt;Model&lt;/th&gt;&lt;th&gt;2-Player&lt;/th&gt;&lt;th&gt;3-Player&lt;/th&gt;&lt;th&gt;4-Player&lt;/th&gt;&lt;th&gt;5-Player&lt;/th&gt;&lt;/tr&gt;
        &lt;/thead&gt;
        &lt;tbody&gt;
          &lt;tr class=&quot;non-reasoning&quot;&gt;&lt;td&gt;Mistral Medium 3&lt;/td&gt;&lt;td&gt;2.2&lt;/td&gt;&lt;td&gt;1.9&lt;/td&gt;&lt;td&gt;1.7&lt;/td&gt;&lt;td&gt;1.2&lt;/td&gt;&lt;/tr&gt;
          &lt;tr class=&quot;non-reasoning&quot;&gt;&lt;td&gt;Gemini 2.0 Flash&lt;/td&gt;&lt;td&gt;4.5&lt;/td&gt;&lt;td&gt;3.7&lt;/td&gt;&lt;td&gt;3.3&lt;/td&gt;&lt;td&gt;3.6&lt;/td&gt;&lt;/tr&gt;
          &lt;tr class=&quot;non-reasoning&quot;&gt;&lt;td&gt;Llama-4 Maverick&lt;/td&gt;&lt;td&gt;3.8&lt;/td&gt;&lt;td&gt;4.4&lt;/td&gt;&lt;td&gt;5.9&lt;/td&gt;&lt;td&gt;4.8&lt;/td&gt;&lt;/tr&gt;
          &lt;tr class=&quot;non-reasoning&quot;&gt;&lt;td&gt;GPT-4o&lt;/td&gt;&lt;td&gt;5.3&lt;/td&gt;&lt;td&gt;4.6&lt;/td&gt;&lt;td&gt;5.3&lt;/td&gt;&lt;td&gt;4.9&lt;/td&gt;&lt;/tr&gt;
          &lt;tr class=&quot;non-reasoning&quot;&gt;&lt;td&gt;DeepSeek-V3&lt;/td&gt;&lt;td&gt;5.9&lt;/td&gt;&lt;td&gt;6.3&lt;/td&gt;&lt;td&gt;4.3&lt;/td&gt;&lt;td&gt;5.0&lt;/td&gt;&lt;/tr&gt;
          &lt;tr class=&quot;non-reasoning&quot;&gt;&lt;td&gt;GPT-4.1 mini&lt;/td&gt;&lt;td&gt;10.8&lt;/td&gt;&lt;td&gt;8.3&lt;/td&gt;&lt;td&gt;8.2&lt;/td&gt;&lt;td&gt;7.2&lt;/td&gt;&lt;/tr&gt;
          &lt;tr class=&quot;non-reasoning&quot;&gt;&lt;td&gt;Claude Sonnet 3.7&lt;/td&gt;&lt;td&gt;10.7&lt;/td&gt;&lt;td&gt;9.2&lt;/td&gt;&lt;td&gt;8.5&lt;/td&gt;&lt;td&gt;6.9&lt;/td&gt;&lt;/tr&gt;
          &lt;tr class=&quot;non-reasoning&quot;&gt;&lt;td&gt;Qwen-32B&lt;/td&gt;&lt;td&gt;9.9&lt;/td&gt;&lt;td&gt;9.0&lt;/td&gt;&lt;td&gt;8.8&lt;/td&gt;&lt;td&gt;9.2&lt;/td&gt;&lt;/tr&gt;
          &lt;tr class=&quot;non-reasoning&quot;&gt;&lt;td&gt;Grok-3&lt;/td&gt;&lt;td&gt;9.9&lt;/td&gt;&lt;td&gt;10.6&lt;/td&gt;&lt;td&gt;9.3&lt;/td&gt;&lt;td&gt;8.0&lt;/td&gt;&lt;/tr&gt;
          &lt;tr class=&quot;non-reasoning&quot;&gt;&lt;td&gt;GPT-4.1&lt;/td&gt;&lt;td&gt;12.1&lt;/td&gt;&lt;td&gt;11.8&lt;/td&gt;&lt;td&gt;10.0&lt;/td&gt;&lt;td&gt;8.2&lt;/td&gt;&lt;/tr&gt;
          &lt;tr class=&quot;reasoning&quot;&gt;&lt;td&gt;Gemini 2.5 Flash&lt;/td&gt;&lt;td&gt;12.8&lt;/td&gt;&lt;td&gt;13.8&lt;/td&gt;&lt;td&gt;13.0&lt;/td&gt;&lt;td&gt;12.7&lt;/td&gt;&lt;/tr&gt;
          &lt;tr class=&quot;reasoning&quot;&gt;&lt;td&gt;Gemini 2.5 Pro&lt;/td&gt;&lt;td&gt;13.2&lt;/td&gt;&lt;td&gt;13.9&lt;/td&gt;&lt;td&gt;12.9&lt;/td&gt;&lt;td&gt;12.9&lt;/td&gt;&lt;/tr&gt;
          &lt;tr class=&quot;reasoning&quot;&gt;&lt;td&gt;Qwen-235B-A22B&lt;/td&gt;&lt;td&gt;15.0&lt;/td&gt;&lt;td&gt;14.6&lt;/td&gt;&lt;td&gt;13.0&lt;/td&gt;&lt;td&gt;12.9&lt;/td&gt;&lt;/tr&gt;
          &lt;tr class=&quot;reasoning&quot;&gt;&lt;td&gt;Grok-3 Mini&lt;/td&gt;&lt;td&gt;14.2&lt;/td&gt;&lt;td&gt;13.9&lt;/td&gt;&lt;td&gt;14.5&lt;/td&gt;&lt;td&gt;14.8&lt;/td&gt;&lt;/tr&gt;
          &lt;tr class=&quot;reasoning&quot;&gt;&lt;td&gt;DeepSeek-R1&lt;/td&gt;&lt;td&gt;14.2&lt;/td&gt;&lt;td&gt;15.3&lt;/td&gt;&lt;td&gt;14.1&lt;/td&gt;&lt;td&gt;13.4&lt;/td&gt;&lt;/tr&gt;
          &lt;tr class=&quot;reasoning&quot;&gt;&lt;td&gt;o4-mini&lt;/td&gt;&lt;td&gt;15.0&lt;/td&gt;&lt;td&gt;15.5&lt;/td&gt;&lt;td&gt;14.5&lt;/td&gt;&lt;td&gt;13.9&lt;/td&gt;&lt;/tr&gt;
          &lt;tr class=&quot;reasoning&quot;&gt;&lt;td&gt;o3&lt;/td&gt;&lt;td&gt;15.9&lt;/td&gt;&lt;td&gt;15.3&lt;/td&gt;&lt;td&gt;16.4&lt;/td&gt;&lt;td&gt;13.9&lt;/td&gt;&lt;/tr&gt;
        &lt;/tbody&gt;
      &lt;/table&gt;
      &lt;div class=&quot;panel-note&quot;&gt;Average scores over 10 seeds per configuration.&lt;/div&gt;
    &lt;/div&gt;

    &lt;!-- ─── Sherlock ─── --&gt;
    &lt;div class=&quot;results-panel&quot; id=&quot;panel-sherlock&quot;&gt;
      &lt;div class=&quot;results-legend&quot;&gt;
        &lt;div class=&quot;legend-item&quot;&gt;&lt;span class=&quot;swatch non-reasoning&quot;&gt;&lt;/span&gt; Non-reasoning&lt;/div&gt;
        &lt;div class=&quot;legend-item&quot;&gt;&lt;span class=&quot;swatch reasoning&quot;&gt;&lt;/span&gt; Reasoning&lt;/div&gt;
      &lt;/div&gt;
      &lt;table&gt;
        &lt;thead&gt;
          &lt;tr&gt;&lt;th&gt;Model&lt;/th&gt;&lt;th&gt;2-Player&lt;/th&gt;&lt;th&gt;3-Player&lt;/th&gt;&lt;th&gt;4-Player&lt;/th&gt;&lt;th&gt;5-Player&lt;/th&gt;&lt;/tr&gt;
        &lt;/thead&gt;
        &lt;tbody&gt;
          &lt;tr class=&quot;non-reasoning&quot;&gt;&lt;td&gt;Mistral Medium 3&lt;/td&gt;&lt;td&gt;4.1&lt;/td&gt;&lt;td&gt;4.8&lt;/td&gt;&lt;td&gt;5.3&lt;/td&gt;&lt;td&gt;5.4&lt;/td&gt;&lt;/tr&gt;
          &lt;tr class=&quot;non-reasoning&quot;&gt;&lt;td&gt;Gemini 2.0 Flash&lt;/td&gt;&lt;td&gt;4.2&lt;/td&gt;&lt;td&gt;3.3&lt;/td&gt;&lt;td&gt;4.0&lt;/td&gt;&lt;td&gt;4.3&lt;/td&gt;&lt;/tr&gt;
          &lt;tr class=&quot;non-reasoning&quot;&gt;&lt;td&gt;Llama-4 Maverick&lt;/td&gt;&lt;td&gt;4.9&lt;/td&gt;&lt;td&gt;5.2&lt;/td&gt;&lt;td&gt;5.4&lt;/td&gt;&lt;td&gt;5.6&lt;/td&gt;&lt;/tr&gt;
          &lt;tr class=&quot;non-reasoning&quot;&gt;&lt;td&gt;GPT-4o&lt;/td&gt;&lt;td&gt;4.4&lt;/td&gt;&lt;td&gt;4.1&lt;/td&gt;&lt;td&gt;4.5&lt;/td&gt;&lt;td&gt;4.6&lt;/td&gt;&lt;/tr&gt;
          &lt;tr class=&quot;non-reasoning&quot;&gt;&lt;td&gt;DeepSeek-V3&lt;/td&gt;&lt;td&gt;3.9&lt;/td&gt;&lt;td&gt;4.2&lt;/td&gt;&lt;td&gt;5.4&lt;/td&gt;&lt;td&gt;5.8&lt;/td&gt;&lt;/tr&gt;
          &lt;tr class=&quot;non-reasoning&quot;&gt;&lt;td&gt;GPT-4.1 mini&lt;/td&gt;&lt;td&gt;6.5&lt;/td&gt;&lt;td&gt;6.1&lt;/td&gt;&lt;td&gt;5.1&lt;/td&gt;&lt;td&gt;5.8&lt;/td&gt;&lt;/tr&gt;
          &lt;tr class=&quot;non-reasoning&quot;&gt;&lt;td&gt;Claude Sonnet 3.7&lt;/td&gt;&lt;td&gt;5.4&lt;/td&gt;&lt;td&gt;5.4&lt;/td&gt;&lt;td&gt;5.4&lt;/td&gt;&lt;td&gt;5.6&lt;/td&gt;&lt;/tr&gt;
          &lt;tr class=&quot;non-reasoning&quot;&gt;&lt;td&gt;Qwen-32B&lt;/td&gt;&lt;td&gt;5.6&lt;/td&gt;&lt;td&gt;13.1&lt;/td&gt;&lt;td&gt;5.4&lt;/td&gt;&lt;td&gt;12.1&lt;/td&gt;&lt;/tr&gt;
          &lt;tr class=&quot;non-reasoning&quot;&gt;&lt;td&gt;Grok-3&lt;/td&gt;&lt;td&gt;12.8&lt;/td&gt;&lt;td&gt;8.0&lt;/td&gt;&lt;td&gt;13.3&lt;/td&gt;&lt;td&gt;5.6&lt;/td&gt;&lt;/tr&gt;
          &lt;tr class=&quot;non-reasoning&quot;&gt;&lt;td&gt;GPT-4.1&lt;/td&gt;&lt;td&gt;14.8&lt;/td&gt;&lt;td&gt;16.4&lt;/td&gt;&lt;td&gt;15.5&lt;/td&gt;&lt;td&gt;14.4&lt;/td&gt;&lt;/tr&gt;
          &lt;tr class=&quot;reasoning&quot;&gt;&lt;td&gt;Gemini 2.5 Flash&lt;/td&gt;&lt;td&gt;8.4&lt;/td&gt;&lt;td&gt;6.6&lt;/td&gt;&lt;td&gt;7.7&lt;/td&gt;&lt;td&gt;5.6&lt;/td&gt;&lt;/tr&gt;
          &lt;tr class=&quot;reasoning&quot;&gt;&lt;td&gt;Gemini 2.5 Pro&lt;/td&gt;&lt;td&gt;12.8&lt;/td&gt;&lt;td&gt;16.2&lt;/td&gt;&lt;td&gt;16.9&lt;/td&gt;&lt;td&gt;14.4&lt;/td&gt;&lt;/tr&gt;
          &lt;tr class=&quot;reasoning&quot;&gt;&lt;td&gt;Qwen-235B-A22B&lt;/td&gt;&lt;td&gt;14.6&lt;/td&gt;&lt;td&gt;16.6&lt;/td&gt;&lt;td&gt;16.7&lt;/td&gt;&lt;td&gt;13.3&lt;/td&gt;&lt;/tr&gt;
          &lt;tr class=&quot;reasoning&quot;&gt;&lt;td&gt;Grok-3 Mini&lt;/td&gt;&lt;td&gt;14.4&lt;/td&gt;&lt;td&gt;16.6&lt;/td&gt;&lt;td&gt;17.4&lt;/td&gt;&lt;td&gt;15.5&lt;/td&gt;&lt;/tr&gt;
          &lt;tr class=&quot;reasoning&quot;&gt;&lt;td&gt;DeepSeek-R1&lt;/td&gt;&lt;td&gt;17.5&lt;/td&gt;&lt;td&gt;16.6&lt;/td&gt;&lt;td&gt;15.6&lt;/td&gt;&lt;td&gt;15.1&lt;/td&gt;&lt;/tr&gt;
          &lt;tr class=&quot;reasoning&quot;&gt;&lt;td&gt;o4-mini&lt;/td&gt;&lt;td&gt;14.6&lt;/td&gt;&lt;td&gt;18.0&lt;/td&gt;&lt;td&gt;14.1&lt;/td&gt;&lt;td&gt;13.0&lt;/td&gt;&lt;/tr&gt;
          &lt;tr class=&quot;reasoning&quot;&gt;&lt;td&gt;o3&lt;/td&gt;&lt;td&gt;17.6&lt;/td&gt;&lt;td&gt;17.6&lt;/td&gt;&lt;td&gt;16.8&lt;/td&gt;&lt;td&gt;15.7&lt;/td&gt;&lt;/tr&gt;
        &lt;/tbody&gt;
      &lt;/table&gt;
      &lt;div class=&quot;panel-note&quot;&gt;Average scores over 10 seeds per configuration.&lt;/div&gt;
    &lt;/div&gt;

    &lt;!-- ─── Mycroft ─── --&gt;
    &lt;div class=&quot;results-panel&quot; id=&quot;panel-mycroft&quot;&gt;
      &lt;div class=&quot;results-legend&quot;&gt;
        &lt;div class=&quot;legend-item&quot;&gt;&lt;span class=&quot;swatch reasoning&quot;&gt;&lt;/span&gt; Reasoning (only)&lt;/div&gt;
      &lt;/div&gt;
      &lt;table&gt;
        &lt;thead&gt;
          &lt;tr&gt;&lt;th&gt;Model&lt;/th&gt;&lt;th&gt;2-Player&lt;/th&gt;&lt;th&gt;3-Player&lt;/th&gt;&lt;th&gt;4-Player&lt;/th&gt;&lt;th&gt;5-Player&lt;/th&gt;&lt;/tr&gt;
        &lt;/thead&gt;
        &lt;tbody&gt;
          &lt;tr class=&quot;reasoning&quot;&gt;&lt;td&gt;o4-mini&lt;/td&gt;&lt;td&gt;10.8&lt;/td&gt;&lt;td&gt;12.4&lt;/td&gt;&lt;td&gt;11.3&lt;/td&gt;&lt;td&gt;10.9&lt;/td&gt;&lt;/tr&gt;
          &lt;tr class=&quot;reasoning&quot;&gt;&lt;td&gt;Grok-3 Mini&lt;/td&gt;&lt;td&gt;14.2&lt;/td&gt;&lt;td&gt;16.5&lt;/td&gt;&lt;td&gt;14.5&lt;/td&gt;&lt;td&gt;14.4&lt;/td&gt;&lt;/tr&gt;
          &lt;tr class=&quot;reasoning&quot;&gt;&lt;td&gt;Gemini 2.5 Pro&lt;/td&gt;&lt;td&gt;10.2&lt;/td&gt;&lt;td&gt;13.4&lt;/td&gt;&lt;td&gt;14.1&lt;/td&gt;&lt;td&gt;11.6&lt;/td&gt;&lt;/tr&gt;
          &lt;tr class=&quot;reasoning&quot;&gt;&lt;td&gt;Gemini 2.5 Flash&lt;/td&gt;&lt;td&gt;11.8&lt;/td&gt;&lt;td&gt;13.2&lt;/td&gt;&lt;td&gt;12.3&lt;/td&gt;&lt;td&gt;9.8&lt;/td&gt;&lt;/tr&gt;
          &lt;tr class=&quot;reasoning&quot;&gt;&lt;td&gt;o3&lt;/td&gt;&lt;td&gt;16.3&lt;/td&gt;&lt;td&gt;16.4&lt;/td&gt;&lt;td&gt;15.5&lt;/td&gt;&lt;td&gt;14.7&lt;/td&gt;&lt;/tr&gt;
        &lt;/tbody&gt;
      &lt;/table&gt;
      &lt;div class=&quot;panel-note&quot;&gt;Mycroft evaluated on the top 5 reasoning models only. Average scores over 10 seeds.&lt;/div&gt;
    &lt;/div&gt;

  &lt;/div&gt;
&lt;/div&gt;

&lt;figcaption style=&quot;margin-top: -0.5rem; font-size: 0.8rem; color: var(--text3); line-height: 1.5;&quot;&gt;&lt;strong style=&quot;color: var(--text2);&quot;&gt;Table 2.&lt;/strong&gt; Average scores (out of 25) across all three scaffolds. Watson provides minimal context; Sherlock adds deductive beliefs; Mycroft requires fully implicit state tracking.&lt;/figcaption&gt;

&lt;h2 id=&quot;ablations&quot;&gt;Ablations&lt;/h2&gt;

&lt;h3 id=&quot;cross-play&quot;&gt;Cross-play&lt;/h3&gt;

&lt;p&gt;Self-play is generous; real cooperation is ad hoc. We compose teams with one Grok-3-mini agent and the rest o4-mini, the weaker of the two in Mycroft (11.3 vs 14.9 average). Across all 2 to 5 player settings, &lt;strong&gt;adding one stronger agent lifts team scores by ~1.7 points&lt;/strong&gt;. Performance interpolates smoothly between the weak and strong self-play baselines, unlike specialized RL agents, which collapse with unfamiliar partners.&lt;/p&gt;

&lt;figure&gt;
  &lt;img src=&quot;https://kaousheik-26.github.io//assets/icml/cross_play.png&quot; alt=&quot;Cross-play interpolation&quot; /&gt;
  &lt;figcaption&gt;&lt;strong&gt;Figure 8.&lt;/strong&gt; Mixed teams score between weak (all o4-mini) and strong (all Grok-3-mini) self-play, demonstrating that LLM agents cooperate gracefully with unfamiliar partners, in meaningful contrast with traditional self-play RL.&lt;/figcaption&gt;
&lt;/figure&gt;

&lt;h3 id=&quot;best-of-k&quot;&gt;Best-of-K&lt;/h3&gt;

&lt;p&gt;Sample the agent K times per turn and ask it to pick its best candidate. With Watson, performance climbs through K=5 (+1.5 on average) then plateaus. With Sherlock, gains are negligible (+0.1) because a well-engineered prompt mostly converges to the same action across samples, so naive scaling does not help. &lt;strong&gt;Better context beats more samples.&lt;/strong&gt;&lt;/p&gt;
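
&lt;p&gt;A minimal sketch of the Best-of-K loop we run each turn; the &lt;code&gt;llm&lt;/code&gt; callable and the selection-prompt wording below are placeholders, not the exact prompts from the paper.&lt;/p&gt;

&lt;pre&gt;# Sample K candidate moves, then let the same model pick among them.
# `llm` is a hypothetical callable mapping a prompt string to a completion.
def best_of_k(llm, turn_prompt, k=5):
    candidates = [llm(turn_prompt) for _ in range(k)]
    menu = &quot;\n&quot;.join(f&quot;{i + 1}. {c}&quot; for i, c in enumerate(candidates))
    pick = llm(turn_prompt + &quot;\n\nCandidate moves:\n&quot; + menu
               + &quot;\nReply with the number of the best move.&quot;)
    try:
        return candidates[int(pick.strip()) - 1]
    except (ValueError, IndexError):
        return candidates[0]  # fall back to the first sample on a malformed reply&lt;/pre&gt;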

&lt;h3 id=&quot;moa&quot;&gt;Mixture of agents&lt;/h3&gt;

&lt;div class=&quot;text-img-row&quot;&gt;
  &lt;div class=&quot;text-side&quot;&gt;
    &lt;p&gt;To break sample homogeneity, we run five role-specialized agents in parallel (Baseline, Rank-Focused, Analyst, Discard Strategist, History Analyst) and aggregate their proposals via a sixth &quot;finalizer&quot; agent.&lt;/p&gt;
    &lt;p&gt;MoA modestly improves the 5-player setting (+1.1 with Watson, +0.8 with Sherlock over Best-of-5) but introduces high variance: speculative agents (especially the History Analyst) occasionally mislead the aggregator and tank a run. Diversity helps when it lands; reliability remains the open problem.&lt;/p&gt;
  &lt;/div&gt;
  &lt;div class=&quot;img-side&quot;&gt;
    &lt;img src=&quot;https://kaousheik-26.github.io//assets/icml/moa.png&quot; alt=&quot;Mixture of Agents architecture&quot; /&gt;
  &lt;/div&gt;
&lt;/div&gt;

&lt;h2 id=&quot;post-training&quot;&gt;Post-training: closing the gap with a 4B model&lt;/h2&gt;

&lt;p&gt;To validate our datasets, we post-train &lt;strong&gt;Qwen3-4B-Instruct-2507&lt;/strong&gt;, a small non-reasoning model, on data collected from o3 and Grok-3-mini.&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;&lt;strong&gt;HanabiLogs&lt;/strong&gt; (1,520+ trajectories): used for supervised finetuning.&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;HanabiRewards&lt;/strong&gt; (560+ games with dense move-level utility annotations): used for RLVR via GRPO (sketched after this list).&lt;/li&gt;
&lt;/ul&gt;
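
&lt;p&gt;The heart of GRPO is a group-relative advantage: sample several candidate responses for the same prompt, score each with the move-level reward from HanabiRewards, and normalize within the group. A minimal sketch of just that step (the sampling loop and the clipped policy-gradient update are omitted, and the reward values are placeholders):&lt;/p&gt;

&lt;pre&gt;import statistics

# Normalize rewards within a group of sampled responses for one prompt.
# A zero-variance group gets a unit denominator to avoid dividing by zero.
def group_relative_advantages(rewards):
    mean = statistics.mean(rewards)
    std = statistics.pstdev(rewards) or 1.0
    return [(r - mean) / std for r in rewards]

print(group_relative_advantages([0.2, 0.9, 0.4, 0.7]))&lt;/pre&gt;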

&lt;p&gt;The base model scores 1.7 in Mycroft. After RL on HanabiRewards it reaches &lt;strong&gt;8.3&lt;/strong&gt;, a +388% jump that lands within ~3 points of o4-mini (11.3) and surpasses GPT-4.1 (the best non-reasoning baseline) by +88%. In Sherlock, the same model jumps from 4.8 to 12.3 (+156%), comparable to Grok-3 and beating GPT-4o.&lt;/p&gt;

&lt;figure&gt;
  &lt;div class=&quot;img-row&quot; style=&quot;margin: 0;&quot;&gt;
    &lt;img src=&quot;https://kaousheik-26.github.io//assets/icml/Sherlock_finetune.png&quot; alt=&quot;Sherlock post-training results&quot; /&gt;
    &lt;img src=&quot;https://kaousheik-26.github.io//assets/icml/mycroft_finetuned.png&quot; alt=&quot;Mycroft post-training results&quot; /&gt;
  &lt;/div&gt;
  &lt;figcaption&gt;&lt;strong&gt;Figure 9.&lt;/strong&gt; Qwen3-4B before and after supervised finetuning (Ours-SFT) and RLVR (Ours-RL), versus larger proprietary models. Evaluated on held-out seeds to avoid leakage.&lt;/figcaption&gt;
&lt;/figure&gt;

&lt;h3 id=&quot;generalization&quot;&gt;Generalization beyond Hanabi&lt;/h3&gt;

&lt;p&gt;The interesting result isn’t just “we got better at Hanabi.” Training on HanabiRewards transfers to out-of-domain benchmarks: cooperative group guessing, long-context temporal reasoning, and instruction following all improve, with no degradation on math.&lt;/p&gt;

&lt;div class=&quot;table-wrap&quot;&gt;
  &lt;table&gt;
    &lt;caption&gt;Table 1. Qwen3-4B base vs. our RL-finetuned model. Group Guessing is wins/200 games (cooperative); EventQA is 6-way MCQ accuracy at increasing context lengths (temporal reasoning); IFBench is strict instruction-following; AIME 2025 measures math reasoning.&lt;/caption&gt;
    &lt;thead&gt;
      &lt;tr&gt;
        &lt;th&gt;Model&lt;/th&gt;
        &lt;th&gt;Group Guess&lt;br /&gt;&lt;span style=&quot;font-weight:400;font-size:0.7rem;&quot;&gt;(1st / 2nd run)&lt;/span&gt;&lt;/th&gt;
        &lt;th&gt;EventQA&lt;br /&gt;&lt;span style=&quot;font-weight:400;font-size:0.7rem;&quot;&gt;(64K / 128K / 800K)&lt;/span&gt;&lt;/th&gt;
        &lt;th&gt;IFBench&lt;br /&gt;&lt;span style=&quot;font-weight:400;font-size:0.7rem;&quot;&gt;(Avg / Pass@10)&lt;/span&gt;&lt;/th&gt;
        &lt;th&gt;AIME 2025&lt;br /&gt;&lt;span style=&quot;font-weight:400;font-size:0.7rem;&quot;&gt;(Avg / Pass@10)&lt;/span&gt;&lt;/th&gt;
      &lt;/tr&gt;
    &lt;/thead&gt;
    &lt;tbody&gt;
      &lt;tr&gt;
        &lt;td&gt;Base&lt;/td&gt;
        &lt;td class=&quot;num&quot;&gt;61.0 / 60.5&lt;/td&gt;
        &lt;td class=&quot;num&quot;&gt;84.0 / 62.6 / 37.2&lt;/td&gt;
        &lt;td class=&quot;num&quot;&gt;30.9 / 42.9&lt;/td&gt;
        &lt;td class=&quot;num&quot;&gt;48.7 / 73.3&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
        &lt;td&gt;&lt;strong&gt;Ours-RL&lt;/strong&gt;&lt;/td&gt;
        &lt;td class=&quot;num delta-pos&quot;&gt;73.0 / 71.5&lt;/td&gt;
        &lt;td class=&quot;num delta-pos&quot;&gt;85.6 / 66.8 / 43.6&lt;/td&gt;
        &lt;td class=&quot;num delta-pos&quot;&gt;31.5 / 44.6&lt;/td&gt;
        &lt;td class=&quot;num&quot;&gt;50.0 / 73.3&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
        &lt;td&gt;Δ&lt;/td&gt;
        &lt;td class=&quot;num delta-pos&quot;&gt;+12.0 / +11.0&lt;/td&gt;
        &lt;td class=&quot;num delta-pos&quot;&gt;+1.6 / +4.2 / +6.4&lt;/td&gt;
        &lt;td class=&quot;num delta-pos&quot;&gt;+0.6 / +1.7&lt;/td&gt;
        &lt;td class=&quot;num delta-neutral&quot;&gt;+1.3 / +0.0&lt;/td&gt;
      &lt;/tr&gt;
    &lt;/tbody&gt;
  &lt;/table&gt;
&lt;/div&gt;

&lt;p&gt;The temporal-reasoning lift on EventQA grows with context length (+1.6, +4.2, +6.4 from 64K to 800K), which we read as evidence that learning to implicitly track Hanabi state generalizes to long-horizon belief tracking elsewhere. AIME stays flat, with no catastrophic forgetting on math.&lt;/p&gt;

&lt;h2 id=&quot;takeaways&quot;&gt;Takeaways&lt;/h2&gt;

&lt;ol&gt;
  &lt;li&gt;&lt;strong&gt;Modern reasoning LLMs show sparks of cooperative reasoning, but reliable multi-agent coordination remains unsolved.&lt;/strong&gt; The best score ~15 to 18/25 in self-play, comfortably below specialized agents (&amp;gt;23) and the median human Hanabi player (~18 to 21).&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Scaffold design matters more than model scale.&lt;/strong&gt; Moving from Watson to Sherlock improves reasoning models by +2.0 on average; the same scaffold &lt;em&gt;hurts&lt;/em&gt; most non-reasoning models. Different families respond differently to identical context.&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Implicit state tracking is the open problem.&lt;/strong&gt; Even o3 drops 1.2 points moving from engine-provided deductions to self-tracking; Gemini 2.5 Pro drops 3.7. Multi-turn belief maintenance is where current models break.&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Cross-play is graceful.&lt;/strong&gt; Unlike specialized RL agents, LLMs interpolate smoothly between weak and strong teammates, showing a small but real “spark” of cooperative generalization.&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;A 4B model can carry surprising weight.&lt;/strong&gt; Post-training on our datasets closes most of the gap to frontier reasoning models on Hanabi &lt;em&gt;and&lt;/em&gt; transfers to temporal reasoning, instruction following, and out-of-domain cooperation.&lt;/li&gt;
&lt;/ol&gt;

&lt;h2 id=&quot;citation&quot;&gt;Citation&lt;/h2&gt;

&lt;div class=&quot;citation-block&quot;&gt;
  &lt;div class=&quot;citation-header&quot;&gt;
    &lt;span class=&quot;lbl&quot;&gt;BibTeX&lt;/span&gt;
    &lt;button&gt;&lt;span&gt;Copy&lt;/span&gt;&lt;/button&gt;
  &lt;/div&gt;
&lt;pre&gt;@misc{ramesh2026sparkscooperativereasoningllms,
      title={Sparks of Cooperative Reasoning: LLMs as Strategic Hanabi Agents}, 
      author={Mahesh Ramesh and Kaousheik Jayakumar and Aswinkumar Ramkumar and Pavan Thodima and Aniket Rege and Emmanouil-Vasileios Vlatakis-Gkaragkounis},
      year={2026},
      eprint={2601.18077},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2601.18077}, 
}&lt;/pre&gt;
&lt;/div&gt;
</description>
        <pubDate>Sat, 21 Mar 2026 00:00:00 +0000</pubDate>
        <link>https://kaousheik-26.github.io//llms-hanabi-cooperative-reasoning/</link>
        <guid isPermaLink="true">https://kaousheik-26.github.io//llms-hanabi-cooperative-reasoning/</guid>
        
        
      </item>
    
      <item>
        <title>Where Does the Sound Go? Probing Audio-Visual Language Models for Modality Bias</title>
        <description>&lt;h2 id=&quot;motivation&quot;&gt;Motivation&lt;/h2&gt;

&lt;p&gt;A new generation of audio-visual large language models — Gemini, GPT-4o’s audio mode, Qwen2-Audio, and several open-weights successors — markets itself as &lt;em&gt;truly multimodal&lt;/em&gt;: pass in a video clip and they will answer questions about both what’s on the screen and what’s on the soundtrack. The marketing demos are compelling. A car horn off-screen, a dog barking behind the camera, a violinist tuning before the visual cut — these are exactly the kinds of cases where audio carries information vision can’t.&lt;/p&gt;

&lt;p&gt;But how much of the answer actually comes from the audio? When the model says “the woman is playing a violin,” is that because it heard the bowing or because it saw the instrument? When you swap the soundtrack for white noise, does the answer change at all?&lt;/p&gt;

&lt;p&gt;This work probes audio-visual LLMs for &lt;strong&gt;modality bias&lt;/strong&gt; — specifically, how much weight the model genuinely places on the audio stream when both modalities are available. We find a sharp and consistent pattern: across four open-weights AV-LLMs and three closed APIs, &lt;strong&gt;vision dominates the prediction in roughly 87% of cases where audio and vision disagree&lt;/strong&gt;. We then trace where in the network the audio signal gets attenuated, and find the bottleneck is concentrated in the cross-modal projection layers, not in the audio encoder itself.&lt;/p&gt;

&lt;h2 id=&quot;qualitative-examples&quot;&gt;Qualitative examples&lt;/h2&gt;

&lt;p&gt;Before the numbers, a flavor of what we mean. Consider a clip of someone slicing a cucumber on a wooden cutting board. The visual is unambiguous — knife, cucumber, board. The audio is the percussive &lt;em&gt;thock-thock&lt;/em&gt; of blade on wood. Now we replace the audio with the sound of a violin tuning, and ask the model: &lt;em&gt;“What is happening in this video?”&lt;/em&gt; Every model we tested answered some variant of “a person is slicing a cucumber” — the violin sound was completely ignored.&lt;/p&gt;

&lt;p&gt;The mirror experiment is equally telling. We take a clip of a violinist mid-performance, mute the violin, and dub in cucumber-chopping audio. The models still describe a violinist. Vision wins both times. The audio stream might as well not exist for these examples.&lt;/p&gt;

&lt;p&gt;The full paper has a gallery of around 60 such pairs, organized by the type of audio-visual conflict (object identity, action, environment, speaker characteristics). The pattern is remarkably consistent across model families.&lt;/p&gt;

&lt;h2 id=&quot;how-we-study-this&quot;&gt;How we study this&lt;/h2&gt;

&lt;h3 id=&quot;counterfactual-probes&quot;&gt;Counterfactual probes&lt;/h3&gt;

&lt;p&gt;To measure modality reliance directly, we built a dataset of &lt;strong&gt;2,400 counterfactual video pairs&lt;/strong&gt;. Each pair shares one modality and swaps the other: same video, two soundtracks; or same soundtrack, two videos. We then ask each model the same open-ended question about the clip and measure how often its answer flips when we swap the audio versus when we swap the video.&lt;/p&gt;

&lt;p&gt;A model that genuinely fuses both modalities should produce different answers for the two audio conditions in cases where the audio is informative. A model that ignores audio will produce identical answers regardless of what’s on the soundtrack. The ratio of these two flip rates is what we call the &lt;strong&gt;modality reliance ratio&lt;/strong&gt;, and across all seven models we tested, it’s heavily skewed toward vision: typical values land between 0.08 and 0.15, meaning audio swaps change the answer roughly an order of magnitude less often than video swaps do.&lt;/p&gt;
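
&lt;p&gt;As a sketch of the metric, the ratio is just the audio-swap flip rate divided by the video-swap flip rate. The &lt;code&gt;answer&lt;/code&gt; helper, the pair format, and the exact-string flip check below are simplifying assumptions (the paper judges open-ended answers for a semantic flip rather than comparing strings).&lt;/p&gt;

&lt;pre&gt;# Flip rate over counterfactual pairs: each pair holds two (video, audio)
# clips that differ in exactly one modality. `answer` is a hypothetical helper
# that queries the model with a fixed open-ended question.
def flip_rate(model, pairs, question=&quot;What is happening in this video?&quot;):
    flips = sum(answer(model, v1, a1, question) != answer(model, v2, a2, question)
                for (v1, a1), (v2, a2) in pairs)
    return flips / len(pairs)

# Usage with a loaded model and the two pair sets (assumed names):
# audio_pairs: same video, two soundtracks; video_pairs: same soundtrack, two videos.
# reliance_ratio = flip_rate(model, audio_pairs) / flip_rate(model, video_pairs)&lt;/pre&gt;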

&lt;h3 id=&quot;layer-wise-attribution&quot;&gt;Layer-wise attribution&lt;/h3&gt;

&lt;p&gt;The flip-rate experiments tell us &lt;em&gt;that&lt;/em&gt; vision dominates. To understand &lt;em&gt;where&lt;/em&gt; the audio signal gets dropped, we run gradient-based attribution at every layer of the model, following the audio token contributions from the audio encoder all the way through the cross-modal projector and into the language model’s residual stream.&lt;/p&gt;

&lt;p&gt;The picture that emerges is striking. Inside the audio encoder, audio tokens carry meaningful, distinguishable representations — different sounds produce different embeddings, and a linear probe can recover the underlying class with high accuracy. The information is there. But once those tokens pass through the cross-modal projection layer that maps them into the language model’s embedding space, their gradient contribution to the final answer drops by roughly 70%. By the time the signal reaches the LLM’s middle layers, audio tokens are contributing less than 5% of the residual stream norm at the answer position.&lt;/p&gt;

&lt;p&gt;In other words: the audio encoder is doing its job. The language model is mostly ignoring its output.&lt;/p&gt;
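
&lt;p&gt;For readers who want the mechanical recipe, below is a simplified gradient-times-input version of the attribution. The HuggingFace-style &lt;code&gt;inputs_embeds&lt;/code&gt; interface and the modality index lists are assumptions about the model wrapper, and the paper’s layer-wise analysis is richer than this single-pass sketch.&lt;/p&gt;

&lt;pre&gt;import torch

# Gradient-x-input attribution of one answer token to audio vs. vision tokens.
# Assumes a causal LM that accepts inputs_embeds of shape (1, T, d) and returns
# logits of shape (1, T, vocab); audio_idx / vision_idx list token positions.
def modality_attribution(model, inputs_embeds, answer_pos, answer_token_id,
                         audio_idx, vision_idx):
    inputs_embeds = inputs_embeds.clone().detach().requires_grad_(True)
    logits = model(inputs_embeds=inputs_embeds).logits
    logits[0, answer_pos, answer_token_id].backward()
    per_token = (inputs_embeds.grad * inputs_embeds).sum(dim=-1).abs()[0]  # (T,)
    return per_token[audio_idx].sum().item(), per_token[vision_idx].sum().item()&lt;/pre&gt;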

&lt;h2 id=&quot;findings&quot;&gt;Findings&lt;/h2&gt;

&lt;h3 id=&quot;does-the-model-pay-attention-to-audio&quot;&gt;Does the model pay attention to audio&lt;/h3&gt;

&lt;p&gt;Attention rollout from the answer token back to the input shows that audio tokens receive between 2% and 8% of total attention mass across the seven models we tested, while vision tokens receive between 60% and 80%. This is roughly proportional to token count (vision contributes more tokens), but per-token attention is still 2–3× higher for vision tokens. The model is not weighting the modalities equally.&lt;/p&gt;
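
&lt;p&gt;A simplified version of this measurement, using mean attention rather than full attention rollout; the tensor shapes follow the HuggingFace &lt;code&gt;output_attentions=True&lt;/code&gt; convention, which is an assumption about the wrapper.&lt;/p&gt;

&lt;pre&gt;import torch

# attentions: list of (1, heads, T, T) tensors from one forward pass.
# audio_idx / vision_idx: lists of input token positions for each modality.
def modality_attention_share(attentions, answer_pos, audio_idx, vision_idx):
    att = torch.stack(attentions).mean(dim=(0, 2))[0, answer_pos]  # (T,) row
    audio_mass, vision_mass = att[audio_idx].sum(), att[vision_idx].sum()
    return {
        &quot;audio_total&quot;: audio_mass.item(),
        &quot;vision_total&quot;: vision_mass.item(),
        &quot;audio_per_token&quot;: (audio_mass / len(audio_idx)).item(),
        &quot;vision_per_token&quot;: (vision_mass / len(vision_idx)).item(),
    }&lt;/pre&gt;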

&lt;h3 id=&quot;are-audio-representations-meaningful&quot;&gt;Are audio representations meaningful&lt;/h3&gt;

&lt;p&gt;A natural worry: maybe the audio encoder is just bad. We rule this out with linear probes. A simple linear classifier trained on the audio encoder’s output can distinguish 80+ environmental sound classes with above-90% accuracy, and can identify speaker gender with above-95%. The representations are rich and well-separated. The bottleneck is downstream.&lt;/p&gt;
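
&lt;p&gt;The probe itself is nothing exotic. A sketch with scikit-learn, assuming the pooled audio-encoder embeddings and class labels have already been extracted into arrays:&lt;/p&gt;

&lt;pre&gt;from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# embeddings: (N, d) pooled audio-encoder outputs; labels: (N,) sound-class ids.
def probe_accuracy(embeddings, labels):
    X_train, X_test, y_train, y_test = train_test_split(
        embeddings, labels, test_size=0.2, random_state=0, stratify=labels)
    probe = LogisticRegression(max_iter=2000).fit(X_train, y_train)
    return probe.score(X_test, y_test)&lt;/pre&gt;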

&lt;h3 id=&quot;how-does-cross-modal-information-flow&quot;&gt;How does cross-modal information flow&lt;/h3&gt;

&lt;p&gt;Following audio tokens through the cross-modal projector, we observe a dramatic compression. The cosine similarity between input audio embeddings and the projected versions used by the LLM drops to around 0.3 — meaning the projector is largely overwriting the audio encoder’s structure with whatever the LLM expects to receive. We hypothesize this is because the projector was trained on vision-heavy data and learned a mapping that’s effectively a noise channel for audio.&lt;/p&gt;
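
&lt;p&gt;The measurement itself is a per-token cosine similarity averaged over a clip. The sketch below assumes the encoder outputs and the projected tokens share a dimension; when the projector changes dimensionality, an alignment step would be needed before comparing.&lt;/p&gt;

&lt;pre&gt;import torch.nn.functional as F

# audio_tokens: (num_audio_tokens, d) outputs of the audio encoder for a clip.
# projected_tokens: the same tokens after the cross-modal projector, same shape.
def mean_projection_similarity(audio_tokens, projected_tokens):
    return F.cosine_similarity(audio_tokens, projected_tokens, dim=-1).mean().item()&lt;/pre&gt;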

&lt;h3 id=&quot;where-does-the-vision-bias-originate&quot;&gt;Where does the vision bias originate&lt;/h3&gt;

&lt;p&gt;To pin down whether the bias is learned or architectural, we re-trained the cross-modal projector on a balanced audio-visual dataset where audio is the only informative signal in 50% of examples. The bias drops substantially: modality reliance ratio rises from ~0.12 to ~0.41 after just a few thousand fine-tuning steps. The vision bias is not architectural — it’s a training-data artifact that can be partially undone with targeted data, but only by deliberately oversampling audio-critical examples.&lt;/p&gt;
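
&lt;p&gt;Mechanically, the intervention only touches the projector weights. A minimal sketch of that setup, where the module name is a guess that depends on the specific checkpoint:&lt;/p&gt;

&lt;pre&gt;import torch

# Freeze everything except the cross-modal projector before the rebalancing
# fine-tune. &quot;audio_projector&quot; is a hypothetical module name; the real
# attribute differs per model family.
def projector_only_optimizer(model, lr=1e-4):
    for name, param in model.named_parameters():
        param.requires_grad = &quot;audio_projector&quot; in name
    trainable = [p for p in model.parameters() if p.requires_grad]
    return torch.optim.AdamW(trainable, lr=lr)&lt;/pre&gt;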

&lt;h2 id=&quot;takeaways&quot;&gt;Takeaways&lt;/h2&gt;

&lt;p&gt;Two things to walk away with. First, “multimodal” is doing a lot of unverified work in current AV-LLM marketing. These models &lt;em&gt;can&lt;/em&gt; process audio, but they mostly &lt;em&gt;don’t&lt;/em&gt;. Anyone deploying them in settings where audio is safety-critical — accessibility tools for blind users, audio-based anomaly detection, anything where the soundtrack carries information the visuals don’t — should test for this bias before trusting the output.&lt;/p&gt;

&lt;p&gt;Second, the bias is fixable. The audio encoder is competent. The projector is the bottleneck. Targeted fine-tuning on audio-critical examples meaningfully shifts the modality reliance ratio. There’s no architectural reason these models have to be effectively deaf to the soundtrack; we just trained them that way.&lt;/p&gt;

&lt;h2 id=&quot;citation&quot;&gt;Citation&lt;/h2&gt;

&lt;div class=&quot;citation-block&quot;&gt;
&lt;span class=&quot;cite-label&quot;&gt;BibTeX&lt;/span&gt;
&lt;pre&gt;@article{jayakumar2026where,
  title={Where Does the Sound Go? Probing Audio-Visual Language Models for Modality Bias},
  author={Jayakumar, Kaousheik and Rege, Aniket and Ramesh, Mahesh},
  journal={arXiv preprint},
  year={2026}
}&lt;/pre&gt;
&lt;/div&gt;
</description>
        <pubDate>Sat, 21 Mar 2026 00:00:00 +0000</pubDate>
        <link>https://kaousheik-26.github.io//2026/03/21/audio-visual-interpretability.html</link>
        <guid isPermaLink="true">https://kaousheik-26.github.io//2026/03/21/audio-visual-interpretability.html</guid>
        
        
      </item>
    
  </channel>
</rss>
