Research

My research centers on building multimodal intelligence; machines that can perceive, reason over, and ground language in what they hear and see. Most of my recent work focuses on Large Audio(-Visual)-Language Models: how they fuse audio, vision, and text, where that fusion fails (temporal grounding, modality bias), and how to post-train them for stronger reasoning over long and complex real-world videos.

Alongside this, I study reasoning and reinforcement learning in language agents; what cooperative and strategic capabilities emerge in LLMs, and what actually transfers when RL trains an agent. My earlier work at IIT Madras and on the Govt. of India Bhashini project built robust multilingual speech systems for low-resource Indian languages.

Audio Understanding and Processing

Ramaneswaran Selvakumar*, Kaousheik Jayakumar*, S Sakshi, Sreyan Ghosh, Ruohan Gao, Dinesh Manocha

Do Audio-Visual Large Language Models Really See and Hear?

AVLLMs exhibit a strong vision bias in audio understanding, hallucinating sounds from what they see rather than what they hear. We conduct mechanistic interpretability experiments showing that rich audio semantics exist internally, cross-modal transfer occurs in mid-to-deep layers where vision dominates, and this bias likely stems from vision-centric training.

Do AV LLMs See and Hear
Apoorva Kulkarni, Kaousheik Jayakumar, Sreyan Ghosh, Sarah Wiegreffe, Dinesh Manocha, Ramani Duraiswami

A Closer Look at Failure Modes in Temporal Understanding of Large Audio-Language Models

LALMs consistently fail at foundational temporal tasks — identifying which sound started first, ended last, or lasted longest. We introduce a 1,657-question benchmark for mechanistic diagnosis and find that the problem isn't just modality imbalance: redistributing attention across audio tokens (scaling) outperforms simply increasing audio attention (upweighting). Layer-targeted scaling improves accuracy by 3.2% with no fine-tuning.

Temporal reasoning task examples
Sreyan Ghosh*, Arushi Goel*, Kaousheik Jayakumar, et al. (in collaboration with NVIDIA)

Audio-Visual Flamingo: Open Audio-Visual Intelligence for Long and Complex Videos

An open audio-visual model built to understand long and complex videos, jointly reasoning over speech, sound, and what's on screen across extended temporal contexts — advancing open audio-visual intelligence beyond short clips.

Audio-Visual Flamingo
Sreyan Ghosh*, Arushi Goel*, Kaousheik Jayakumar, et al. (in collaboration with NVIDIA)

Audio Flamingo Next: Next-Generation Open Audio-Language Models for Speech, Sound, and Music

The next generation of open audio-language models spanning speech, sound, and music — pushing open audio intelligence with stronger reasoning and broader coverage across the full audio spectrum.

Audio Flamingo Next
Arushi Goel*, Sreyan Ghosh*, Vatsal Agarwal*, Nishit Anand*, Kaousheik Jayakumar, et al. (in collaboration with NVIDIA)

Massive Multi-Task Omni Understanding and Reasoning Benchmark for Long Real-World Videos

A massive multi-task benchmark for omni understanding and reasoning over long, real-world videos, stress-testing models across audio, vision, and language jointly rather than in isolation.

MMOU
Kaousheik Jayakumar, Vrunda N. Sukhadia, A Arunkumar, S Umesh

The Tag-Team Approach: Leveraging CLS and Language Tagging for Enhancing Multilingual ASR

We leverage CLS tokens and explicit language tagging to improve multilingual ASR across Indian languages, boosting recognition for low-resource languages over the corresponding monolingual baselines.

Tag-Team Multilingual ASR

Natural Language Processing

Mahesh Ramesh, Kaousheik Jayakumar, Aswinkumar Ramkumar, Pavan Thodima, Aniket Rege, Emmanouil-Vasileios Vlatakis-Gkaragkounis

Sparks of Cooperative Reasoning: LLMs as Strategic Hanabi Agents

We benchmark 17 LLMs as strategic agents in Hanabi across 2–5 player settings and three scaffolds: Watson, Sherlock, and Mycroft. Our main scaffold, Mycroft, tests whether LLMs can maintain their own evolving belief state across turns without engine-provided deductions. Recent reasoning models show promising cooperative behavior, but still lag behind strong human and specialized Hanabi agents. We also release Hanabi trajectories and move-level judge data for training, and show that a post-trained Qwen3-4B model can substantially close the gap while transferring to other tasks.

Hanabi Cooperative Reasoning
Mahesh Ramesh*, Kaousheik Jayakumar*, Hemanth Ram, Pavan Thodima, Ramani Duraiswami, Dinesh Manocha, Aniket Rege, Emmanouil-Vasileios Vlatakis-Gkaragkounis

Playing with Fire: What Transfers When RL Trains a Language Agent?

We study what actually transfers when reinforcement learning trains a language agent — separating the skills that generalize to out-of-domain tasks from those that overfit to the training environment.

Playing with Fire teaser