Research

My research focuses on advancing multimodal intelligence, with a core emphasis on audio—spanning speech, sounds, and music. I work on challenges such as developing data- and compute-efficient models, improving multimodal representation learning, and enhancing perception and reasoning in AI systems.

In my early work, I explored resource-efficient deep learning, proposing methods for training models under constraints of limited labeled/unlabeled data and compute. This included techniques such as synthetic data augmentation and self-supervised learning to enable more effective downstream learning.

Currently, my research is directed toward building omni-intelligence, developing Large Multimodal Models (LMMs) that seamlessly integrate audio, language, vision, and beyond. I focus on advancing architectures, scalable synthetic data pipelines, and cross-modal reasoning to move toward more general-purpose AI systems. My publications span diverse areas in multimodal AI, including natural language understanding, audio understanding, audio generation, compositional reasoning, Large Audio-Language Models (LALMs), and multimodal pre-training and fine-tuning.

Google Scholar

Semantic Scholar

Selected Works

Audio Processing (Speech, Sound & Music)

Audio-Visual Flamingo: Open Audio-Visual Intelligence for Long and Complex Videos
Sreyan Ghosh*, Arushi Goel*, Kaousheik Jayakumar, Lasha Koroshinadze, Nishit Anand, Siddharth Gururani, Hanrong Ye, Pritam Biswas, Yuanhang Su, Ehsan Hosseini-Asl, Sang-gil Lee, Zhifeng Kong, Jaehyeon Kim, Sungwon Kim, S Sakshi, Ramani Duraiswami, Dinesh Manocha, Andrew Tao, Mohammad Shoeybi, Bryan Catanzaro, Ming-Yu Liu, Wei Ping
Project Page
arXiv 2026
Audio Flamingo Next: Next-Generation Open Audio-Language Models for Speech, Sound, and Music
Sreyan Ghosh*, Arushi Goel*, Kaousheik Jayakumar, Lasha Koroshinadze, Nishit Anand, Zhifeng Kong, Siddharth Gururani, Sang-gil Lee, Jaehyeon Kim, Aya Aljafari, Chao-Han Huck Yang, Sungwon Kim, Ramani Duraiswami, Dinesh Manocha, Mohammad Shoeybi, Bryan Catanzaro, Ming-Yu Liu, Wei Ping
Project Page
arXiv 2026
MMOU: A Massive Multi-Task Omni Understanding and Reasoning Benchmark for Long and Complex Real-World Videos
Arushi Goel*, Sreyan Ghosh*, Vatsal Agarwal, Nishit Anand, Kaousheik Jayakumar, Lasha Koroshinadze, Yao Xu, Katie Lyons, James Case, Karan Sapra, Kevin J. Shih, Siddharth Gururani, Abhinav Shrivastava, Ramani Duraiswami, Dinesh Manocha, Andrew Tao, Bryan Catanzaro, Mohammad Shoeybi, Wei Ping
Data / Project Page
arXiv 2026
Unified Audio Intelligence Without Regressing on Text Intelligence
Nvidia ADLR Audio Team
Project Page
Technical Report 2026
Cosmos 3: Omnimodal World Models for Physical AI
Nvidia Cosmos Team
Project Page
Technical Report 2026
Nemotron 3 Nano Omni: Efficient and Open Multimodal Intelligence
Nvidia ADLR Team
Model
Technical Report 2026
Audio Flamingo 3: Advancing Audio Intelligence with Fully Open Large Audio Language Models
Sreyan Ghosh*, Arushi Goel*, Jaehyeon Kim, Sonal Kumar, Zhifeng Kong, Sang-gil Lee, Chao-Han Huck Yang, Ramani Duraiswami, Dinesh Manocha, Rafael Valle, Bryan Catanzaro
Project Page / Code / Demo / Checkpoints and Data / Video Demo / Tweet / Slides / Poster / Coverage
NeurIPS 2025 (Spotlight)
Music Flamingo: Scaling Music Understanding in Audio Language Models
Sreyan Ghosh*, Arushi Goel*, Lasha Koroshinadze, Sang-gil Lee, Zhifeng Kong, Joao Felipe Santos, Ramani Duraiswami, Dinesh Manocha, Wei Ping, Mohammad Shoeybi, Bryan Catanzaro
Project Page / Code / Demo / Checkpoints and Data / Coverage / UMG-Nvidia Collab Press Release
ICLR 2026
Audio Flamingo 2: An Audio-Language Model with Long-Audio Understanding and Expert Reasoning Abilities
Sreyan Ghosh, Zhifeng Kong, Sonal Kumar, S Sakshi, Jaehyeon Kim, Wei Ping, Rafael Valle, Dinesh Manocha, Bryan Catanzaro
Project Page / Code / Demo / Tweet / Slides / Poster / Coverage 1 / Coverage 2 / Coverage 3
ICML 2025
GAMA: A Large Audio-Language Model with Advanced Audio Understanding and Complex Reasoning Abilities
Sreyan Ghosh*, Sonal Kumar*, Ashish Seth, Chandra Kiran Reddy Evuru, Utkarsh Tyagi, S Sakshi, Oriol Nieto, Ramani Duraiswami, Dinesh Manocha
Project Website / Slides / Summary Tweet / Poster / Coverage 1 / Coverage 2
EMNLP 2024 (Oral)
UALM: Unified Audio Language Model for Understanding, Generation and Reasoning
Jinchuan Tian, Sang-gil Lee, Zhifeng Kong, Sreyan Ghosh, Arushi Goel, Chao-Han Huck Yang, Wenliang Dai, Zihan Liu, Hanrong Ye, Shinji Watanabe, Mohammad Shoeybi, Bryan Catanzaro, Rafael Valle, Wei Ping
Project Page / Code
ICLR 2026 (Oral)
MMAU-Pro: A Challenging and Comprehensive Benchmark for Holistic Evaluation of Audio General Intelligence
Sonal Kumar et al., Sreyan Ghosh, Ramani Duraiswami
Project Website / Poster
AAAI 2026
MMAU: A Massive Multi-Task Audio Understanding and Reasoning Benchmark
S Sakshi*, Utkarsh Tyagi*, Sonal Kumar*, Ashish Seth*, Ramaneswaran Selvakumar, Oriol Nieto, Ramani Duraiswami, Sreyan Ghosh*, Dinesh Manocha
Project Website / Slides / Poster / Talk
ICLR 2025 (Spotlight)

Natural Language and Visual Processing

Visual Description Grounding Reduces Hallucinations and Boosts Reasoning in LVLMs
Sreyan Ghosh*, Chandra Kiran Reddy Evuru*, Sonal Kumar*, Utkarsh Tyagi, Oriol Nieto, Zeyu Jin, Dinesh Manocha
Project / Slides / Summary Tweet
ICLR 2025
A Closer Look at the Limitations of Instruction Tuning
Sreyan Ghosh*, Chandra Kiran Reddy Evuru*, Sonal Kumar*, Ramaneswaran S, Deepali Aneja, Zeyu Jin, Ramani Duraiswami, Dinesh Manocha
Summary Tweet / Poster / Slides / Video
ICML 2024
ABEX: Data Augmentation for Low-Resource NLU via Expanding Abstract Descriptions
Sreyan Ghosh*, Utkarsh Tyagi*, Sonal Kumar, Chandra Kiran Reddy Evuru, Ramaneswaran S, S. Sakshi, Dinesh Manocha
Code / Poster
ACL 2024

Other Papers

Audio Processing (Speech, Sound & Music)

OmniVinci: Enhancing Architecture and Data for Omni-Modal Understanding LLM
Hanrong Ye, Chao-Han Huck Yang, Arushi Goel, Wei Huang, Ligeng Zhu, Yuanhang Su, Sean Lin, An-Chieh Cheng, Zhen Wan, Jinchuan Tian, Yuming Lou, Dong Yang, Zhijian Liu, Yukang Chen, Ambrish Dantrey, Ehsan Jahangiri, Sreyan Ghosh, Daguang Xu, Ehsan Hosseini-Asl, Danial Mohseni Taheri, Vidya Murali, Sifei Liu, Yao Lu, Oluwatobi Olabiyi, Yu-Chiang Frank Wang, Rafael Valle, Bryan Catanzaro, Andrew Tao, Song Han, Jan Kautz, Hongxu Yin, Pavlo Molchanov
Project Page / Code / Checkpoints
ICLR 2026
Failing Forward: Improving Generative Error Correction for ASR with Synthetic Data and Retrieval Augmentation
Sreyan Ghosh, Mohammad Sadegh Rasooli, Michael Levit, Peidong Wang, Jian Xue, Dinesh Manocha, Jinyu Li
ACL 2025 (Findings)
Synthio: Augmenting Small-Scale Audio Classification Datasets with Synthetic Data
Sreyan Ghosh, Sonal Kumar, Zhifeng Kong, Rafael Valle, Bryan Catanzaro, Dinesh Manocha
Code / Text-to-Audio Demo / Slides
ICLR 2025
MultiVox: A Benchmark for Evaluating Voice Assistants for Multimodal Interactions
Ramaneswaran Selvakumar, Ashish Seth, Nishit Anand, Utkarsh Tyagi, Sonal Kumar, Sreyan Ghosh, Dinesh Manocha
Code / Poster / Slides
EMNLP 2025
EGOILLUSION: Benchmarking Hallucinations in Egocentric Video Understanding
Ashish Seth, Utkarsh Tyagi, Ramaneswaran Selvakumar, Nishit Anand, Sonal Kumar, Sreyan Ghosh, Ramani Duraiswami, Chirag Agarwal, Dinesh Manocha
Project Website / Poster
EMNLP 2025
Multi-Domain Audio Question Answering Benchmark Toward Acoustic Content Reasoning
Chao-Han Huck Yang, Sreyan Ghosh, Qing Wang, Jaeyeon Kim, Hengyi Hong, Sonal Kumar, Guirui Zhong, Zhifeng Kong, S Sakshi, Vaibhavi Lokegaonkar, Oriol Nieto, Ramani Duraiswami, Dinesh Manocha, Gunhee Kim, Jun Du, Rafael Valle, Bryan Catanzaro
Challenge
ICASSP 2026
ReCLAP: Improving Zero Shot Audio Classification by Describing Sounds
Sreyan Ghosh, Sonal Kumar, Chandra Kiran Reddy Evuru, Oriol Nieto, Ramani Duraiswami, Dinesh Manocha
Code / Slides
ICASSP 2025 (Oral)
ProSE: Diffusion Priors for Speech Enhancement
Sonal Kumar, *Sreyan Ghosh*, Utkarsh Tyagi, Anton Jeran Ratnarajah, Chandra Kiran Reddy Evuru, Ramani Duraiswami, Dinesh Manocha
Code / Slides / Talk
NAACL 2025 (Oral)
PAT: Parameter-Free Audio-Text Aligner to Boost Zero-Shot Audio Classification
Ashish Seth, Ramaneswaran Selvakumar, Sonal Kumar, Sreyan Ghosh, Dinesh Manocha
Code / Talk / Slides
NAACL 2025 (Oral)
Do Audio-Language Models Understand Linguistic Variations?
Ramaneswaran Selvakumar, Sonal Kumar, Hemant Kumar Giri, Nishit Anand, Ashish Seth, Sreyan Ghosh, Dinesh Manocha
Poster
NAACL 2025
EH-MAM: Easy-to-Hard Masked Acoustic Modeling for Self-Supervised Speech Representation Learning
Ashish Seth*, Ramaneswaran Selvakumar, S Sakshi, Sonal Kumar, Sreyan Ghosh*, Dinesh Manocha
GitHub / Poster / Talk
EMNLP 2024 (Oral)
LipGER: Visually-Conditioned Generative Error Correction for Robust Automatic Speech Recognition
Sreyan Ghosh*, Sonal Kumar, Ashish Seth, Purva Chiniya, Utkarsh Tyagi, Ramani Duraiswami, Dinesh Manocha
Code / Slides / Talk
Interspeech 2024 (Oral)
AV-RIR: Audio-Visual Room Impulse Response Estimation
Anton Ratnarajah, Sreyan Ghosh, Sonal Kumar, Purva Chiniya, Dinesh Manocha
Project Website / Poster
CVPR 2024
CompA: Addressing the Gap in Compositional Reasoning in Audio-Language Models
Sreyan Ghosh*, Ashish Seth*, Sonal Kumar*, Utkarsh Tyagi*, Chandra Kiran Reddy Evuru*, Ramaneswaran S, S Sakshi, Oriol Nieto, Ramani Duraiswami, Dinesh Manocha
Project Website / Slides / Poster
ICLR 2024
RECAP: Retrieval-Augmented Audio Captioning
Sreyan Ghosh, Sonal Kumar, Chandra Kiran Reddy Evuru, Ramani Duraiswami, Dinesh Manocha
Code / Slides
ICASSP 2024 (Oral)
Stable Distillation: Regularizing Continued Pre-training for Low-Resource Automatic Speech Recognition
Ashish Seth*, Sreyan Ghosh*, S. Umesh, Dinesh Manocha
Code / Poster
ICASSP 2024
FusDom: Combining In-Domain and Out-of-Domain Knowledge for Continuous Self-Supervised Learning
Ashish Seth*, Sreyan Ghosh*, S. Umesh, Dinesh Manocha
Code / Poster
ICASSP 2024
AdVerb: Visually Guided Audio Dereverberation
Sanjoy Chowdhury*, Sreyan Ghosh*, Subhrajyoti Dasgupta, Anton Ratnarajah, Utkarsh Tyagi, Dinesh Manocha
Poster
ICCV 2023
MMER: Multimodal Multi-task Learning for Speech Emotion Recognition
Sreyan Ghosh, Utkarsh Tyagi, Ramaneswaran S, Harshvardhan Srivastava, Dinesh Manocha
Code / Slides
Interspeech 2023 (Oral)
Decorrelating Feature Spaces for Learning General Purpose Audio Representations
Sreyan Ghosh*, Ashish Seth*, S. Umesh
Code / Poster
IEEE JSTSP Special Issue on Self-Supervised Learning for Speech and Audio Processing
ICASSP 2023
data2vec-aqc: Search for the right Teaching Assistant in the Teacher-Student training setup
Lodagala V S V Durga Prasad*, Sreyan Ghosh*, S. Umesh
Code / Leaderboard
ICASSP 2023 (Oral)
MAST: Multiscale Audio Spectrogram Transformers
Sreyan Ghosh*, Ashish Seth*, S. Umesh, Dinesh Manocha
Code / Poster
ICASSP 2023
SLICER: Learning universal audio representations using low-resource self-supervised pre-training
Ashish Seth*, Sreyan Ghosh*, S. Umesh, Dinesh Manocha
Code / Poster
ICASSP 2023
PADA: Pruning Assisted Domain Adaptation for Self-Supervised Speech Representations
Lodagala V S V Durga Prasad, Sreyan Ghosh, S. Umesh
Code
IEEE SLT 2022
CCC-WAV2VEC 2.0: Clustering aided cross contrastive self-supervised learning of speech representations
Lodagala V S V Durga Prasad, Sreyan Ghosh, S. Umesh
Code / Leaderboard
IEEE SLT 2022
Span Classification with Structured Information for Disfluency Detection in Spoken Utterances
Sreyan Ghosh, Sonal Kumar, Yaman Kumar Singla, Rajiv Ratn Shah, S. Umesh
Code
Interspeech 2022 (Oral)
DeToxy: A Large-Scale Multimodal Dataset for Toxicity Classification in Spoken Utterances
Sreyan Ghosh, Samden Lepcha, Sakshi, Rajiv Ratn Shah, S. Umesh
Code / Data
Interspeech 2022
End-to-end Named Entity Recognition from English Speech
Hemant Yadav, Sreyan Ghosh, Yi Yu, Rajiv Ratn Shah
Code / Data
Interspeech 2020

Natural Language and Visual Processing

ASPIRE: Language-Guided Augmentation for Robust Image Classification
Sreyan Ghosh*, Chandra Kiran Reddy Evuru*, Sonal Kumar*, S. Sakshi, Utkarsh Tyagi, Dinesh Manocha
Code / Slides / Talk / Poster
ACL 2024 (Findings)
Do Vision-Language Models Understand Compound Nouns?
Sonal Kumar*, Sreyan Ghosh*, S Sakshi, Utkarsh Tyagi, Dinesh Manocha
Code / Slides / Talk / Poster
NAACL 2024
CoDa: Constrained Generation based Data Augmentation for Low-Resource NLP
Chandra Kiran Reddy Evuru*, Sreyan Ghosh*, Sonal Kumar, Ramaneswaran S, Utkarsh Tyagi, Dinesh Manocha
Code / Talk / Poster
NAACL 2024 (Findings)
DALE: Generative Data Augmentation for Low-Resource Legal NLP
Sreyan Ghosh*, Chandra Kiran Reddy Evuru*, Sonal Kumar, Ramaneswaran S, S Sakshi, Utkarsh Tyagi, Dinesh Manocha
Code / Poster
EMNLP 2023
CoSyn: Detecting Implicit Hate Speech in Online Conversations Using a Context Synergized Hyperbolic Network
Sreyan Ghosh*, Manan Suri*, Purva Chiniya*, Utkarsh Tyagi*, Sonal Kumar*, Dinesh Manocha
Code / Poster
EMNLP 2023
ACLM: A Selective-Denoising based Generative Data Augmentation Approach for Low-Resource Complex NER
Sreyan Ghosh*, Utkarsh Tyagi*, Manan Suri, Sonal Kumar, Ramaneswaran S, Dinesh Manocha
Code / Poster
ACL 2023
BioAug: Conditional Generation based Data Augmentation for Low-Resource Biomedical NER
Sreyan Ghosh*, Utkarsh Tyagi*, Sonal Kumar*, Dinesh Manocha
Code / Poster
SIGIR 2023

Workshop

UNFUSED: UNsupervised Finetuning Using SElf supervised Distillation
Ashish Seth*, Sreyan Ghosh*, S. Umesh, Dinesh Manocha
Code / Poster
ICASSP 2023 SASB Workshop
DeLoRes: Decorrelating Latent Spaces for Low-Resource Audio Representation Learning
Sreyan Ghosh, Ashish Seth, Deepak Mittal, Maneesh Singh, S. Umesh
Code
SAS Workshop @ AAAI 2022
Leveraging Transformers for Hate Speech Detection in Conversational Code-Mixed Tweets
Zaki Mustafa Farooqi, Sreyan Ghosh, Rajiv Ratn Shah
Leaderboard (Team Name: MIDAS@IIIT-D)
FIRE 2021
Cisco at SemEval-2021 Task 5: What’s Toxic?: Leveraging Transformers for Multiple Toxic Span Extraction from Online Comments
Sreyan Ghosh, Sonal Kumar
Code
SemEval-2021 @ ACL 2021
Cisco at AAAI-CAD21 shared task: Predicting Emphasis in Presentation Slides using Contextualized Embeddings
Sreyan Ghosh, Sonal Kumar, Harsh Jalan, Hemant Yadav, Rajiv Ratn Shah
Code
CAD-21 @ AAAI 2021

Pre-prints

Deep Clustering for learning general-purpose Audio Representations
Sreyan Ghosh*, Ashish Seth*, Sandesh Katta*, S. Umesh
Code
Pre-print
Analyzing the factors affecting usefulness of Self-Supervised Pre-trained Representations for Speech Recognition
Lodagala V S V Durga Prasad*, Ashish Seth*, Sreyan Ghosh*, S. Umesh
Pre-print