My research focuses on advancing multimodal intelligence, with a core emphasis on audio, spanning speech, sounds, and music. I work on challenges such as developing data- and compute-efficient models, improving multimodal representation learning, and enhancing perception and reasoning in AI systems.

In my early work, I explored resource-efficient deep learning, proposing methods for training models under constraints of limited labeled and unlabeled data and limited compute. These efforts included techniques such as synthetic data augmentation and self-supervised learning to enable more effective downstream learning.

Currently, my research is directed toward building omni-intelligence: developing Large Multimodal Models (LMMs) that seamlessly integrate audio, language, vision, and other modalities. I focus on advancing architectures, scalable synthetic data pipelines, and cross-modal reasoning to move toward more general-purpose AI systems. My publications span diverse areas in multimodal AI, including natural language understanding, audio understanding, audio generation, compositional reasoning, Large Audio-Language Models (LALMs), and multimodal pre-training and fine-tuning.

Google Scholar | Semantic Scholar

Pre-prints

Audio and Spoken Language Processing (Chronological)

Natural Language Processing (Chronological)

Workshop