Research
My research focuses on advancing multimodal intelligence, with a core emphasis on audio—spanning speech, sounds, and music. I work on challenges such as developing data- and compute-efficient models, improving multimodal representation learning, and enhancing perception and reasoning in AI systems.
In my early work, I explored resource-efficient deep learning, proposing methods for training models when labeled data, unlabeled data, and compute are limited. These included synthetic data augmentation and self-supervised learning, both aimed at more effective downstream learning.
Currently, my research is directed toward building omni-intelligence, developing Large Multimodal Models (LMMs) that seamlessly integrate audio, language, vision, and beyond. I focus on advancing architectures, scalable synthetic data pipelines, and cross-modal reasoning to move toward more general-purpose AI systems. My publications span diverse areas in multimodal AI, including natural language understanding, audio understanding, audio generation, compositional reasoning, Large Audio-Language Models (LALMs), and multimodal pre-training and fine-tuning.
Google Scholar / Semantic Scholar
Pre-prints
-
Deep Clustering for learning general-purpose Audio Representations
Sreyan Ghosh*, Ashish Seth*, Sandesh Katta*, S. Umesh
Code
Pre-print -
Analyzing the factors affecting usefulness of Self-Supervised Pre-trained Representations for Speech Recognition
Lodagala V S V Durga Prasad*, Ashish Seth*, Sreyan Ghosh*, S. Umesh
Pre-print
Audio and Spoken Language Processing (Chronological)
-
Audio Flamingo 3: Advancing Audio Intelligence with Fully Open Large Audio Language Models
Arushi Goel, Sreyan Ghosh, Jaehyeon Kim, Sonal Kumar, Zhifeng Kong, Sang-gil Lee, Chao-Han Huck Yang, Ramani Duraiswami, Dinesh Manocha, Rafael Valle, Bryan Catanzaro
Project Page / Code / Demo / Checkpoints and Data / Video Demo / Tweet / Coverage
Under Review -
Audio Flamingo 2: An Audio-Language Model with Long-Audio Understanding and Expert Reasoning Abilities
Sreyan Ghosh, Zhifeng Kong, Sonal Kumar, S Sakshi, Jaehyeon Kim, Wei Ping, Rafael Valle, Dinesh Manocha, Bryan Catanzaro
Project Page / Code / Demo / Tweet / Coverage 1 / Coverage 2 / Coverage 3
ICML 2025 -
GAMA: A Large Audio-Language Model with Advanced Audio Understanding and Complex Reasoning Abilities
Sreyan Ghosh*, Sonal Kumar*, Ashish Seth, Chandra Kiran Reddy Evuru, Utkarsh Tyagi, S Sakshi, Oriol Nieto, Ramani Duraiswami, Dinesh Manocha
Project Website / Slides / Summary Tweet / Coverage 1 / Coverage 2
EMNLP 2024 (Oral) -
MMAU: A Massive Multi-Task Audio Understanding and Reasoning Benchmark
S Sakshi*, Utkarsh Tyagi*, Sonal Kumar*, Ashish Seth*, Ramaneswaran Selvakumar, Oriol Nieto, Ramani Duraiswami, Sreyan Ghosh*, Dinesh Manocha
Project Website / Slides / Talk
ICLR 2025 (Spotlight) -
Failing Forward: Improving Generative Error Correction for ASR with Synthetic Data and Retrieval Augmentation
Sreyan Ghosh, Mohammad Sadegh Rasooli, Michael Levit, Peidong Wang, Jian Xue, Dinesh Manocha, Jinyu Li
ACL 2025 (Findings) -
Synthio: Augmenting Small-Scale Audio Classification Datasets with Synthetic Data
Sreyan Ghosh, Sonal Kumar, Zhifeng Kong, Rafael Valle, Bryan Catanzaro, Dinesh Manocha
Code / Text-to-Audio Demo / Slides
ICLR 2025 -
ReCLAP: Improving Zero Shot Audio Classification by Describing Sounds
Sreyan Ghosh, Sonal Kumar, Chandra Kiran Reddy Evuru, Oriol Nieto, Ramani Duraiswami, Dinesh Manocha
Code / Slides
ICASSP 2025 (Oral) -
ProSE: Diffusion Priors for Speech Enhancement
Sonal Kumar, Sreyan Ghosh, Utkarsh Tyagi, Anton Jeran Ratnarajah, Chandra Kiran Reddy Evuru, Ramani Duraiswami, Dinesh Manocha
Code / Slides / Talk
NAACL 2025 (Oral) -
PAT: Parameter-Free Audio-Text Aligner to Boost Zero-Shot Audio Classification
Ashish Seth, Ramaneswaran Selvakumar, Sonal Kumar, Sreyan Ghosh, Dinesh Manocha
Code / Talk / Slides
NAACL 2025 (Oral) -
Do Audio-Language Models Understand Linguistic Variations?
Ramaneswaran Selvakumar, Sonal Kumar, Hemant Kumar Giri, Nishit Anand, Ashish Seth, Sreyan Ghosh, Dinesh Manocha
NAACL 2025 -
EH-MAM: Easy-to-Hard Masked Acoustic Modeling for Self-Supervised Speech Representation Learning
Ashish Seth*, Ramaneswaran Selvakumar, S Sakshi, Sonal Kumar, Sreyan Ghosh*, Dinesh Manocha
GitHub
EMNLP 2024 (Oral) -
LipGER: Visually-Conditioned Generative Error Correction for Robust Automatic Speech Recognition
Sreyan Ghosh*, Sonal Kumar, Ashish Seth, Purva Chiniya, Utkarsh Tyagi, Ramani Duraiswami, Dinesh Manocha
Code / Slides / Talk
Interspeech 2024 (Oral) -
AV-RIR: Audio-Visual Room Impulse Response Estimation
Anton Ratnarajah, Sreyan Ghosh, Sonal Kumar, Purva Chiniya, Dinesh Manocha
Project Website / Poster
CVPR 2024 -
CompA: Addressing the Gap in Compositional Reasoning in Audio-Language Models
Sreyan Ghosh*, Ashish Seth*, Sonal Kumar*, Utkarsh Tyagi*, Chandra Kiran Reddy Evuru*, Ramaneswaran S, S Sakshi, Oriol Nieto, Ramani Duraiswami, Dinesh Manocha
Project Website / Slides / Poster
ICLR 2024 -
RECAP: Retrieval-Augmented Audio Captioning
Sreyan Ghosh, Sonal Kumar, Chandra Kiran Reddy Evuru, Ramani Duraiswami, Dinesh Manocha
Code / Slides
ICASSP 2024 (Oral) -
Stable Distillation: Regularizing Continued Pre-training for Low-Resource Automatic Speech Recognition
Ashish Seth*, Sreyan Ghosh*, S. Umesh, Dinesh Manocha
Code / Poster
ICASSP 2024 -
FusDom: Combining In-Domain and Out-of-Domain Knowledge for Continuous Self-Supervised Learning
Ashish Seth*, Sreyan Ghosh*, S. Umesh, Dinesh Manocha
Code / Poster
ICASSP 2024 -
AdVerb: Visually Guided Audio Dereverberation
Sanjoy Chowdhury*, Sreyan Ghosh*, Subhrajyoti Dasgupta, Anton Ratnarajah, Utkarsh Tyagi, Dinesh Manocha
Poster
ICCV 2023 -
MMER: Multimodal Multi-task Learning for Speech Emotion Recognition
Sreyan Ghosh, Utkarsh Tyagi, Ramaneswaran S, Harshvardhan Srivastava, Dinesh Manocha
Code / Slides
Interspeech 2023 (Oral) -
Decorrelating Feature Spaces for Learning General Purpose Audio Representations
Sreyan Ghosh*, Ashish Seth*, S. Umesh
Code / Poster
IEEE JSTSP Special Issue on Self-Supervised Learning for Speech and Audio Processing
ICASSP 2023 -
data2vec-aqc: Search for the right Teaching Assistant in the Teacher-Student training setup
Lodagala V S V Durga Prasad*, Sreyan Ghosh*, S. Umesh
Code / Leaderboard
ICASSP 2023 (Oral) -
MAST: Multiscale Audio Spectrogram Transformers
Sreyan Ghosh*, Ashish Seth*, S. Umesh, Dinesh Manocha
Code / Poster
ICASSP 2023 -
SLICER: Learning universal audio representations using low-resource self-supervised pre-training
Ashish Seth*, Sreyan Ghosh*, S. Umesh, Dinesh Manocha
Code / Poster
ICASSP 2023 -
PADA: Pruning Assisted Domain Adaptation for Self-Supervised Speech Representations
Lodagala V S V Durga Prasad, Sreyan Ghosh, S. Umesh
Code
IEEE SLT 2022 -
CCC-WAV2VEC 2.0: Clustering aided cross contrastive self-supervised learning of speech representations
Lodagala V S V Durga Prasad, Sreyan Ghosh, S. Umesh
Code / Leaderboard
IEEE SLT 2022 -
Span Classification with Structured Information for Disfluency Detection in Spoken Utterances
Sreyan Ghosh, Sonal Kumar, Yaman Kumar Singla, Rajiv Ratn Shah, S. Umesh
Code
Interspeech 2022 (Oral) -
DeToxy: A Large-Scale Multimodal Dataset for Toxicity Classification in Spoken Utterances
Sreyan Ghosh, Samden Lepcha, Sakshi, Rajiv Ratn Shah, S. Umesh
Code / Data
Interspeech 2022 -
End-to-end Named Entity Recognition from English Speech
Hemant Yadav, Sreyan Ghosh, Yi Yu, Rajiv Ratn Shah
Code / Data
Interspeech 2020
Natural Language Processing (Chronological)
-
Visual Description Grounding Reduces Hallucinations and Boosts Reasoning in LVLMs
Sreyan Ghosh*, Chandra Kiran Reddy Evuru*, Sonal Kumar*, Utkarsh Tyagi, Oriol Nieto, Zeyu Jin, Dinesh Manocha
Project / Slides / Summary Tweet
ICLR 2025 -
ABEX: Data Augmentation for Low-Resource NLU via Expanding Abstract Descriptions
Sreyan Ghosh*, Utkarsh Tyagi*, Sonal Kumar, Chandra Kiran Reddy Evuru, Ramaneswaran S, S. Sakshi, Dinesh Manocha
Code / Poster
ACL 2024 -
ASPIRE: Language-Guided Augmentation for Robust Image Classification
Sreyan Ghosh*, Chandra Kiran Reddy Evuru*, Sonal Kumar*, S. Sakshi, Utkarsh Tyagi, Dinesh Manocha
Code / Slides / Talk / Poster
ACL 2024 Findings -
A Closer Look at the Limitations of Instruction Tuning
Sreyan Ghosh*, Chandra Kiran Reddy Evuru*, Sonal Kumar*, Ramaneswaran S, Deepali Aneja, Zeyu Jin, Ramani Duraiswami, Dinesh Manocha
Summary Tweet / Poster / Slides / Video
ICML 2024 -
Do Vision-Language Models Understand Compound Nouns?
Sonal Kumar*, Sreyan Ghosh*, S Sakshi, Utkarsh Tyagi, Dinesh Manocha
Code / Slides / Talk / Poster
NAACL 2024 -
CoDa: Constrained Generation based Data Augmentation for Low-Resource NLP
Chandra Kiran Reddy Evuru*, Sreyan Ghosh*, Sonal Kumar, Ramaneswaran S, Utkarsh Tyagi, Dinesh Manocha
Code / Talk / Poster
NAACL 2024 Findings -
DALE: Generative Data Augmentation for Low-Resource Legal NLP
Sreyan Ghosh*, Chandra Kiran Reddy Evuru*, Sonal Kumar, Ramaneswaran S, S Sakshi, Utkarsh Tyagi, Dinesh Manocha
Code / Poster
EMNLP 2023 -
CoSyn: Detecting Implicit Hate Speech in Online Conversations Using a Context Synergized Hyperbolic Network
Sreyan Ghosh*, Manan Suri*, Purva Chiniya*, Utkarsh Tyagi*, Sonal Kumar*, Dinesh Manocha
Code / Poster
EMNLP 2023 -
ACLM: A Selective-Denoising based Generative Data Augmentation Approach for Low-Resource Complex NER
Sreyan Ghosh*, Utkarsh Tyagi*, Manan Suri, Sonal Kumar, Ramaneswaran S, Dinesh Manocha
Code / Poster
ACL 2023 -
BioAug: Conditional Generation based Data Augmentation for Low-Resource Biomedical NER
Sreyan Ghosh*, Utkarsh Tyagi*, Sonal Kumar*, Dinesh Manocha
Code / Poster
SIGIR 2023
Workshop
-
UNFUSED: UNsupervised Finetuning Using SElf supervised Distillation
Ashish Seth*, Sreyan Ghosh*, S. Umesh, Dinesh Manocha
Code / Poster
ICASSP 2023 SASB Workshop -
DeLoRes: Decorrelating Latent Spaces for Low-Resource Audio Representation Learning
Sreyan Ghosh, Ashish Seth, Deepak Mittal, Maneesh Singh, S. Umesh
Code
SAS Workshop @ AAAI 2022 -
Leveraging Transformers for Hate Speech Detection in Conversational Code-Mixed Tweets
Zaki Mustafa Farooqi, Sreyan Ghosh, Rajiv Ratn Shah
Leaderboard (Team Name: MIDAS@IIIT-D)
FIRE 2021 -
Cisco at SemEval-2021 Task 5: What’s Toxic?: Leveraging Transformers for Multiple Toxic Span Extraction from Online Comments
Sreyan Ghosh, Sonal Kumar
Code
SemEval-2021 @ ACL 2021 -
Cisco at AAAI-CAD21 shared task: Predicting Emphasis in Presentation Slides using Contextualized Embeddings
Sreyan Ghosh, Sonal Kumar, Harsh Jalan, Hemant Yadav, Rajiv Ratn Shah
Code
CAD-21 @ AAAI 2021