My research focuses on advancing multimodal intelligence, with a core emphasis on audio, spanning speech, sounds, and music. I work on challenges such as developing data- and compute-efficient models, improving multimodal representation learning, and enhancing perception and reasoning in AI systems.

In my early work, I explored resource-efficient deep learning, proposing methods for training models under constraints of limited labeled and unlabeled data and limited compute. These efforts included techniques such as synthetic data augmentation and self-supervised learning to enable more effective downstream learning.

Currently, my research is directed toward building omni-intelligence: developing Large Multimodal Models (LMMs) that seamlessly integrate audio, language, vision, and other modalities. I focus on advancing architectures, scalable synthetic data pipelines, and cross-modal reasoning to move toward more general-purpose AI systems. My publications span diverse areas in multimodal AI, including natural language understanding, audio understanding, audio generation, compositional reasoning, Large Audio-Language Models (LALMs), and multimodal pre-training and fine-tuning.

Google Scholar | Semantic Scholar

Pre-prints

Audio and Spoken Language Processing (Chronological)

Natural Language Processing (Chronological)

Workshop