
Visual Description Grounding Reduces Hallucinations and Boosts Reasoning in LVLMs

Sreyan Ghosh1*, Chandra Kiran Reddy Evuru1*, Sonal Kumar1*, Utkarsh Tyagi1, Oriol Nieto2, Zeyu Jin2, Dinesh Manocha1
1University of Maryland, 2Adobe
*Equal contribution

Large Vision-Language Models (LVLMs) often produce responses that misalign with factual information, a phenomenon known as hallucinations. While hallucinations are well-studied, the exact causes behind them remain underexplored. In this paper, we first investigate the root causes of hallucinations in LVLMs. Our findings reveal that existing mitigation techniques primarily reduce hallucinations for visual recognition prompts—those that require simple descriptions of visual elements—but fail for cognitive prompts that demand deliberate reasoning.

We identify the core issue as a lack of true visual perception in LVLMs: although they can accurately recognize visual elements, they struggle to fully interpret these elements in the context of the input prompt and effectively link this recognition to their internal knowledge, which is critical for reasoning. To address this gap, we introduce Visual Description Grounded Decoding (VDGD), a simple, robust, and training-free method designed to enhance visual perception and improve reasoning capabilities in LVLMs.

VDGD works by first generating a detailed description of the image and prepending it to the instruction as a prefix. During response generation, tokens are sampled based on their KL divergence to the description, favoring candidates with lower divergence. Experimental results on multiple visual reasoning benchmarks and LVLMs demonstrate that VDGD consistently outperforms existing baselines by 2%–33%. Finally, we introduce VaLLu, a benchmark designed for comprehensive evaluation of the cognitive capabilities of LVLMs.

Key Findings

1. While prior techniques work well for visual recognition tasks, they fail when applied to cognitive prompts requiring reasoning.

Figure 1: (Left) Performance comparison of different LVLMs on various benchmarks. (Right) Performance comparison of different hallucination mitigation techniques applied to LLaVA-1.5.

2. We categorize hallucinations into four types: Language, Vision, Style, and Instruction Tuning (IT). Existing methods only mitigate a subset.

Figure 2: Types of Visual Recognition Hallucinations.

3. LVLMs can recognize visual elements but struggle to link them with internal knowledge, leading to incorrect reasoning.

Figure 3: Base Rank Comparison between AMBER and MATH-Vision datasets as a function of token position in responses (for CogVLM).
Figure 4: (Left) Performance comparison of different LVLMs when prompted with the original prompt vs. rephrased prompts without image (-t). (Right) Performance comparison of different LVLMs for their ability to generate a faithful image description.

Our Approach

To bridge the visual perception gap identified above, VDGD first generates a detailed description of the image and prepends it to the instruction as a grounding prefix. During response generation, candidate tokens are scored by their KL divergence to the description, and candidates with lower divergence are favored.
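
For reference, the Kullback-Leibler (KL) divergence between two discrete distributions P and Q over the vocabulary is the standard quantity below; which specific distributions are compared at each decoding step follows the paper's procedure, not this note:

$$
D_{\mathrm{KL}}(P \parallel Q) = \sum_{v \in \mathcal{V}} P(v) \, \log \frac{P(v)}{Q(v)}
$$

Candidates whose divergence to the description is lower are favored during decoding.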

Figure 5: Illustration of our proposed VDGD method.
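
Below is a minimal sketch of what a single VDGD-style decoding step could look like. It is an illustration under assumptions, not the authors' implementation: the helper names (softmax, kl_divergence, vdgd_step), the top-k candidate restriction, the one-hot candidate distributions, and the greedy selection of the lowest-divergence candidate are choices made for this example, and the random logits stand in for a real LVLM's outputs.

```python
# Minimal sketch of a single VDGD-style decoding step (illustration only).
# softmax, kl_divergence, and vdgd_step are hypothetical helpers; the random
# logits below stand in for a real LVLM's outputs.
import numpy as np

def softmax(logits):
    z = logits - logits.max()
    p = np.exp(z)
    return p / p.sum()

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) between two discrete distributions over the vocabulary."""
    p = np.clip(p, eps, 1.0)
    q = np.clip(q, eps, 1.0)
    return float(np.sum(p * np.log(p / q)))

def vdgd_step(step_logits, description_logits, top_k=10):
    """Choose the next token for one decoding step.

    step_logits:        [V] logits at the current position, conditioned on the
                        prompt with the image description prefixed.
    description_logits: list of [V] logit vectors taken at the description
                        token positions (the grounding reference).
    The top-k candidates are re-ranked so that the candidate whose (one-hot)
    distribution has the lowest KL divergence to any description-conditioned
    distribution is chosen greedily; the actual method may instead sample.
    """
    vocab_size = len(step_logits)
    ref_dists = [softmax(l) for l in description_logits]
    candidates = np.argsort(step_logits)[-top_k:]

    scores = {}
    for tok in candidates:
        one_hot = np.zeros(vocab_size)
        one_hot[tok] = 1.0
        # Lower divergence to the description -> better grounded candidate.
        scores[tok] = min(kl_divergence(one_hot, q) for q in ref_dists)
    return min(scores, key=scores.get)

# Toy usage with random logits standing in for model outputs.
rng = np.random.default_rng(0)
vocab = 50
step_logits = rng.normal(size=vocab)
description_logits = [rng.normal(size=vocab) for _ in range(5)]
print("chosen token id:", vdgd_step(step_logits, description_logits))
```

Note that with a one-hot candidate distribution, the divergence to a description-conditioned distribution reduces (up to negligible terms) to the negative log-probability of the candidate under that distribution, so this sketch effectively re-ranks candidates by how strongly the description supports them.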

Comparison with Existing Methods

1. Training Requirement
VDGD: Training-free; prepends a detailed image description to the instruction and uses KL divergence for robust decoding.
Other Baselines: Often require additional training, fine-tuning, or specialized modules to address object hallucinations.

2. Scope of Mitigation
VDGD: Targets all forms of hallucinations, especially those in cognitive prompts, by bridging the "visual perception gap."
Other Baselines: Primarily reduce object-based or "visual recognition" hallucinations; limited efficacy on more complex, reasoning-intensive tasks.

3. Performance Gains
VDGD: Consistently outperforms baselines by 2%–33% on multiple benchmarks, improving both reasoning and recognition accuracy.
Other Baselines: Show smaller or no gains on cognitive prompts requiring extended reasoning or domain knowledge.

Evaluation Results

We evaluate VDGD on multiple benchmarks and observe improvements of 2%–33% over existing techniques.

Figure 6: Evaluation Results across multiple benchmarks.

Qualitative Analysis

We illustrate several instances from VaLLu and compare the responses of LLaVA-1.5 under Greedy, VCD, and VDGD decoding.

Figure 7: Qualitative Example 1
Figure 8: Qualitative Example 2

The VaLLu Benchmark

VaLLu consists of 1,500 instances sourced from multiple benchmarks, including Oven, MMMU, MMC, MathVista, HallusionBench, MATH-Vision, and MME. It contains only open-ended generation tasks, excluding Yes/No and multiple-choice questions, in order to evaluate diverse forms of hallucination in LVLMs.

The dataset is carefully curated to balance affordability and task diversity, ensuring a comprehensive evaluation. Additionally, VaLLu is manually filtered to remove noisy samples (we find that existing benchmarks contain noisy samples, as shown below) and is enriched with meta-data annotations and expert-provided responses for high-quality benchmarking.
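
The exact data format is not shown on this page; the snippet below is a hypothetical sketch of what a single VaLLu instance might look like, given that each sample pairs an open-ended prompt with meta-data annotations and an expert-provided reference response. All field names and values are illustrative assumptions.

```python
# Hypothetical sketch of a single VaLLu instance; all field names and values
# are illustrative assumptions based on the description above.
from dataclasses import dataclass

@dataclass
class VaLLuInstance:
    image_path: str        # path to the input image
    prompt: str            # open-ended instruction (no Yes/No or MCQ)
    source_benchmark: str  # e.g., "MMMU", "MathVista", "MATH-Vision"
    task_type: str         # meta-data annotation used for task-type analysis
    expert_response: str   # expert-provided reference answer

example = VaLLuInstance(
    image_path="images/sample_001.png",
    prompt="What trend does the chart show between 2010 and 2020?",
    source_benchmark="MMC",
    task_type="chart reasoning",
    expert_response="The values increase steadily over the decade.",
)
print(example.source_benchmark, "-", example.task_type)
```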

Figure 9: Distribution of task types in VaLLu.
Figure 10: Noisy example from HallusionBench.
Figure 11: Noisy example from MathVista.
Figure 12: Noisy example from MMC.
Figure 13: Noisy examples from MMMU.