
EffiVLM-Bench

A Comprehensive Benchmark for Evaluating Training-Free Acceleration in Large Vision-Language Models

Zekun Wang1, Minghua Ma1*, Zexin Wang1*, Rongchuan Mu1*,
Liping Shan2, Ming Liu1,3, Bing Qin1,3
1Harbin Institute of Technology, 2Du Xiaoman Science Technology Co., Ltd, 3Pengcheng Laboratory
*Equal Contribution

Introduction

Large vision-language models (LVLMs) have rapidly advanced, transforming multimodal AI and showing great potential for real-world applications. However, their remarkable capabilities are often overshadowed by massive computational and memory costs, severely hindering practical deployment. While some studies propose more efficient architectures or incorporate distillation, these typically demand full retraining, incurring substantial overhead. As a result, there has been a growing focus on training-free acceleration methods for LVLMs, which are more economical and can be broadly classified into token compression (eliminating redundant tokens) and parameter compression (reducing parameter size via pruning or quantization).

Current evaluations of these methods often use outdated models, limited benchmarks, or narrow metrics, failing to capture generalization or loyalty. There's also a lack of systematic exploration into performance-efficiency trade-offs. To address these limitations, we propose EFFIVLM-BENCH, a unified evaluation framework for systematically assessing training-free acceleration methods of LVLMs. EFFIVLM-BENCH spans a wide range of representative model architectures and tasks, employing comprehensive metrics for performance, generalization, loyalty, and efficiency. With EFFIVLM-BENCH, we conduct a thorough comparison of mainstream token and parameter compression methods, explore Pareto-optimal trade-offs, and offer a nuanced understanding of their strengths and limitations to guide future research and practical deployment.

Key Features of EFFIVLM-BENCH

1. Comprehensive SOTA Models & Methods

EFFIVLM-BENCH provides a robust platform by incorporating:

  • Leading LVLMs: Evaluation across frontier models like LLaVA-OneVision-7B (OV), Qwen2-VL-7B, and InternVL2.5-38B.
  • Mainstream Acceleration Techniques: Systematic evaluation of:
    • ✂️ Token Compression: Visual Token Pruning (e.g., FastV, VisionZip, PruMerge+) and KV Cache Compression (e.g., StreamingLLM, H2O, VL-Cache).
    • 🗃️ Parameter Compression: Weight Pruning (e.g., Wanda, SparseGPT) and Quantization (e.g., AWQ, GPTQ).
  • Diverse Benchmarks: Evaluation spans 17 widely-used benchmarks.

2. Holistic Evaluation Metrics

A unified framework employing comprehensive metrics to assess methods on:

  • Overall Performance (OP): Measures the compressed model's performance relative to the original model.
  • Generalization (OG): Evaluates consistency across models and benchmarks (lower is better).
  • Loyalty (OL): Assesses preservation of the original model's predictions.
  • Efficiency (OE): Reflects real-world latency via actual inference time speedup.
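
To make these four axes concrete, here is a toy Python sketch of how such scores could be computed. It is illustrative only: the benchmark names, numbers, and exact aggregation below are hypothetical assumptions, not the paper's precise metric definitions.

    # Illustrative only: hypothetical scores, not the exact metric definitions.
    original   = {"BenchA": 80.0, "BenchB": 60.0}   # original model accuracy (%)
    compressed = {"BenchA": 76.0, "BenchB": 57.0}   # compressed model accuracy (%)

    # Overall Performance (OP): average per-benchmark performance retention.
    op = sum(compressed[b] / original[b] for b in original) / len(original)

    # Loyalty (OL): fraction of examples where the compressed model reproduces
    # the original model's prediction.
    orig_preds = ["A", "B", "C", "A"]
    comp_preds = ["A", "B", "D", "A"]
    ol = sum(o == c for o, c in zip(orig_preds, comp_preds)) / len(orig_preds)

    # Efficiency (OE): wall-clock speedup of the compressed model.
    oe = 1.84 / 1.23   # original latency (s) / compressed latency (s)

    print(f"OP={op:.3f}  OL={ol:.2f}  OE={oe:.2f}x")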

3. In-depth Analyses & Insights

EFFIVLM-BENCH offers actionable insights by exploring:

  • Pareto-Optimal Trade-offs: Balancing model performance and inference efficiency.
  • Compression Mechanisms Analysis:
    • 🎚️ Layer-adaptive vs. static budget allocation in KV cache.
    • 🎯 Impact of head-adaptive token selection in KV cache.
    • ⚓ Significance of "attention sink" tokens.
    • 🧩 Optimal strategies for merging evicted tokens.
  • Open Source: Provides code and recipes to foster future research.

Benchmark Results and Key Findings

Token Compression Findings

We evaluate token compression effectiveness by examining two mainstream approaches: (1) token pruning (e.g., FastV, VisionZip, PruMerge+) and (2) KV cache compression (e.g., StreamingLLM, H2O, SnapKV, PyramidKV, LOOK-M, VL-Cache).

Performance comparison of KV cache compression methods.

Observation 1: Token compression performance is task-dependent and shows significant sensitivity to benchmark and model. Most methods are stable at higher budgets but degrade sharply at very low budgets (e.g., 1%), especially on tasks requiring fine-grained visual detail or long outputs. For token pruning at a 1% budget, methods pruning within the visual encoder (e.g., VisionZip, PruMerge+) outperform those pruning in the LLM backbone (e.g., FastV).
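
As a rough illustration of pruning in the LLM backbone, the snippet below sketches a FastV-style rule: visual tokens are ranked by the attention they receive at a chosen layer and only the top fraction is kept. The function name, tensor shapes, and the keep_ratio default are illustrative assumptions, not the method's reference implementation.

    import torch

    def prune_visual_tokens(attn, hidden, vis_start, vis_end, keep_ratio=0.1):
        """FastV-style sketch: drop low-attention visual tokens after a given layer.

        attn:   [num_heads, seq_len, seq_len] attention weights at the pruning layer
        hidden: [seq_len, hidden_dim] hidden states entering the next layer
        """
        # Score each visual token by the average attention it receives from all queries.
        scores = attn.mean(dim=0)[:, vis_start:vis_end].mean(dim=0)   # [num_visual]
        k = max(1, int(keep_ratio * (vis_end - vis_start)))
        keep_visual = scores.topk(k).indices + vis_start

        keep_mask = torch.ones(hidden.size(0), dtype=torch.bool)
        keep_mask[vis_start:vis_end] = False       # drop all visual tokens ...
        keep_mask[keep_visual] = True              # ... except the top-k scorers
        return hidden[keep_mask]

    # Toy usage: 64 visual tokens (positions 4..67) in a 96-token sequence.
    attn = torch.softmax(torch.randn(16, 96, 96), dim=-1)
    hidden = torch.randn(96, 1024)
    pruned = prune_visual_tokens(attn, hidden, 4, 68, keep_ratio=0.1)
    print(pruned.shape)   # torch.Size([38, 1024]): 32 non-visual + 6 kept visual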

Visual Token Pruning Results

Observation 2: KV cache compression outperforms token pruning in generalization and loyalty. Methods like H2O and PyramidKV show strong overall performance. KV cache methods are generally preferred when generalization and loyalty are critical.
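
The KV cache methods above share a common eviction pattern; the sketch below shows an H2O-style heavy-hitter rule that keeps the most-attended cached tokens plus a small recent window. Argument names and the recent-window size are assumptions for illustration, not the benchmark's exact configuration.

    import torch

    def evict_kv(key_cache, value_cache, acc_attn, budget, recent=8):
        """H2O-style sketch: keep heavy-hitter tokens plus the most recent ones.

        key_cache/value_cache: [seq_len, head_dim]
        acc_attn: [seq_len] attention mass each cached token has accumulated so far
        """
        seq_len = key_cache.size(0)
        if seq_len <= budget:
            return key_cache, value_cache
        recent_idx = torch.arange(seq_len - recent, seq_len)
        # Heavy hitters are chosen from the older tokens only.
        heavy_idx = acc_attn[: seq_len - recent].topk(budget - recent).indices
        keep = torch.cat([heavy_idx, recent_idx]).sort().values
        return key_cache[keep], value_cache[keep]

    keys, values = torch.randn(2, 512, 128)
    acc_attn = torch.rand(512)
    k, v = evict_kv(keys, values, acc_attn, budget=64)
    print(k.shape)   # torch.Size([64, 128])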

Token Compression Generalization and Loyalty

Observation 3: Selecting token pruning or KV cache compression based on task statistics can achieve a better performance-efficiency trade-off. Token pruning drastically reduces Time-To-First-Token (TTFT), ideal for short-response tasks. KV cache methods may offer better performance for tasks with long outputs at low budgets.

Observation 4: Consistent performance trends of token compression methods are observed across single-image, multi-image, and video tasks.

Trade-off Analysis for Token Compression

Parameter Compression Findings

We evaluate two mainstream approaches: pruning (e.g., ECoFLaP, Wanda, SparseGPT) and quantization (e.g., AWQ, GPTQ).

Observation 5: Parameter compression generally preserves performance more effectively than token compression, even at higher compression ratios (e.g., 50% sparsity). Quantization methods like AWQ tend to preserve higher performance compared to pruning. Importantly, token and parameter compression are orthogonal and can be effectively combined.
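
As a concrete example of the pruning side, the snippet below applies a Wanda-style score (weight magnitude times per-channel input-activation norm) to zero out 50% of each output row. It is a simplified single-layer sketch; variable names and shapes are illustrative assumptions.

    import torch

    def wanda_prune(weight, act_norm, sparsity=0.5):
        """Wanda-style sketch: score = |W| * ||x||, prune lowest-scoring weights per row.

        weight:   [out_features, in_features]
        act_norm: [in_features] L2 norm of each input channel over calibration data
        """
        score = weight.abs() * act_norm.unsqueeze(0)
        k = int(weight.size(1) * sparsity)
        prune_idx = score.topk(k, dim=1, largest=False).indices
        mask = torch.ones_like(weight, dtype=torch.bool)
        mask.scatter_(1, prune_idx, False)
        return weight * mask

    w = torch.randn(4096, 4096)
    act_norm = torch.rand(4096)
    w_sparse = wanda_prune(w, act_norm, sparsity=0.5)
    print((w_sparse == 0).float().mean())   # ~0.5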

Parameter Compression Results

Key Insights and Discussion

Revisiting Layer-Adaptive Sparsity in KV Cache Compression

While layer-adaptive sparsity can benefit LLMs, our findings suggest it's not universally advantageous for LVLMs, especially at low sparsity budgets. For instance, VL-Cache's aggressive front-loading of budget to early layers can starve subsequent layers. A hybrid allocation (e.g., 80% uniform + 20% adaptive) showed improved performance.
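
A minimal sketch of such a hybrid allocation is shown below: a fraction of the total KV budget is spread uniformly over layers and the remainder is distributed proportionally to per-layer importance scores. The 80/20 split comes from the observation above; the scoring input, names, and shapes are illustrative assumptions.

    import torch

    def hybrid_budget(layer_scores, tokens_per_layer, budget_ratio=0.01, uniform_frac=0.8):
        """Split the total KV budget: `uniform_frac` evenly, the rest score-proportional.

        layer_scores: [num_layers] per-layer importance (e.g., attention sparsity)
        """
        num_layers = layer_scores.numel()
        total = budget_ratio * tokens_per_layer * num_layers
        uniform = torch.full((num_layers,), uniform_frac * total / num_layers)
        adaptive = (1.0 - uniform_frac) * total * layer_scores / layer_scores.sum()
        return (uniform + adaptive).round().clamp(min=1).long()

    scores = torch.rand(32)                                  # hypothetical per-layer scores
    print(hybrid_budget(scores, tokens_per_layer=2048))      # per-layer token budgets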

VL-Cache Budget Allocation

Revisiting Head-Adaptive Mechanism in KV Cache Compression

Allowing different heads within the same layer to select cache tokens adaptively (head-adaptive) generally improves performance under aggressive budget constraints (e.g., 1% budget) by retaining more critical information.
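
The difference can be sketched as follows: under a head-adaptive policy each attention head keeps its own top-scoring cached tokens, whereas a layer-uniform policy forces all heads to share one set chosen from head-averaged scores. Shapes and names below are illustrative assumptions.

    import torch

    def select_kv_indices(attn, budget, head_adaptive=True):
        """attn: [num_heads, q_len, kv_len] attention from recent queries to the cache."""
        scores = attn.sum(dim=1)                           # [num_heads, kv_len]
        if head_adaptive:
            # Each head independently keeps its own top-`budget` cached tokens.
            return scores.topk(budget, dim=-1).indices     # [num_heads, budget]
        # Layer-uniform: one shared set from head-averaged importance.
        shared = scores.mean(dim=0).topk(budget).indices   # [budget]
        return shared.unsqueeze(0).expand(scores.size(0), -1)

    attn = torch.softmax(torch.randn(32, 16, 1024), dim=-1)
    print(select_kv_indices(attn, budget=10).shape)        # torch.Size([32, 10])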

Head Adaptive Analysis

Attention Sink Tokens in LVLMs

Attention sink tokens (tokens that receive high attention regardless of semantic relevance) are present in both text and image modalities in LVLMs. Removing these can degrade performance. Methods like StreamingLLM that preserve text sink tokens show improvements. Visual sink tokens also significantly impact performance; text-guided visual token pruning (e.g., FastV) may fail to capture these crucial visual sink tokens, unlike image-guided pruning (e.g., VisionZip).
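
A StreamingLLM-style retention rule, for instance, can be sketched in a few lines: keep the first few (sink) positions plus a recent window, regardless of their content. The default sizes below are illustrative, not the values used in the benchmark.

    import torch

    def sink_plus_recent(seq_len, num_sink=4, recent=512):
        """Keep the first `num_sink` attention-sink positions and the last `recent` ones."""
        sinks = torch.arange(min(num_sink, seq_len))
        window = torch.arange(max(num_sink, seq_len - recent), seq_len)
        return torch.unique(torch.cat([sinks, window]))

    keep = sink_plus_recent(2048)
    print(keep[:6])   # tensor([   0,    1,    2,    3, 1536, 1537])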

Merging Strategies for Evicted Tokens

Merging evicted tokens can help recover information. However, cross-modal merging (e.g., visual tokens into text tokens) can disrupt critical textual features, especially at very low budgets, leading to performance degradation as seen with LOOK-M. Modality-specific merging (merging evicted tokens only within the same modality) shows improved performance.
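
The contrast can be sketched as follows: each evicted token is folded into its most similar retained token of the same modality, instead of being merged across modalities. The similarity measure, averaging rule, and variable names are illustrative assumptions rather than LOOK-M's exact procedure.

    import torch
    import torch.nn.functional as F

    def merge_within_modality(kv, keep_idx, evict_idx, is_visual):
        """Merge each evicted KV entry into its nearest kept entry of the SAME modality.

        kv: [seq_len, dim]; is_visual: [seq_len] bool mask (True = visual token)
        """
        merged = kv.clone()
        for e in evict_idx.tolist():
            same_modality = keep_idx[is_visual[keep_idx] == is_visual[e]]
            if same_modality.numel() == 0:
                continue
            sims = F.cosine_similarity(kv[e].unsqueeze(0), kv[same_modality])
            target = same_modality[sims.argmax()]
            merged[target] = 0.5 * (merged[target] + kv[e])   # simple average merge
        return merged[keep_idx]

    kv = torch.randn(16, 64)
    is_visual = torch.tensor([False] * 4 + [True] * 12)
    keep_idx = torch.tensor([0, 1, 2, 3, 4, 5, 6, 7])
    evict_idx = torch.tensor([8, 9, 10, 11, 12, 13, 14, 15])
    print(merge_within_modality(kv, keep_idx, evict_idx, is_visual).shape)   # [8, 64]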

BibTeX


    @misc{wang2025effivlmbenchcomprehensivebenchmarkevaluating,
      title={EffiVLM-BENCH: A Comprehensive Benchmark for Evaluating Training-Free Acceleration in Large Vision-Language Models},
      author={Zekun Wang and Minghua Ma and Zexin Wang and Rongchuan Mu and Liping Shan and Ming Liu and Bing Qin},
      year={2025},
      eprint={2506.00479},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2506.00479},
    }

Acknowledgements

Ming Liu is the corresponding author. This work is supported by the National Key Research and Development Project (2022YFF0903301) and the National Natural Science Foundation of China (U22B2059, 62276083). We also thank Du Xiaoman Technology for supporting this work.