Multimodal LLMs

Vision-language, video-language, and VLA research that connects perception with language reasoning.

15 papers

43 resource links

Latest month: 2026.05

13 papers

Vision-Language

2026.05 Vision-Language

Lance: Unified Multimodal Modeling by Multi-Task Synergy

Lance introduces a lightweight, natively unified multimodal model for image and video understanding, generation, and editing that does not rely primarily on scaling model capacity. It combines shared interleaved context modeling, decoupled capability pathways, dual-stream MoE, modality-aware rotary positional encoding, and staged multi-task training to improve both generation and understanding.

Paper Project Code Hugging Face

2026.04 Vision-Language

Video-MME-v2: Towards the Next Stage in Benchmarks for Comprehensive Video Understanding

This paper introduces Video-MME-v2, a video understanding benchmark that addresses saturation in existing benchmarks, where inflated leaderboard scores fail to reflect real-world model capabilities.

Paper Project Code Hugging Face

2026.04 Vision-Language

V-Reflection: Transforming MLLMs from Passive Observers to Active Interrogators

V-Reflection converts MLLMs from passive visual consumers to active interrogators through a think-then-look reflection mechanism that grounds each reasoning step in visual evidence. A two-stage distillation design improves fine-grained perception while keeping inference fully autoregressive and efficient.

Paper Project Code Hugging Face

2026.03 Vision-Language

ThinkJEPA: Empowering Latent World Models with Large Vision-Language Reasoning Model

This paper proposes a VLM-guided JEPA-style latent world modeling framework that combines dense-frame dynamics prediction with long-horizon semantic guidance through a dual-temporal design. It further introduces a hierarchical pyramid representation extraction module to transfer multi-layer VLM reasoning features into latent forecasting for more robust hand-manipulation trajectory prediction.

Paper

2025.11 Vision-Language

Qwen3-VL Technical Report

This technical report introduces Qwen3-VL, the most capable vision-language model in the Qwen series to date, which achieves superior performance across a broad range of multimodal benchmarks. It natively supports interleaved contexts of up to 256K tokens, seamlessly integrating text, images, and video.

Paper Project Code Hugging Face

2025.08 Vision-Language

InternVL3.5: Advancing Open-Source Multimodal Models in Versatility, Reasoning, and Efficiency

This paper introduces InternVL3.5, a new family of open-source multimodal models that significantly advances versatility, reasoning capability, and inference efficiency, featuring the Cascade Reinforcement Learning framework.

Paper

2025.02 Vision-Language

Qwen2.5-VL Technical Report

This technical report introduces Qwen2.5-VL, a flagship vision-language model with stronger visual recognition, precise localization, robust document parsing, and long-video understanding. It also improves agentic interaction with visual environments through better grounding and structured perception capabilities.

Paper Code Hugging Face

2024.12 Vision-Language

InternVL 2.5: Expanding Performance Boundaries of Open-Source Multimodal Models with Model, Data, and Test-Time Scaling

This paper introduces InternVL 2.5, an advanced multimodal LLM series and the first open-source MLLM to surpass 70% on the MMMU benchmark, achieving a 3.7-point improvement through Chain-of-Thought reasoning.

Paper Hugging Face

2024.09 Vision-Language

Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution

This paper introduces Qwen2-VL, a vision-language model series that uses Naive Dynamic Resolution to process images at arbitrary resolutions and M-RoPE to fuse text, image, and video positional information. Scaling the model to 2B, 8B, and 72B parameters with larger multimodal data yields competitive image, video, multilingual OCR, document understanding, and agentic visual interaction performance.

Paper Project Code Hugging Face
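
To make the "arbitrary resolution" idea in the Qwen2-VL entry above concrete, here is a minimal sketch of how the number of visual tokens could scale with image size. It is an illustration, not the model's actual preprocessing code. The 14-pixel patch size, the 2x2 merging of adjacent patch tokens, and the two boundary tokens are taken from the Qwen2-VL paper as best recalled and are not stated in this listing. The helper names `round_to_multiple` and `visual_token_count` are hypothetical.

```python
PATCH = 14  # assumed ViT patch size in pixels (not stated in this listing)
MERGE = 2   # assumed 2x2 merge of adjacent patch tokens (not stated in this listing)


def round_to_multiple(value: int, base: int) -> int:
    """Round value to the nearest multiple of base, never going below base."""
    return max(base, int(round(value / base)) * base)


def visual_token_count(height: int, width: int) -> int:
    """Estimate visual tokens for an image under the stated assumptions.

    Each side is snapped to a multiple of PATCH * MERGE so the image tiles
    into whole merged patches; two extra tokens mark the start and end of
    the visual span.
    """
    unit = PATCH * MERGE
    h = round_to_multiple(height, unit)
    w = round_to_multiple(width, unit)
    merged_tokens = (h // unit) * (w // unit)
    return merged_tokens + 2


if __name__ == "__main__":
    for size in [(224, 224), (224, 448), (448, 448)]:
        print(size, visual_token_count(*size))
```

Under these assumptions the token count grows with image area rather than being fixed, which is the property the summary attributes to dynamic resolution.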

2024.07 Vision-Language

LLaVA-NeXT-Interleave: Tackling Multi-image, Video, and 3D in Large Multimodal Models

This paper introduces LLaVA-NeXT-Interleave, which simultaneously tackles Multi-image, Multi-frame (video), Multi-view (3D), and Multi-patch scenarios in large multimodal models, extending visual instruction tuning to multi-modal scenarios.

Paper Project Code

2024.05 Vision-Language

How Good is my Video LMM? Complex Video Reasoning and Robustness Evaluation Suite for Video-LMMs

This paper introduces CVRR-ES, a benchmark that comprehensively assesses Video-LMMs across 11 diverse real-world video dimensions, evaluating 9 recent models and finding that most open-source Video-LMMs struggle with robustness and reasoning on complex videos.

Paper Project

2023.08 Vision-Language

Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond

This paper introduces Qwen-VL, a vision-language model series built on Qwen-LM with a visual receptor, multimodal interface, three-stage training pipeline, and multilingual multimodal corpus. By aligning image-caption-box tuples, Qwen-VL supports visual understanding, grounding, and text reading while achieving strong results across visual-centric benchmarks.

Paper Code Hugging Face

2023.04 Vision-Language

LLaVA: Visual Instruction Tuning

This paper presents LLaVA, a large multimodal model trained end-to-end on machine-generated instruction tuning data, showing impressive multimodal chat abilities and achieving state-of-the-art results on Science QA.

Paper Project

1 paper

Multimodal Reasoning

2025.03 Multimodal Reasoning

Multimodal Chain-of-Thought Reasoning: A Comprehensive Survey

This survey addresses the lack of an up-to-date review of multimodal Chain-of-Thought reasoning in MLLMs across image, video, speech, audio, 3D, and structured data. It introduces foundational definitions, a comprehensive taxonomy, methodological analysis across applications, and open challenges for future multimodal reasoning research.

Paper Project

1 paper

VLA

2026.04 VLA

Xiaomi OneVL: One-Step Latent Reasoning and Planning with Vision-Language Explanation

OneVL addresses real-time trajectory planning in VLA-based autonomous driving by compressing Chain-of-Thought reasoning into compact latent tokens supervised by both language reconstruction and future-frame prediction. Its three-stage training pipeline yields latent reasoning that surpasses explicit CoT while matching the latency of answer-only inference.

Paper Project Code Hugging Face