2026.05
Vision-Language
Lance: Unified Multimodal Modeling by Multi-Task Synergy
Lance introduces a lightweight native unified multimodal model for image and video understanding, generation, and editing that does not rely primarily on capacity scaling. It combines shared interleaved context modeling, decoupled capability pathways, dual-stream MoE, modality-aware rotary positional encoding, and staged multi-task training to improve both generation and understanding.
2026.04
Vision-Language
Video-MME-v2: Towards the Next Stage in Benchmarks for Comprehensive Video Understanding
This paper introduces Video-MME-v2, an improved video understanding benchmark that addresses saturation in existing benchmarks, where inflated leaderboard scores fail to reflect real-world model capabilities.
2026.04
Vision-Language
V-Reflection: Transforming MLLMs from Passive Observers to Active Interrogators
V-Reflection converts MLLMs from passive visual consumers to active interrogators through a think-then-look reflection mechanism that grounds each reasoning step in visual evidence. A two-stage distillation design improves fine-grained perception while keeping inference fully autoregressive and efficient.
2026.03
Vision-Language
ThinkJEPA: Empowering Latent World Models with Large Vision-Language Reasoning Model
This paper proposes a VLM-guided JEPA-style latent world modeling framework that combines dense-frame dynamics prediction with long-horizon semantic guidance through a dual-temporal design. It further introduces a hierarchical pyramid representation extraction module to transfer multi-layer VLM reasoning features into latent forecasting for more robust hand-manipulation trajectory prediction.
2025.11
Vision-Language
Qwen3-VL Technical Report
This technical report introduces Qwen3-VL, the most capable vision-language model in the Qwen series to date, achieving superior performance across a broad range of multimodal benchmarks. It natively supports interleaved contexts of up to 256K tokens, seamlessly integrating text, images, and video.
2025.08
Vision-Language
InternVL3.5: Advancing Open-Source Multimodal Models in Versatility, Reasoning, and Efficiency
This paper introduces InternVL 3.5, a new family of open-source multimodal models that significantly advances versatility, reasoning capability, and inference efficiency, featuring the Cascade Reinforcement Learning framework.
2025.02
Vision-Language
Qwen2.5-VL Technical Report
This technical report introduces Qwen2.5-VL, a flagship vision-language model with stronger visual recognition, precise localization, robust document parsing, and long-video understanding. It also improves agentic interaction with visual environments through better grounding and structured perception capabilities.
2024.12
Vision-Language
InternVL 2.5: Expanding Performance Boundaries of Open-Source Multimodal Models with Model, Data, and Test-Time Scaling
This paper introduces InternVL 2.5, an advanced multimodal LLM series that was the first open-source MLLM to surpass 70% on the MMMU benchmark, achieving a 3.7-point improvement through Chain-of-Thought reasoning.
2024.09
Vision-Language
Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution
This paper introduces Qwen2-VL, a vision-language model series that uses Naive Dynamic Resolution to process images at arbitrary resolutions and M-RoPE to fuse text, image, and video positional information. Scaling the model to 2B, 8B, and 72B parameters with larger multimodal data yields competitive image, video, multilingual OCR, document understanding, and agentic visual interaction performance.
2024.07
Vision-Language
LLaVA-NeXT-Interleave: Tackling Multi-image, Video, and 3D in Large Multimodal Models
This paper introduces LLaVA-NeXT-Interleave, which simultaneously tackles multi-image, multi-frame (video), multi-view (3D), and multi-patch (single-image) scenarios in large multimodal models, extending visual instruction tuning beyond single-image tasks.
2024.05
Vision-Language
How Good is my Video LMM? Complex Video Reasoning and Robustness Evaluation Suite for Video-LMMs
This paper introduces CVRR-ES, a benchmark that comprehensively assesses Video-LMMs across 11 diverse real-world video dimensions, evaluating 9 recent models and finding that most open-source Video-LMMs struggle with robustness and reasoning on complex videos.
2023.08
Vision-Language
Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond
This paper introduces Qwen-VL, a vision-language model series built on Qwen-LM with a visual receptor, multimodal interface, three-stage training pipeline, and multilingual multimodal corpus. By aligning image-caption-box tuples, Qwen-VL supports visual understanding, grounding, and text reading while achieving strong results across visual-centric benchmarks.
2023.04
Vision-Language
LLaVA: Visual Instruction Tuning
This paper presents LLaVA, a large multimodal model trained end-to-end on machine-generated instruction-tuning data, showing impressive multimodal chat abilities and achieving state-of-the-art results on Science QA.