From Qwen-VL to Qwen3-VL: Four Generations of Architecture and Training

A technical review of how four Qwen-VL generations evolved across vision-language alignment, dynamic resolution, spatiotemporal position encoding, video modeling, and deep visual fusion.

Author

Brench

Published

June 15, 2026

TL;DR: This note traces the architectural and training changes across four Qwen-VL generations. Qwen-VL established a three-stage vision-language alignment pipeline. Qwen2-VL introduced dynamic resolution, M-RoPE, and native video input. Qwen2.5-VL addressed inference cost, physical-time modeling, and post-training data quality. Qwen3-VL moved visual injection deeper into the early LLM layers.

The Qwen-VL models are useful to study as a series because each generation directly addresses constraints introduced or exposed by the previous one. Qwen-VL first connected visual features to a language model. Qwen2-VL and Qwen2.5-VL then addressed dynamic resolution, video input, position encoding, and compute cost. Qwen3-VL shifted the question toward where and at what granularity visual features should participate in LLM computation.

Part	Model	Main topics
Part I	Qwen-VL (2023)	Three-stage training, vision-language alignment, unified multitask modeling
Part II	Qwen2-VL (2024)	M-RoPE, 3D convolution, native dynamic resolution
Part III	Qwen2.5-VL (2025)	Window attention, dynamic FPS, rejection sampling, and CoT
Part IV	Qwen3-VL (2025)	Interleaved MRoPE, DeepStack, explicit timestamps

1. Part I: Qwen-VL: Three-Stage Vision-Language Alignment (2023)

Qwen-VL is based on Qwen-7B. Its main contribution is a progressive three-stage training pipeline: Align, Enhance, and Chat. Later generations changed position encoding, added video input, and expanded the training data, but retained this staged training pattern.

1.1. Progressive Capability Building

Training gradually relaxes parameter and task constraints. The first stage maps visual features into Qwen-7B’s language space. The second adds fine-grained grounding, OCR, and VQA capabilities through multitask data. The third adjusts instruction following and conversational behavior.

This sequence balances training stability against fusion depth. Updating the LLM immediately with noisy web image-text pairs risks damaging its language capabilities. Keeping it frozen throughout training prevents it from learning the relationship between spatial information, visual details, and textual instructions.

The three objectives are:

Align: establish a basic image-text mapping.
Enhance: add grounding, OCR, chart understanding, and related abilities through multitask learning.
Align with Humans: turn the model into the instruction-following Qwen-VL-Chat.

1.2. Stage 1: Pre-training

The first stage trains the visual encoder and adapter to compress an image into a feature sequence accepted by the LLM. It uses roughly 1.4 billion cleaned web image-text pairs from sources such as LAION, DataComp, and Coyo.

The sample format is:

<img> [visual feature sequence] </img> [text description] <eos>

The visual encoder and adapter convert an image into 256 vectors. Qwen-7B remains frozen, while the ViT and adapter are optimized with the standard autoregressive cross-entropy objective. This stage produces coarse alignment rather than reliable grounding, OCR, or visual reasoning.

1.3. Stage 2: Multi-task Pre-training

The second stage introduces higher-quality annotations across seven task families. Prompts and context are excluded from the loss, while answers, captions, coordinates, and OCR text become generation targets.

Task	Input	Target	Training signal
Image Captioning	Image and caption prompt	Caption	Image description
VQA	Image and question	Answer	Visual question answering
OCR VQA	Image and text-related question	Answer	Reading text in images
Caption with Grounding	Image and grounding prompt	Caption with boxes	Joint captioning and localization
Referring Grounding	Image and referring phrase	Box coordinates	Text-to-region localization
Grounded Captioning	Image and a specified box	Region description	Region-to-text description
OCR	Image and OCR prompt	Text with quadrilateral coordinates	Text recognition and localization

Grounding is modeled as text generation rather than through a separate detection head. The model emits sequences such as <ref>...</ref>, <box>...</box>, and <quad>...</quad>. This gives up some structural priors from specialized detectors, but lets every task share the LLM’s autoregressive objective.

Pure-text data is mixed into this stage to reduce catastrophic forgetting. The ViT, adapter, and LLM are all unfrozen because grounding, OCR, and chart understanding require the LLM to learn spatial relations and instruction semantics.

1.4. Stage 3: Supervised Fine-tuning

The third stage produces Qwen-VL-Chat. Its multimodal instruction and dialogue data includes human-authored samples and data generated with stronger models such as GPT-4. Conversations may contain one or multiple images and follow ChatML formatting.

The visual encoder is frozen again, while the adapter and LLM are trained to improve response organization, instruction following, and dialogue behavior.

Cross-entropy loss is computed only on assistant answers and special markers, not on role names or user prompts. This matches inference, where user input is context and the assistant response is the sequence to predict.

Part I Summary: Qwen-VL establishes basic vision-language alignment through staged training. Its limitations are equally clear: resizing every image to 448×448 loses detail, video is unsupported, and absolute position embeddings are poorly suited to richer multimodal coordinates.

2. Part II: Qwen2-VL: Native Dynamic Resolution and Multimodal Position Encoding (2024)

2.1. Main Changes from Qwen-VL

Qwen2-VL focuses on input representation and position encoding:

It removes absolute position embeddings and adopts 2D-RoPE, allowing images to retain their aspect ratios and use dynamic resolutions.
It introduces M-RoPE to represent text, images, and videos in a shared spatiotemporal coordinate system.
It uses a depth-2 3D convolution to merge 2D patches from adjacent frames into 3D tubes.
It expands multilingual capabilities.

The central question changes from how to connect an image to an LLM to how to assign consistent coordinates across modalities.

2.2. M-RoPE

M-RoPE assigns each token three position components: \((t, h, w)\). Text, images, and videos still enter the model as one sequence, but their position encoding no longer depends on a single one-dimensional index.

2.2.1. Computation

For input features \(X \in \mathbb{R}^{B \times L \times D}\), each token has temporal, height, and width indices \(P_t\), \(P_h\), and \(P_w\), each shaped \((B, L)\). The hidden dimension is split into three subspaces:

\[ X_t = X[\ldots, 0:D_t], \quad X_h = X[\ldots, D_t:D_t + D_h], \quad X_w = X[\ldots, D_t + D_h:D] \]

RoPE is applied independently:

\[ X'_t = \text{RoPE}(X_t, P_t), \quad X'_h = \text{RoPE}(X_h, P_h), \quad X'_w = \text{RoPE}(X_w, P_w) \]

The subspaces are concatenated:

\[ X_{out} = \text{Concat}(X'_t, X'_h, X'_w, \text{dim}=-1) \]

The output remains shaped \((B, L, D)\). Temporal differences are primarily represented in \(X'_t\), while row and column differences appear in \(X'_h\) and \(X'_w\).

2.2.2. Problems Addressed

2.2.2.2. Position Extrapolation in Long Videos

A 1,000-frame video with 256 tokens per frame produces 256,000 tokens. A one-dimensional RoPE index would therefore exceed 250,000, far beyond a model trained only up to 32k positions.

M-RoPE decomposes this large index. The sequence may contain 250k tokens, but the temporal index may reach only 1,000 while height and width indices may remain below 16. Spatial coordinates therefore stay in a familiar range, and temporal growth does not simultaneously distort spatial positions.

2.2.3. Spatiotemporal Downsampling with 3D Convolution

2.2.3.1. Purpose

Adjacent video frames contain substantial redundancy. Qwen2-VL uses a depth-2 3D convolution to process the same spatial region in two adjacent frames as a \(2 \times 14 \times 14\) tube. This approximately halves the number of visual tokens for a fixed video duration. Images can be duplicated into two frames, allowing images and videos to share a similar input interface.

2.2.3.2. Implementation

If a conventional ViT produces \(N\) patches per frame, independently processing \(T\) frames creates \(T \times N\) tokens. The depth-2 kernel merges corresponding spatial patches from every two adjacent frames before they enter the later visual stack.

2.3. Training

Qwen2-VL still uses next-token prediction and computes cross-entropy only on text tokens. The LLM is initialized from Qwen2 1.5B, 7B, or 72B, while the ViT is initialized from DFN with absolute position embeddings replaced by 2D-RoPE.

2.3.1. Main Training Principles

2.3.1.1. Stage 1: ViT Training

The first stage trains the ViT and adapter while freezing the LLM. It uses 600B tokens of large-scale weakly labeled image-text data to adapt the ViT to 2D-RoPE and align visual features with Qwen2’s semantic space.

2.3.1.2. Stage 2: Full-Parameter Pre-training

The second stage unfreezes all parameters and adds 800B tokens, bringing the cumulative total to 1.4T. The data covers interleaved image-text documents, OCR, video, and pure text. Native dynamic resolution, M-RoPE, and 3D convolution are active in this stage.

2.3.1.3. Stage 3: Instruction Fine-tuning

The third stage freezes the ViT and trains the LLM on ChatML-formatted multimodal dialogue, long-video QA, agent trajectories, and text-only instructions. Loss is computed only on assistant responses.

Part II Summary: Qwen2-VL addresses fixed image resolution and native video input. M-RoPE provides a shared coordinate system, 3D convolution reduces video tokens, and dynamic resolution avoids unnecessary resizing. These changes expose new bottlenecks: global ViT attention scales quadratically with high-resolution inputs, while video time is still represented by relative frame indices rather than physical time.

3. Part III: Qwen2.5-VL: Inference Efficiency, Time Modeling, and Training Data Quality (2025)

3.1. Main Changes from Qwen2-VL

Qwen2.5-VL introduces window attention to control high-resolution inference cost, dynamic FPS sampling for videos with different sampling rates, and absolute-time MRoPE to align temporal positions with physical time.

3.2. Window Attention

Global ViT attention has complexity \(O(N^2)\). Under dynamic resolution, \(N\) grows with image area. Qwen2.5-VL limits most layers to local windows, making their cost approximately linear in image area.

3.2.1. Computation

The image dimensions are adjusted to multiples of 28 and split into \(14 \times 14\) patches:

\[ L = (H/14) \times (W/14) \]

A \(112 \times 112\) pixel window contains \(8 \times 8 = 64\) patches:

\[ N_{win} = \frac{L}{8 \times 8} = \frac{L}{64} \]

Each window independently computes:

\[ \text{Attention}(Q, K, V) = \text{Softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V \]

Local attention weakens cross-window communication, so layers {7, 15, 23, 31} retain full self-attention.

3.3. Dynamic FPS Sampling

Video tokens are still organized as 3D tubes with a temporal stride of 2. For \(T\) sampled frames at resolution \(H \times W\), the token count changes from:

3.3.1. Computation

\[ T \times (H/14) \times (W/14) \]

to:

\[ \frac{T}{2} \times (H/14) \times (W/14) \]

The model can accept inputs sampled at rates such as 0.5 FPS or 2 FPS, provided that the physical time represented by each tube is encoded later.

3.4. MRoPE with Absolute Time

Relative frame numbers cannot distinguish whether two adjacent sampled frames are 0.1 seconds or 10 seconds apart. Qwen2.5-VL maps each visual tube to its physical timestamp:

3.4.1. Computation

\[ ID_t^{(i)} = \text{Round}(t_{abs}^{(i)} \times v) \]

If \(v=2\), timestamps 0.0s, 0.5s, and 2.0s map to positions 0, 1, and 4. Position IDs may therefore be discontinuous. The temporal, height, and width components are applied as:

\[ 0,\quad 1,\quad 4 \]

The hidden dimension is split into temporal, height, and width subspaces:

\[ X_t = X[\ldots, 0:D_t], \quad X_h = X[\ldots, D_t:D_t + D_h], \quad X_w = X[\ldots, D_t + D_h:D] \]

RoPE is then applied independently:

\[ X'_t = \text{RoPE}(X_t, ID_t), \quad X'_h = \text{RoPE}(X_h, ID_h), \quad X'_w = \text{RoPE}(X_w, ID_w) \]

The three results are concatenated into the complete feature:

\[ X_{out} = \text{Concat}(X'_t, X'_h, X'_w, \text{dim}=-1) \]

For video tokens \(i\) and \(j\), temporal phase differences depend on:

\[ \Delta_t = ID_t^{(i)} - ID_t^{(j)} \]

The model can therefore distinguish physical time spans rather than only relative frame distances.

3.5. How the Three Mechanisms Work Together

Dynamic FPS determines which frames enter the model, and 3D tubes reduce sequence length. Absolute-time MRoPE records the physical timestamp represented by each tube. Most ViT layers then use local window attention, while a few full-attention layers exchange information across windows. Together, these mechanisms address long video sequences, irregular sampling intervals, and the cost of global attention on high-resolution inputs.

3.6. Difference from Swin Transformer’s Shifted Windows

3.6.1. Cross-Window Information Exchange

Swin Transformer alternates standard and shifted windows so information propagates through overlapping regions. Qwen2.5-VL keeps fixed, non-overlapping windows in most layers and inserts a small number of full-attention layers.

3.6.2. Why Not Use Shifted Windows?

Dynamic image dimensions make shifted windows more expensive to implement efficiently because irregular boundaries require additional padding and masking. Fixed windows plus occasional full attention are simpler under dynamic resolution and can still use optimized kernels such as FlashAttention.

3.7. Training

Qwen2.5-VL uses five stages across pre-training and post-training. Its data scale expands to 4.1T tokens, with more emphasis on high-resolution inputs, long videos, reasoning data, and preference alignment.

3.7.1. Pre-training

Stage 1: Visual Encoder Initialization

Only the redesigned ViT is trained on image-text pairs, visual knowledge data, and OCR data to establish stable visual representations.

Stage 2: Multimodal Pre-training

All parameters are unfrozen. Interleaved image-text data, VQA, multitask data, and pure text are trained with an 8,192-token context.

Stage 3: Long-Context Pre-training

The context length expands to 32,768 tokens. Long videos, agent trajectories, and high-resolution documents are added. Dynamic packing reduces load imbalance across samples with different visual token counts.

3.7.2. Post-training

Stage 4: Supervised Fine-tuning

The ViT is frozen and the LLM is trained on roughly 2 million ChatML samples, split evenly between text-only and multimodal dialogue. Rule-based filtering removes duplicates and corrupt samples, while a 72B model filters image-text relevance.

For mathematics, code, and selected VQA tasks, rejection sampling generates multiple CoT candidates. Ground-truth answers or verifiers retain only correct, high-quality reasoning traces.

Stage 5: Direct Preference Optimization

DPO continues to freeze the ViT and optimizes the LLM from preferred and rejected answer pairs. It is used on image-text and pure-text data to reduce hallucination and improve preference alignment.

3.7.3. Rejection Sampling

Rejection sampling acts as Best-of-N data construction:

An intermediate Qwen2.5-VL generates \(N\) responses for one prompt.
Hard verification checks final mathematical answers, code tests, or VQA ground truth.
Quality filters remove code switching, repetition, excessive length, and malformed outputs.
Retained CoT samples are added back into the SFT dataset.

This process fills in reasoning traces for datasets that contain only questions and final answers. It also produces verified samples closer to the current model’s output distribution. For VLMs, visual-text consistency remains essential: a well-formatted CoT that describes objects absent from the image must still be rejected.

Part III Summary: Qwen2.5-VL keeps Qwen2-VL’s input representation while addressing its compute and data bottlenecks. Window attention controls high-resolution cost, dynamic FPS and absolute-time MRoPE improve video-time representation, and 4.1T tokens plus rejection sampling broaden training coverage and improve CoT data quality.

4. Part IV: Qwen3-VL: Deep Vision-Language Fusion (2025)

Qwen3-VL addresses uneven frequency allocation across MRoPE axes and the limited fusion depth caused by injecting visual information only at the LLM input. Its main changes are Interleaved MRoPE, DeepStack, and explicit video timestamps.

4.1. Main Architectural Changes

4.1.1. Interleaved MRoPE

Standard MRoPE assigns contiguous embedding blocks to time, height, and width. This can restrict each axis to a particular frequency range. Interleaved MRoPE distributes the three axes throughout the embedding dimension, allowing each one to access both low- and high-frequency bands.

4.1.2. DeepStack

Conventional vision-language models usually project only the final ViT layer into the LLM. This interface is simple, but low-level texture and small-object information may disappear from deep semantic features.

DeepStack extracts visual tokens from multiple SigLIP-2 layers. Projected low- to high-level features are injected through residual connections into the first three LLM layers. The model gains access to both semantic and fine-grained visual features without appending additional tokens to the context sequence.

4.1.3. Explicit Video Timestamps

Qwen2.5-VL represents physical time through time-synchronized MRoPE, which can produce large, sparse position IDs in long videos. Qwen3-VL instead inserts textual timestamps before groups of video frames, using formats such as <125.5 seconds> and <00:02:05>. This makes time directly available in the text sequence and reduces dependence on a fixed sampling rate.

5. Summary and Open Questions

5.1. Technical Progression Across Four Generations

Dimension	Qwen-VL (2023)	Qwen2-VL (2024)	Qwen2.5-VL (2025)	Qwen3-VL (2025)
Visual encoder	ViT + fixed resolution	ViT + native dynamic resolution	ViT + window attention	SigLIP-2 + DeepStack
Position encoding	Absolute position embedding	Blockwise M-RoPE	M-RoPE + absolute time	Interleaved MRoPE
Video processing	Unsupported	3D convolutional downsampling	Dynamic FPS sampling	Explicit textual timestamps
Training	Three progressive stages	ViT → full parameters → SFT	Five stages with long context and DPO	Extended staged training
Main change	Basic alignment pipeline	Unified multimodal coordinates	Inference efficiency and data quality	Deep visual fusion

5.2. Recurring Design Patterns

Training stability comes first: visual components are aligned before the LLM participates in more complex tasks.
A unified serialized interface: grounding coordinates, OCR text, and video timestamps are represented as text whenever possible.
Time becomes increasingly explicit: no native video support gives way to relative frame indices, absolute-time positions, and finally textual timestamps.
Data quality gains importance: the pipeline moves from web image-text pairs to multitask annotations and verified CoT generated through rejection sampling.

5.3. Open Questions

DeepStack shifts the question from where to attach the visual encoder to what visual granularity should participate in which LLM layers. Explicit timestamps also show that a longer context window alone does not solve video understanding; time representation and sampling strategy affect localization and long-video description stability.

Two questions remain especially important: whether deep visual injection introduces training instability or modality interference, and how well explicit timestamps generalize across sampling rates and long-video QA tasks.

6. Further Reading

Visual encoders

An Image is Worth 16x16 Words (ViT)
Swin Transformer
Sigmoid Loss for Language Image Pre-Training (SigLIP)

Position encoding and video modeling

Vision-language fusion

Training and alignment

Dynamic resolution and related models

Su Jianlin’s blog Scientific Spaces provides additional derivations and discussions of RoPE, NTK-aware extrapolation, and multimodal position encoding.

This note is based on personal paper reading and may contain omissions.