2026.05
Foundation Models
The MiniMax-M2 Series: Mini Activations Unleashing Max Real-World Intelligence
This technical report presents the MiniMax-M2 series, MoE language models with a small active-parameter footprint designed for real-world agentic deployment. It combines agent-driven verifiable data pipelines, the Forge agent-native RL system, and early self-evolution in M2.7 to improve coding, deep-search, office-task, and reasoning performance.
2026.04
Foundation Models
The Latent Space: Foundation, Evolution, Mechanism, Ability, and Outlook
This survey argues that continuous latent space is becoming a native computational substrate for language-based models, addressing the inefficiencies of explicit token-level generation such as redundancy, discretization bottlenecks, and semantic loss. It further organizes the field through mechanism and ability perspectives, and outlines key open challenges for future research.
2026.02
Foundation Models
GLM-5: from Vibe Coding to Agentic Engineering
GLM-5 is a next-generation foundation model targeting long-horizon agentic engineering, with reduced training and inference cost and preserved long-context capability. It introduces asynchronous RL infrastructure and agent RL algorithms to improve post-training efficiency and real-world coding performance.
2026.02
Foundation Models
Kimi K2.5: Visual Agentic Intelligence
This paper introduces an open-source multimodal agentic model that jointly optimizes text and vision through unified pretraining, SFT, and reinforcement learning. It also proposes Agent Swarm, a parallel orchestration framework for decomposing and executing complex tasks with coordinated agents.
2026.01
Foundation Models
MiMo-V2-Flash Technical Report
MiMo-V2-Flash is a 309B-parameter MoE foundation model with 15B active parameters, built for fast reasoning, coding, and agentic workloads. It relies on hybrid sliding-window/global attention, 27T-token pretraining, and long-context extension to 256K tokens. It also introduces Multi-Teacher On-Policy Distillation for scalable post-training and repurposes multi-token prediction as a draft model to speed up speculative decoding.
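As a rough illustration of the hybrid attention idea mentioned in this entry, the sketch below builds a causal mask that is either global (every earlier position is visible) or restricted to a sliding window. The sequence length, the window size, and the helper names are hypothetical choices for the example; the summary does not say how MiMo-V2-Flash sizes its windows or interleaves the two layer types.

```python
def causal_mask(seq_len, window=None):
    """Return a seq_len x seq_len boolean mask.

    mask[i][j] is True when query position i may attend to key position j.
    window=None gives full (global) causal attention; an integer limits each
    query to its most recent `window` positions, itself included.
    """
    mask = []
    for i in range(seq_len):
        row = []
        for j in range(seq_len):
            visible = j <= i  # causal: never look at future positions
            if window is not None:
                visible = visible and (i - j) < window
            row.append(visible)
        mask.append(row)
    return mask


def attended_pairs(mask):
    """Count (query, key) pairs that are attended, a proxy for attention cost."""
    return sum(sum(row) for row in mask)


# Illustrative sizes only (assumptions, not values from the report).
global_mask = causal_mask(8)
local_mask = causal_mask(8, window=3)
print(attended_pairs(global_mask), attended_pairs(local_mask))  # 36 21
```

The global mask grows quadratically with sequence length, while the windowed mask grows linearly, which is the usual motivation for mixing a few global layers with many sliding-window layers.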
2026.01
Foundation Models
Conditional Memory via Scalable Lookup: A New Axis of Sparsity for Large Language Models
This paper introduces conditional memory as a sparsity axis complementary to MoE, instantiated by Engram for constant-time lookup of static knowledge. A scaling law guides the allocation between neural computation and memory, enabling Engram models to improve knowledge, reasoning, code, math, and long-context retrieval at matched parameters and FLOPs.
2025.12
Foundation Models
DeepSeek-V3.2: Pushing the Frontier of Open Large Language Models
DeepSeek-V3.2 is an open large language model that combines efficient long-context computation with strong reasoning and agent performance. Its key ingredients include DeepSeek Sparse Attention, scalable RL post-training, and a large-scale agentic task synthesis pipeline for improving tool-use generalization and instruction-following robustness.
2025.08
Foundation Models
GLM-4.5: Agentic, Reasoning, and Coding (ARC) Foundation Models
GLM-4.5 is an open-source MoE foundation model with hybrid reasoning modes (thinking and direct response) that support agentic, reasoning, and coding tasks. It combines large-scale pretraining with RL-based post-training and is released in both full and compact variants with strong benchmark performance.
2025.07
Foundation Models
Kimi K2: Open Agentic Intelligence
Kimi K2 is a trillion-parameter MoE language model focused on agentic, reasoning, and coding capabilities with stable large-scale training. The work introduces the MuonClip optimizer, which uses QK-clip to improve optimization stability and token efficiency during pretraining.
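The general idea behind clipping query-key logits can be sketched as follows: when the largest attention logit observed for a head exceeds a threshold, the query and key projection weights are each scaled down so that the largest logit falls back to the threshold. This is a minimal toy version under stated assumptions, not the Kimi K2 implementation: the threshold `tau`, the even split of the rescaling between the two matrices, the omission of any softmax scaling factor, and all helper names are assumptions for illustration.

```python
import math


def matvec(w, x):
    """Multiply matrix w (list of rows) by vector x."""
    return [sum(wij * xj for wij, xj in zip(row, x)) for row in w]


def max_logit(w_q, w_k, inputs):
    """Largest raw query-key dot product over all pairs of input vectors."""
    queries = [matvec(w_q, x) for x in inputs]
    keys = [matvec(w_k, x) for x in inputs]
    return max(sum(a * b for a, b in zip(q, k)) for q in queries for k in keys)


def qk_clip(w_q, w_k, observed_max, tau):
    """Rescale both projections when the observed max logit exceeds tau.

    Logits are bilinear in (w_q, w_k), so scaling each matrix by
    sqrt(tau / observed_max) scales every logit by tau / observed_max.
    """
    if observed_max <= tau:
        return w_q, w_k  # nothing to clip
    scale = math.sqrt(tau / observed_max)
    scaled_q = [[scale * v for v in row] for row in w_q]
    scaled_k = [[scale * v for v in row] for row in w_k]
    return scaled_q, scaled_k


# Toy head with hypothetical weights and inputs.
w_q = [[2.0, 0.0], [0.0, 2.0]]
w_k = [[2.0, 0.0], [0.0, 2.0]]
inputs = [[1.0, 0.0], [0.0, 3.0]]

before = max_logit(w_q, w_k, inputs)
w_q, w_k = qk_clip(w_q, w_k, before, tau=9.0)
after = max_logit(w_q, w_k, inputs)
print(before, after)  # 36.0 9.0
```

Because the rescaling is applied to the weights rather than to the logits at run time, the forward pass itself is unchanged; only the parameters are nudged after an update step.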
2025.05
Foundation Models
Qwen3 Technical Report
This report presents the Qwen3 family spanning dense and MoE models across a wide parameter range, emphasizing stronger multilingual performance and efficiency. It unifies deliberative thinking and fast response modes in one framework and scales post-training to improve reasoning, coding, and agentic behavior.
2025.01
Foundation Models
MiniMax-01: Scaling Foundation Models with Lightning Attention
MiniMax-01 introduces a long-context model family built around Lightning Attention and MoE to improve scaling efficiency and practical throughput. It combines optimized parallelization and communication-computation overlap to train large models with stronger long-context performance.
2024.12
Foundation Models
DeepSeek-V3 Technical Report
DeepSeek-V3 is a 671B-parameter MoE language model with 37B activated parameters per token, built for efficient inference and cost-effective large-scale training. It extends MLA and DeepSeekMoE with auxiliary-loss-free load balancing and a multi-token prediction objective, achieving strong open-model performance with stable 14.8T-token pretraining and SFT/RL post-training.
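One way to balance expert load without an auxiliary loss is to keep a per-expert bias that influences only which experts are selected, and to nudge that bias after each step according to the observed load. The sketch below illustrates this under stated assumptions: the function names, the step size, the toy scores, and the choice to normalize gate weights over the selected experts are all hypothetical, and the summary above does not spell out the exact rule DeepSeek-V3 uses.

```python
def route_top_k(scores, bias, k):
    """Select k experts for one token and return their gate weights.

    The bias is added only when ranking experts; gate weights come from the
    original, unbiased scores, normalized over the selected experts.
    """
    ranked = sorted(range(len(scores)),
                    key=lambda e: scores[e] + bias[e],
                    reverse=True)
    chosen = ranked[:k]
    total = sum(scores[e] for e in chosen)
    return {e: scores[e] / total for e in chosen}


def update_bias(bias, load, step_size):
    """Lower the bias of overloaded experts and raise it for underloaded ones."""
    mean_load = sum(load) / len(load)
    new_bias = []
    for b, n in zip(bias, load):
        if n > mean_load:
            new_bias.append(b - step_size)
        elif n < mean_load:
            new_bias.append(b + step_size)
        else:
            new_bias.append(b)
    return new_bias


# Toy batch: four tokens with positive affinity scores for three experts.
tokens = [[0.5, 0.3, 0.2], [0.6, 0.3, 0.1], [0.45, 0.35, 0.2], [0.2, 0.5, 0.3]]
bias = [0.0, 0.0, 0.0]
for step in range(5):
    load = [0, 0, 0]
    for scores in tokens:
        for expert in route_top_k(scores, bias, k=1):
            load[expert] += 1
    bias = update_bias(bias, load, step_size=0.1)
    print(step, load, [round(b, 2) for b in bias])
```

Keeping the bias out of the gate weights means the balancing pressure changes where tokens go but not how strongly each selected expert contributes.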
2024.09
Foundation Models
Qwen2.5-Math Technical Report: Toward Mathematical Expert Model via Self-Improvement
This paper presents Qwen2.5-Math, a family of math-specialized language models that applies self-improvement throughout pre-training, post-training, and inference. The approach strengthens mathematical reasoning and tool-augmented problem solving across multiple model sizes.
2024.07
Foundation Models
Qwen2 Technical Report
This report introduces the Qwen2 series of dense and mixture-of-experts language models, covering base and instruction-tuned variants across a broad parameter range. It emphasizes stronger multilingual, coding, math, and reasoning performance while remaining competitive with proprietary systems.
2024.05
Foundation Models
DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model
DeepSeek-V2 is a 236B-parameter MoE language model with 21B activated parameters per token and a 128K context length, designed for economical training and efficient inference. It combines Multi-head Latent Attention for KV-cache compression with DeepSeekMoE sparse computation, reducing training cost and KV-cache size while improving throughput and open-model performance.
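Several entries in this list quote both total and activated parameter counts. A quick way to compare their sparsity is to compute the fraction of parameters active per token, as in the short sketch below. The figures are the ones stated in the MiMo-V2-Flash, DeepSeek-V3, and DeepSeek-V2 summaries; the helper name is a hypothetical choice for the example.

```python
# (total parameters, active parameters), both in billions, as quoted above.
models = {
    "MiMo-V2-Flash": (309, 15),
    "DeepSeek-V3": (671, 37),
    "DeepSeek-V2": (236, 21),
}


def active_fraction(total_b, active_b):
    """Fraction of the model's parameters used for each token."""
    return active_b / total_b


for name, (total_b, active_b) in models.items():
    print(f"{name}: {active_fraction(total_b, active_b):.1%} active per token")
# MiMo-V2-Flash: 4.9% active per token
# DeepSeek-V3: 5.5% active per token
# DeepSeek-V2: 8.9% active per token
```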