Reinforcement Learning

Reward modeling, RLHF-style optimization, reasoning RL, agent RL, and VLA policy learning.

32 Papers

68 Resource links

2026.06 Latest month

7 papers

Policy Optimization

2026.06 Policy Optimization

Rethinking the Divergence Regularization in LLM RL

This paper proposes Divergence Regularized Policy Optimization (DRPO), replacing DPPO's hard divergence mask with a smooth advantage-weighted quadratic regularizer that preserves its trust-region geometry. DRPO provides bounded continuous gradient weights and corrective signals beyond the trust-region boundary, improving LLM RL training stability and efficiency.

Paper Code

2026.05 Policy Optimization

Constraint-Infused Policy Optimization: Principles and Practices for Harnessing Advanced LLM Reasoning

This paper formulates LLM reinforcement learning as constrained policy optimization, unifying existing algorithms through different constraint choices and exposing the roles of clipping, KL regularization, and trust regions. It derives Constraint-Infused Policy Optimization (CIPO), which improves reasoning performance and training stability across diverse tasks and model families.

Paper Code

2025.07 Policy Optimization

Group Sequence Policy Optimization

This paper introduces GSPO, a reinforcement learning algorithm for LLMs that replaces token-level importance ratios with sequence-level likelihood ratios and performs sequence-level clipping, rewarding, and optimization. GSPO improves training efficiency and performance over GRPO, stabilizes MoE RL training, and helps simplify large-scale RL infrastructure for Qwen3 models.

Paper Project
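
As a rough illustration of the token-level versus sequence-level distinction in the GSPO summary above, the sketch below computes both kinds of importance ratio from per-token log-probabilities. The length normalization (geometric mean over tokens) reflects our reading of the paper; the function names and the `clip_eps` value are illustrative assumptions, not taken from any released code.

```python
import math

def token_level_ratios(new_logps, old_logps):
    """Per-token importance ratios, one per generated token."""
    return [math.exp(n - o) for n, o in zip(new_logps, old_logps)]

def sequence_level_ratio(new_logps, old_logps):
    """Length-normalized sequence likelihood ratio: exp(mean per-token log-ratio)."""
    diffs = [n - o for n, o in zip(new_logps, old_logps)]
    return math.exp(sum(diffs) / len(diffs))

def clipped_sequence_objective(new_logps, old_logps, advantage, clip_eps=0.2):
    """Clipped surrogate applied once per response rather than once per token.

    clip_eps is a placeholder value chosen for illustration only.
    """
    ratio = sequence_level_ratio(new_logps, old_logps)
    clipped = max(1.0 - clip_eps, min(1.0 + clip_eps, ratio))
    return min(ratio * advantage, clipped * advantage)
```

In this sketch a single ratio governs the whole response, so one token with an extreme ratio cannot be clipped independently of the rest of the sequence.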

2025.03 Policy Optimization

DAPO: An Open-Source LLM Reinforcement Learning System at Scale

This paper introduces Decoupled Clip and Dynamic Sampling Policy Optimization (DAPO), an open large-scale reinforcement learning system for eliciting LLM reasoning. It releases the training recipe, code, dataset, and model weights, reaching 50 points on AIME 2024 with Qwen2.5-32B and improving reproducibility for large-scale LLM RL.

Paper Project Code Hugging Face

2024.02 Policy Optimization

KTO: Model Alignment as Prospect Theoretic Optimization

This paper frames successful LLM alignment losses as human-aware losses that encode biases from prospect theory, then introduces KTO to optimize generation utility directly from binary desirable/undesirable feedback. KTO matches or exceeds preference-pair methods from 1B to 30B scales, highlighting how the best alignment loss depends on the setting's inductive biases.

Paper

2023.05 Policy Optimization

Direct Preference Optimization: Your Language Model is Secretly a Reward Model

This paper introduces Direct Preference Optimization (DPO), which reparameterizes the RLHF reward model so the optimal policy can be learned directly from preference data with a simple classification loss. DPO removes separate reward-model fitting and online reinforcement learning while matching or improving PPO-based RLHF with simpler, more stable training.

Paper Code
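
The "simple classification loss" mentioned in the DPO summary above can be sketched for a single preference pair as follows. This is a minimal scalar version, assuming the inputs are summed log-probabilities of each full response under the policy and under a frozen reference model; the function name, argument names, and the default `beta` are illustrative assumptions.

```python
import math

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Negative log-sigmoid of the scaled implicit-reward margin for one pair."""
    # Implicit reward of a response: beta * (policy log-prob - reference log-prob).
    chosen_reward = beta * (policy_chosen_logp - ref_chosen_logp)
    rejected_reward = beta * (policy_rejected_logp - ref_rejected_logp)
    margin = chosen_reward - rejected_reward
    # -log(sigmoid(margin)) equals softplus(-margin); computed in a stable form.
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))
```

When the policy equals the reference the margin is zero and the loss is log 2; the loss falls as the policy raises the chosen response's likelihood relative to the rejected one.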

2017.07 Policy Optimization

Proximal Policy Optimization Algorithms

This paper introduces Proximal Policy Optimization (PPO), a family of policy-gradient methods that alternates environment sampling with multiple minibatch epochs on a surrogate objective. PPO retains key trust-region benefits while being simpler to implement and empirically balancing sample efficiency, performance, and wall-clock time.

Paper Project Code
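
The surrogate objective named in the PPO summary above is most often used in its clipped form. The sketch below shows that clipped variant on plain Python lists, assuming the probability ratios and advantage estimates have already been computed; the function name and the default `clip_eps` are illustrative choices.

```python
def ppo_clip_objective(ratios, advantages, clip_eps=0.2):
    """Mean clipped surrogate over a minibatch (a quantity to be maximized).

    ratios[i] is pi_new(a|s) / pi_old(a|s) for sample i.
    """
    total = 0.0
    for ratio, adv in zip(ratios, advantages):
        clipped_ratio = max(1.0 - clip_eps, min(1.0 + clip_eps, ratio))
        # Taking the minimum makes the objective a pessimistic bound, so moving
        # the ratio far outside the clip range yields no extra gain.
        total += min(ratio * adv, clipped_ratio * adv)
    return total / len(ratios)
```

For example, with ratios `[1.5, 0.5]` and advantages `[1.0, -1.0]`, the per-sample terms are 1.2 and -0.8, giving a mean of 0.2.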

5 papers

OPD

2026.05 OPD

Anti-Self-Distillation for Reasoning RL via Pointwise Mutual Information

This paper analyzes why on-policy self-distillation can hurt math reasoning, showing through pointwise mutual information that privileged context overemphasizes solution-implied tokens while suppressing deliberation tokens needed for search. It proposes AntiSD, which ascends rather than descends the self-distillation divergence with an entropy gate, reaching GRPO-level accuracy in 2 to 10x fewer steps and improving final accuracy by up to 11.5 points.

Paper Code

2026.05 OPD

Draft-OPD: Adapting Speculative Draft Models from LLMs via On-Policy Distillation

This paper proposes Draft-OPD, which adapts speculative draft models from RL-trained LLM traces through on-policy distillation without requiring expensive online generation for the draft model. It proves an equivalence between RL training and OPD-style distillation, reuses collected RL experience, and improves speculative decoding speed by up to 2.14x while preserving task performance.

Paper Project Code Hugging Face

2026.05 OPD

OmniOPD: Logit-Free On-Policy Distillation via Speculative Verification

This paper introduces OmniOPD, a logit-free on-policy distillation framework that replaces brittle token-level logit matching with Monte Carlo chunk rollouts scored by semantic similarity, enabling black-box teachers. A peak-entropy scheduler focuses verification on uncertain reasoning forks, while Bayesian smoothing and a base-model KL anchor stabilize training; it outperforms standard OPD by up to 28.64% on math.

Paper

2026.04 OPD

Self-Distilled RLVR

This paper studies on-policy self-distillation for RLVR and shows that relying only on a privileged self-teacher can cause information leakage and unstable long-term training. It proposes RLSD, which uses self-distillation to estimate token-level update magnitudes while keeping RLVR's environment feedback as the reliable update direction.

Paper Hugging Face

2026.02 OPD

Learning beyond Teacher: Generalized On-Policy Distillation with Reward Extrapolation

This paper shows that on-policy distillation is a special case of dense KL-constrained RL, then proposes G-OPD with a flexible reference model and reward scaling factor. Its reward extrapolation variant, ExOPD, improves over standard OPD and can let students surpass domain teachers when merging RL-trained experts.

Paper Code Hugging Face

4 papers

Reward Modeling

2026.03 Reward Modeling

MemReward: Graph-Based Experience Memory for LLM Reward Prediction with Limited Labels

This paper introduces MemReward, a graph-based experience memory framework for reward prediction with limited labels. It achieves 97.3% of Oracle performance on 3B models and 96.6% on 1.5B models, and surpasses Oracle on out-of-domain tasks.

Paper Code Hugging Face

2026.03 Reward Modeling

Scaling Reward Modeling without Human Supervision

This paper studies unsupervised reward model scaling by learning preferences over web-corpus document prefixes and suffixes without human annotations. It reports consistent RewardBench gains across model backbones and shows downstream improvements in best-of-N selection and policy optimization.

Paper

2026.01 Reward Modeling

Reward Modeling from Natural Language Human Feedback

This paper studies RLVR on preference data for training Generative Reward Models (GRMs). It demonstrates that binary classification tasks make GRMs susceptible to guessing correct outcomes without sound critiques, and it proposes a method to address this limitation.

Paper

2025.10 Reward Modeling

A Survey of Process Reward Models: From Outcome Signals to Process Supervisions for Large Language Models

This survey reviews Process Reward Models for evaluating and guiding LLM reasoning at the step or trajectory level rather than only judging final answers. It organizes the full loop of process data generation, PRM construction, and PRM use in test-time scaling and reinforcement learning across math, code, multimodal reasoning, robotics, and agents.

Paper

1 paper

Video Generation RL

2026.05 Video Generation RL

KVPO: ODE-Native GRPO for Autoregressive Video Alignment via KV Semantic Exploration

KVPO aligns streaming autoregressive video generators with human preferences using an ODE-native online GRPO framework. It replaces noise-based exploration with causal-semantic routing of historical KV cache entries and optimizes a velocity-field surrogate policy based on Trajectory Velocity Energy.

Paper Project Code Hugging Face

1 paper

Multimodal RL

2025.09 Multimodal RL

CapRL: Stimulating Dense Image Caption Capabilities via Reinforcement Learning

This paper introduces CapRL, the first RLVR framework for open-ended image captioning, which rewards captions by whether a vision-free language model can answer image questions using only the generated description. The resulting CapRL-3B model produces more informative and diverse captions, while its generated caption data improves large vision-language model pretraining across 12 benchmarks.

Paper Code Hugging Face

8 papers

Reasoning RL

2026.04 Reasoning RL

All Roads Lead to Rome: Incentivizing Divergent Thinking in Vision-Language Models

This paper presents MUPO, a reinforcement learning approach that addresses diversity collapse in GRPO-trained VLMs by incentivizing divergent thinking across multiple solutions, enabling deeper yet broader reasoning patterns.

Paper Project Code Hugging Face

2026.04 Reasoning RL

VL-Calibration: Decoupled Confidence Calibration for Large Vision-Language Models Reasoning

This paper introduces VL-Calibration, a reinforcement learning framework that separates visual and reasoning confidence in large vision-language models to address confidently incorrect predictions. It estimates visual certainty from image-perturbation grounding and token entropy, then applies token-level advantage reweighting to improve calibration and visual reasoning accuracy.

Paper Code

2026.03 Reasoning RL

The Art of Efficient Reasoning: Data, Reward, and Optimization

This paper studies efficient reasoning in LLMs, using RL to incentivize short, accurate trajectories. It reports findings on training stages, rewards, and generalization across models from 0.6B to 30B parameters.

Paper Project

2026.03 Reasoning RL

FIPO: Eliciting Deep Reasoning with Future-KL Influenced Policy Optimization

This paper presents FIPO, a reinforcement learning algorithm that overcomes reasoning bottlenecks in LLMs by addressing coarse-grained credit assignment in GRPO-style training, where outcome-based rewards fail to distinguish critical logical pivots from trivial tokens.

Paper

2026.02 Reasoning RL

Agentic Proposing: Enhancing Large Language Model Reasoning via Compositional Skill Synthesis

This paper introduces Agentic Proposing, a framework that uses a specialized agent with Multi-Granularity Policy Optimization (MGPO) to dynamically select and compose modular reasoning skills for synthesizing high-precision training trajectories.

Paper

2025.05 Reasoning RL

The Entropy Mechanism of Reinforcement Learning for Reasoning Language Models

This paper studies policy entropy collapse as a bottleneck in RL for reasoning language models, showing an empirical relationship between entropy and downstream performance that makes the performance ceiling predictable. It derives entropy dynamics from the covariance between action probability and logit updates, then proposes Clip-Cov and KL-Cov to preserve exploration and improve downstream performance.

Paper

2025.01 Reasoning RL

DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning

This paper shows that pure reinforcement learning can directly elicit advanced reasoning behaviors in LLMs without human-labeled reasoning traces. The proposed framework induces self-reflection, verification, and adaptive strategy use, leading to strong gains on math, coding, and STEM reasoning tasks.

Paper Hugging Face

2024.02 Reasoning RL

DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models

This paper introduces DeepSeekMath 7B, combining a carefully engineered web-scale math data selection pipeline with Group Relative Policy Optimization (GRPO), a PPO variant. The approach improves mathematical reasoning while reducing PPO's memory usage, reaching strong competition-level MATH performance without external tools or voting.

Paper Code Hugging Face
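
The memory saving mentioned in the DeepSeekMath summary above comes from replacing a learned value network with a baseline computed from a group of responses sampled for the same prompt. A minimal sketch of that group-relative advantage, assuming one scalar reward per response, is shown below; the function name, the `eps` constant, and the use of the population standard deviation are illustrative assumptions, and implementations differ on these details.

```python
import statistics

def group_relative_advantages(rewards, eps=1e-6):
    """Normalize each reward against the mean and spread of its own group."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # population std; an assumption of this sketch
    # eps avoids division by zero when every response receives the same reward.
    return [(r - mean) / (std + eps) for r in rewards]
```

If all responses in a group receive the same reward, every advantage is zero and that prompt contributes no gradient signal.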

5 papers

Agentic RL

2026.05 Agentic RL

Spreadsheet-RL: Advancing Large Language Model Agents on Realistic Spreadsheet Tasks via Reinforcement Learning

Spreadsheet-RL is an RL fine-tuning framework for training specialized spreadsheet agents in a realistic Microsoft Excel environment, addressing complex multi-step workflows that prompting-based agents struggle with. It adds automated start-goal spreadsheet data collection, a multi-turn Spreadsheet Gym with sandboxed Excel tools, and a Domain-Spreadsheet benchmark to improve real-world spreadsheet automation.

Paper Project Code Hugging Face

2026.02 Agentic RL

Search-R2: Enhancing Search-Integrated Reasoning via Actor-Refiner Collaboration

This paper introduces Actor-Refiner collaboration to address the multi-scale credit assignment problem in search-integrated reasoning RL, where sparse trajectory-level rewards fail to distinguish high-quality reasoning from fortuitous guesses. The approach reduces redundant or misleading search behaviors.

Paper

2026.01 Agentic RL

Arena-RL: Training LLMs as Game Players with Vision-Language Action Models

This paper introduces Arena-RL, a reinforcement learning framework that trains LLM-driven agents to play visual games via vision-language action models, focusing on policy improvement from interactive game feedback. It demonstrates that reward-driven optimization over game trajectories can significantly improve strategic decision-making and generalization across game environments.

Paper

2025.03 Agentic RL

Search-R1: Training LLMs to Reason and Leverage Search Engines with Reinforcement Learning

This paper introduces Search-R1, an RL framework where LLMs learn to autonomously generate search queries during step-by-step reasoning with real-time retrieval, improving their ability to acquire external knowledge and up-to-date information.

Paper Code

2025.01 Agentic RL

Search-o1: Agentic Search-Enhanced Large Reasoning Models

This paper introduces Search-o1, a framework that enhances large reasoning models with an agentic retrieval-augmented generation mechanism and a Reason-in-Documents module for refining retrieved documents, addressing knowledge insufficiency in extended reasoning processes.

Paper Code

1 paper

VLA RL

2025.11 VLA RL

SRPO: Self-Referential Policy Optimization for Vision-Language-Action Models

This paper proposes SRPO, a reinforcement learning framework for vision-language-action models that replaces sparse binary rewards with progress-wise rewards derived from the model's own successful trajectories. It uses latent world-model representations to measure behavioral progress robustly and achieves state-of-the-art manipulation success on LIBERO with far fewer RL steps.

Paper