Source article by Chen Kun and Shi Peng. Original Chinese discussion: Zhihu.
Paper: Flexible Entropy Control in RLVR with Gradient-Preserving Perspective
Flexible Entropy Control in RLVR: Why Dynamic Clipping Matters
Policy entropy collapse is one of the most stubborn failure modes in large-model reinforcement learning. Once entropy drops too quickly, the model explores less, converges too early, and often loses the gradient signal it needs to keep improving in later training stages.
This paper argues that the problem is not just generic instability. It is tightly coupled to how PPO-style clipping reshapes updates across high-probability and low-probability tokens. From that view, the authors build a dynamic clipping framework that can intentionally increase entropy, decrease entropy, or oscillate between the two depending on training needs.
TL;DR
- Policy entropy collapse in GRPO hurts both exploration and late-stage gradient quality.
- PPO clipping creates specific ratio-space regions that either increase or decrease entropy.
- Dynamic upper and lower clipping thresholds let you steer entropy instead of merely slowing its collapse.
- The paper tests three practical schedules: Increase-then-Decrease (ID), Decrease-Increase-Decrease (DID), and Oscillatory Decay (OD).
- For broader context, continue with LLM Reinforcement Learning (RL): REINFORCE, PPO, GRPO, and Production Engineering, GRPO Training Pipeline: SFT to RL for Better Reasoning, the Token Calculator, and the Reinforcement Learning Hub.
1. Understanding Policy Entropy Collapse in RLVR
In recent years, large language models like DeepSeek-R1 have made astonishing progress on complex tasks like mathematical reasoning and code generation, largely due to reinforcement learning fine-tuning techniques such as Reinforcement Learning with Verifiable Rewards (RLVR). As a prominent RLVR algorithm, GRPO is widely adopted for its simplicity and efficiency, as it does not require a separately trained critic network. However, a persistent challenge for researchers is that as training progresses, the model's policy entropy often plummets to near zero, a phenomenon known as policy entropy collapse.
What is Policy Entropy Collapse?
Think of policy entropy as a measure of the "creativity" or "diversity" of a model's output. High entropy means the model is exploring multiple possible solution paths. Low entropy, on the other hand, means the model has become "overly confident," tending to repeat a few fixed patterns.
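To make this concrete, here is a minimal sketch of Shannon entropy over a next-token distribution. The specific probability values are illustrative, not from the paper:

```python
import math

def policy_entropy(probs):
    """Shannon entropy H = -sum(p * log p) of a next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

# An "exploring" policy spreads probability mass across many candidates...
flat = [0.25, 0.25, 0.25, 0.25]
# ...while an overconfident one concentrates on a few fixed patterns.
sharp = [0.97, 0.01, 0.01, 0.01]

print(policy_entropy(flat))   # maximal for 4 options: log(4) ~ 1.386
print(policy_entropy(sharp))  # much lower: the policy has stopped exploring
```

Entropy collapse, in these terms, is the training dynamic that drags a policy from the first kind of distribution toward the second.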
Unfortunately, GRPO training often triggers a rapid decay of entropy, leading to this collapse. This has two severe consequences:
- Premature Convergence: The model latches onto a few "seemingly effective" solution strategies early on, loses its ability to explore, and ultimately gets stuck in a local optimum.
- Vanishing Gradients: As theoretical analysis by Shen et al. shows, the norm of the policy gradient is upper-bounded by the policy's entropy. An entropy collapse, therefore, directly leads to weaker gradients, making it incredibly difficult for the model to continue improving in later training stages.
2. The Role of Gradient Clipping in Entropy Collapse
To understand policy entropy collapse, we need to trace the issue to a core mechanism in RLVR: the gradient clipping used in Proximal Policy Optimization (PPO-Clip).
How PPO-Clip and Gradient Clipping Work
Proximal Policy Optimization (PPO) stabilizes training by limiting how much the new policy can deviate from the old one. The deviation is measured by the importance sampling ratio

$$r_t(\theta) = \frac{\pi_\theta(y_t \mid y_{<t})}{\pi_{\theta_{\text{old}}}(y_t \mid y_{<t})}.$$

The objective function of PPO-Clip is:

$$\mathcal{L}^{\text{CLIP}}(\theta) = \mathbb{E}_t\!\left[\min\!\Big(r_t(\theta)\,\hat{A}_t,\;\; \text{clip}\big(r_t(\theta),\, 1-\varepsilon,\, 1+\varepsilon\big)\,\hat{A}_t\Big)\right]$$

By clipping $r_t(\theta)$ within the range $[1-\varepsilon,\, 1+\varepsilon]$, PPO ensures that policy updates are not overly aggressive.
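The clipped surrogate can be sketched per token as follows. This is a minimal scalar version for illustration, not the paper's training code:

```python
import math

def ppo_clip_objective(logp_new, logp_old, advantage, eps=0.2):
    """Per-token PPO-Clip surrogate: min(r * A, clip(r, 1-eps, 1+eps) * A)."""
    ratio = math.exp(logp_new - logp_old)             # importance sampling ratio r
    clipped = max(1.0 - eps, min(ratio, 1.0 + eps))   # clip r into [1-eps, 1+eps]
    return min(ratio * advantage, clipped * advantage)
```

With a positive advantage, once the ratio exceeds `1 + eps` the objective stops growing, so the gradient for that token is zeroed out; this gradient-killing behavior is exactly what couples clipping to entropy dynamics in the sections below.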
The Link Between PPO Clipping and Policy Entropy
Existing research (such as DAPO) has found that an overly strict clipping threshold can suppress the exploration of low-probability tokens, causing a continuous drop in entropy. The Clip-Higher strategy, proposed by DAPO, alleviates this problem by raising the upper bound threshold from $\varepsilon$ to a larger $\varepsilon_{\text{high}}$ while keeping the lower bound fixed.
However, this leaves several key questions unanswered. What is the exact relationship between gradient clipping and entropy control? Can we use gradient clipping to precisely and flexibly increase and decrease entropy? And how should we manage entropy throughout the training process for optimal results? Our core contribution lies in answering these three questions.
3. A Theoretical Framework for Entropy Dynamics
A key contribution of our work is a precise theoretical characterization of the regions in the importance sampling ratio space that either promote entropy growth or cause it to decline.
We found that our work on this topic converges with recent findings from the team at Tongyi in their paper, "On the Entropy Dynamics in Reinforcement Fine-Tuning of Large Language Models." Their accompanying technical write-up, "Entropy Dynamics in RL Training," also explores this theory and other related work.
Core Derivation
For the single-step update surrogate objective of a token $y_t$:

$$J_t(\theta) = r_t(\theta)\,\hat{A}_t, \qquad r_t(\theta) = \frac{\pi_\theta(y_t \mid y_{<t})}{\pi_{\theta_{\text{old}}}(y_t \mid y_{<t})}$$

We compute the inner product of the objective function's gradient and the global entropy gradient, $\big\langle \nabla_\theta J_t(\theta),\, \nabla_\theta \mathcal{H}(\pi_\theta) \big\rangle$. In logit space, the entropy gradient for a token $k$ is $\partial \mathcal{H}/\partial z_k = -\pi_\theta(k)\big(\log \pi_\theta(k) + \mathcal{H}\big)$, so the inner product decomposes into two terms: a token-specific term, driven by the sampled token $y_t$, and a global baseline term, driven by the rest of the vocabulary. The sign of this inner product, which indicates the direction of entropy change, is primarily determined by the token-specific term relative to the baseline. When we focus on the component of a specific token, we get the following relationship:

$$\text{sign}(\Delta \mathcal{H}) \approx \text{sign}\Big(\hat{A}_t \cdot \big(\underbrace{-\log \pi_\theta(y_t \mid y_{<t})}_{\text{surprisal}} -\; \mathcal{H}\big)\Big)$$
This leads to a crucial insight: the direction of entropy change depends on the relationship between a token's "surprisal" (how unexpected it is) and the current entropy.
The Four Entropy-Sensitive Regions
Based on our theoretical analysis, we identified four key regions:
- Region E1: $-\log \pi_\theta(y_t) < \mathcal{H}$ (low surprisal / high probability) & $\hat{A}_t > 0$ → Entropy Change: Decrease
- Region E2: $-\log \pi_\theta(y_t) > \mathcal{H}$ (high surprisal / low probability) & $\hat{A}_t > 0$ → Entropy Change: Increase
- Region E3: $-\log \pi_\theta(y_t) < \mathcal{H}$ (low surprisal / high probability) & $\hat{A}_t < 0$ → Entropy Change: Increase
- Region E4: $-\log \pi_\theta(y_t) > \mathcal{H}$ (high surprisal / low probability) & $\hat{A}_t < 0$ → Entropy Change: Decrease
In simpler terms:
- E1/E4: Reinforcing "expected" tokens → Sharper distribution → Entropy decreases
- E2/E3: Reinforcing "unexpected" tokens → Flatter distribution → Entropy increases
We verified this theory through controlled experiments. By applying gradient clipping only to specific regions, we observed entropy dynamics that perfectly matched our theoretical predictions.
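The sign relationship behind the four regions can be checked numerically on a toy softmax distribution. The sketch below takes a single gradient-ascent step on one token's logit (step size proportional to the advantage) and compares entropy before and after; the distribution and step size are illustrative assumptions:

```python
import math

def entropy(probs):
    """Shannon entropy of a probability distribution."""
    return -sum(p * math.log(p) for p in probs)

def entropy_after_update(probs, token, advantage, lr=0.01):
    """Nudge one token's logit by lr * advantage, renormalize with softmax,
    and return the entropy of the updated distribution."""
    logits = [math.log(p) for p in probs]
    logits[token] += lr * advantage
    z = sum(math.exp(l) for l in logits)
    return entropy([math.exp(l) / z for l in logits])

probs = [0.7, 0.2, 0.07, 0.03]
h = entropy(probs)  # ~0.86 nats

# E1: low-surprisal token (-log 0.7 < H), positive advantage -> entropy drops.
e1_decreases = entropy_after_update(probs, 0, +1.0) < h
# E2: high-surprisal token (-log 0.03 > H), positive advantage -> entropy rises.
e2_increases = entropy_after_update(probs, 3, +1.0) > h
```

Flipping the advantage sign on the same two tokens reproduces E3 (entropy up) and E4 (entropy down).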
4. A Solution: Dynamic Clipping for Flexible Entropy Control
Armed with these theoretical insights, we designed a dynamic clipping threshold mechanism to precisely control entropy by non-linearly modulating specific ratio regions.
Dynamic Upper Bound Clipping (to increase entropy)
When $\hat{A}_t > 0$, the upper bound threshold $1 + \varepsilon_{\text{high}}$ mainly affects regions E1 (high probability) and E2 (low probability).
A fixed high threshold (like DAPO's clip-higher) can promote exploration in region E2, but it also risks over-optimization in region E1, which causes entropy to decrease. To avoid this trade-off, we designed the upper bound threshold $\varepsilon_{\text{high}}$ to be a function of the token's current probability $\pi_{\theta_{\text{old}}}(y_t)$, where $\varepsilon_{\text{high}}$ is negatively correlated with $\pi_{\theta_{\text{old}}}(y_t)$:
- For low-probability tokens: We loosen the threshold to encourage exploration (enhancing E2).
- For high-probability tokens: We tighten the threshold to prevent over-optimization (suppressing E1).
Using a linear negative correlation, we get a threshold of the form $\varepsilon_{\text{high}}(y_t) = \varepsilon + \alpha\,\big(1 - \pi_{\theta_{\text{old}}}(y_t)\big)$ for some slope $\alpha > 0$. This yields the dynamic constraint $r_t(\theta) \le 1 + \varepsilon_{\text{high}}(y_t)$.
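A probability-dependent upper threshold of this kind can be sketched as follows. The base threshold `eps_base` and slope `alpha` are illustrative parameters, not values from the paper:

```python
def dynamic_eps_high(p_old, eps_base=0.2, alpha=0.2):
    """Upper clipping threshold that shrinks as the token's old-policy
    probability grows: loose for rare tokens (enhancing E2), tight for
    confident tokens (suppressing E1)."""
    return eps_base + alpha * (1.0 - p_old)

def clip_upper(ratio, p_old):
    """Apply only the dynamic upper bound r <= 1 + eps_high(p_old)."""
    return min(ratio, 1.0 + dynamic_eps_high(p_old))
```

A rare token (`p_old = 0.01`) gets nearly the full extra headroom, while a dominant token (`p_old = 0.9`) is clipped almost at the standard `1 + eps_base`.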
Dynamic Lower Bound Clipping (to decrease entropy)
When $\hat{A}_t < 0$, the situation is more nuanced. Negative advantage signals can be unstable: because of the normalizing nature of the softmax function, penalizing one token indirectly increases the probabilities of all other tokens.

For high-probability negative samples, a fixed high $\varepsilon_{\text{low}}$ allows excessive updates, leading to drastic shifts in the distribution. For low-probability negative samples, further suppressing their probability helps rule out suboptimal regions while having limited impact on the overall distribution.
Therefore, we adopt the opposite dynamic strategy for the lower bound:
- For high-probability tokens: We tighten the threshold to maintain stability (suppressing excessive entropy increase from E3).
- For low-probability tokens: We loosen the threshold to reinforce their exclusion (enhancing the entropy-decreasing effect of E4).
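Mirroring the upper bound, the lower threshold can be sketched with the opposite correlation. Again, `eps_base` and the slope `beta` are illustrative assumptions:

```python
def dynamic_eps_low(p_old, eps_base=0.2, beta=0.15):
    """Lower clipping threshold that grows as the token's old-policy
    probability shrinks: tight for high-probability negative samples
    (stability, suppressing E3), loose for low-probability ones so they
    can be pushed down further (enhancing E4)."""
    return eps_base + beta * (1.0 - p_old)

def clip_lower(ratio, p_old):
    """Apply only the dynamic lower bound r >= 1 - eps_low(p_old)."""
    return max(ratio, 1.0 - dynamic_eps_low(p_old))
```

For a high-probability negative sample (`p_old = 0.9`) the ratio cannot fall far below 1, which keeps the softmax redistribution gentle; a rare negative sample (`p_old = 0.05`) can be suppressed much more aggressively.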
5. Three Strategies for Effective Entropy Control in Training
Our dynamic clipping mechanism enables us to design flexible entropy control strategies. The core idea is simple: maintain high entropy early in training to promote exploration, then gradually decrease it later to achieve stable convergence.
In general form, both bounds become functions of training time:

$$\text{clip}\big(r_t(\theta),\; 1-\varepsilon_{\text{low}}(s),\; 1+\varepsilon_{\text{high}}(s)\big)$$

Here, $\varepsilon_{\text{high}}(s)$ and $\varepsilon_{\text{low}}(s)$ are threshold functions that change dynamically with the training step $s$.
Strategy 1: Increase-then-Decrease (ID)
This strategy divides training into two phases, with the midpoint of the total training steps as the boundary:

$$\big(\varepsilon_{\text{high}}(s),\, \varepsilon_{\text{low}}(s)\big) = \begin{cases} \big(\varepsilon_{\text{high}}^{\text{dyn}},\; \varepsilon\big) & s \le S/2 \\ \big(\varepsilon,\; \varepsilon_{\text{low}}^{\text{dyn}}\big) & s > S/2 \end{cases}$$

where $S$ is the total number of training steps, $\varepsilon_{\text{high}}^{\text{dyn}}$ is the dynamic upper bound function, and $\varepsilon_{\text{low}}^{\text{dyn}}$ is the dynamic lower bound function.
- Phase 1: Use the dynamic upper bound to promote entropy growth, while the lower bound remains at the standard value.
- Phase 2: Restore the upper bound to the standard value and use the dynamic lower bound to promote entropy decrease.
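The two-phase switch can be sketched as a simple schedule function. The `"dynamic"` marker stands in for the per-token dynamic threshold functions described above; the fixed value `eps` is the standard threshold:

```python
def id_schedule(step, total_steps, eps=0.2):
    """Increase-then-Decrease: dynamic upper bound (entropy up) in the
    first half, dynamic lower bound (entropy down) in the second half.
    Returns (upper_mode, lower_mode), where "dynamic" means the
    probability-dependent threshold is active."""
    if step <= total_steps // 2:
        return ("dynamic", eps)   # phase 1: loosen upper bound per token
    return (eps, "dynamic")       # phase 2: loosen lower bound per token
```

The DID variant described next is the same idea with three segments instead of two.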
Strategy 2: Decrease-Increase-Decrease (DID)
The DID strategy allows entropy to first decrease naturally, then uses our mechanism to drive it back up before it collapses, and finally guides it toward convergence in the second half of training: standard thresholds in the first phase, the dynamic upper bound in the second, and the dynamic lower bound in the third.
Strategy 3: Oscillatory Decay (OD)
While ID and DID divide training into discrete phases, the OD strategy allows the model to autonomously oscillate and decay its entropy throughout the entire process.
We define a time-varying entropy threshold $\mathcal{H}_{\text{th}}(s)$ that decays over the course of training toward $\mathcal{H}_{\min}$, where $\mathcal{H}_{\min}$ is the target lower bound for entropy.

We then introduce a discrete state variable $g(s)$ (1 for entropy-increase mode, 0 for entropy-decrease mode) and control state transitions using hysteresis logic: the controller switches to entropy-increase mode when the measured entropy falls below $\mathcal{H}_{\text{th}}(s)$, switches back to entropy-decrease mode once entropy has risen sufficiently above the threshold, and otherwise keeps its previous state.

The clipping thresholds are then dynamically selected based on the current state: the dynamic upper bound $\varepsilon_{\text{high}}^{\text{dyn}}$ is active when $g(s) = 1$, and the dynamic lower bound $\varepsilon_{\text{low}}^{\text{dyn}}$ is active when $g(s) = 0$.
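A hysteresis controller of this shape can be sketched as below. The linear decay of the target and the band width `delta` are illustrative choices, not the paper's exact formulation:

```python
class OscillatoryDecayController:
    """Hysteresis switch between entropy-increase mode (state 1) and
    entropy-decrease mode (state 0) around a decaying entropy target."""

    def __init__(self, h_start, h_min, total_steps, delta=0.05):
        self.h_start, self.h_min = h_start, h_min
        self.total_steps, self.delta = total_steps, delta
        self.state = 0  # start in entropy-decrease mode

    def threshold(self, step):
        """Entropy target decaying linearly from h_start toward h_min."""
        frac = min(step / self.total_steps, 1.0)
        return self.h_min + (self.h_start - self.h_min) * (1.0 - frac)

    def update(self, step, entropy):
        th = self.threshold(step)
        if entropy < th:                  # fell below target: push entropy up
            self.state = 1
        elif entropy > th + self.delta:   # comfortably above: let it decay
            self.state = 0
        return self.state                 # inside the band: keep previous mode
```

The band `[th, th + delta]` is what prevents rapid mode flapping: inside it the controller simply holds whatever mode it was already in, producing the oscillate-and-decay pattern.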
6. Experimental Results
We conducted comprehensive experiments on Qwen2.5-Math-7B and Qwen2.5-7B using the DAPO-MATH dataset, evaluating performance on benchmarks such as AIME24, AIME25, AMC, MATH-500, GSM8K, and Olympiad.
Our three proposed strategies consistently outperformed GRPO, Clip-Higher, and other baseline methods across multiple benchmarks. The ID strategy, in particular, achieved state-of-the-art performance on AIME24, AMC, MATH-500, GSM8K, and Olympiad. For detailed results, please refer to our paper.
Analysis of Entropy and Reward Curves
The training curves reveal two key findings:
- Effective Entropy Regulation: All three strategies successfully produced their intended entropy patterns.
- Superior Late-Stage Performance: While our methods initially showed lower rewards, they significantly surpassed the baselines in later training stages, demonstrating more robust long-term learning.
Evaluation of Exploration Capability
We used the Pass@K metric to evaluate the model's exploration capability at 120 steps, a point where the model's training entropy was high. The results show that while all methods performed similarly on Pass@1, our methods demonstrated significantly better Pass@K performance as K increased. This indicates that our method preserves a richer set of potential solution strategies, preventing premature convergence.
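For readers unfamiliar with the metric, Pass@K is commonly computed with the unbiased estimator from the Codex paper (Chen et al., 2021); a minimal sketch:

```python
from math import comb

def pass_at_k(n, c, k):
    """Unbiased Pass@K: probability that at least one of k samples drawn
    (without replacement) from n generations, of which c are correct,
    solves the problem. Equals 1 - C(n-c, k) / C(n, k)."""
    if n - c < k:
        return 1.0  # fewer wrong samples than draws: a correct one is guaranteed
    return 1.0 - comb(n - c, k) / comb(n, k)
```

Two policies can tie at `pass_at_k(n, c, 1)` while differing sharply at larger K, which is exactly the gap that separates a policy with diverse solution strategies from a prematurely collapsed one.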
Analysis of Phase Proportions
For the ID and DID strategies, we analyzed the impact of different phase proportions. The results show that a balanced 50/50 split achieved optimal performance:
- Proportion too small (0.3/0.4): Entropy begins to decrease before reaching a sufficient peak.
- Proportion too large (0.6): Convergence in the second phase is too rapid, which hurts final performance.
7. Conclusion, Further Analysis, and Future Work
This paper systematically investigates policy entropy control in RLVR from the theoretical perspective of gradient-preserving clipping. Our main contributions are threefold:
- Theoretical Insight: We precisely characterized the mechanisms of four entropy-sensitive regions within the importance sampling ratio space.
- Regulation Mechanism: We designed an entropy regulation method based on dynamic clipping thresholds that can independently control entropy increase and decrease.
- Control Strategies: We proposed three effective entropy control strategies—ID, DID, and OD—that successfully mitigate entropy collapse and improve model performance.
We invite you to read the full paper on arXiv for more in-depth experimental analyses, including correlations of clipping thresholds, token clipping probabilities, and comparisons with fixed-threshold methods, all of which validate the effectiveness and necessity of the dynamic clipping mechanism.
This work provides new theoretical tools and practical methods for understanding and controlling the training dynamics of large models in reinforcement learning. Looking ahead, the next logical step is more adaptive entropy control that reacts to live training signals instead of following only a preset schedule. If you are comparing this paper with broader post-training practice, it pairs well with our LLM RL engineering guide, the GRPO pipeline walkthrough, and the broader Reinforcement Learning Hub.
References
- Shen et al. “On Entropy Control in LLM-RL Algorithms.” https://arxiv.org/abs/2509.03493
- Schulman et al. “Proximal Policy Optimization Algorithms.” https://arxiv.org/abs/1707.06347
- Yu et al. “DAPO: An Open-Source LLM Reinforcement Learning System at Scale.” https://arxiv.org/abs/2503.14476
- Wang et al. “On the Entropy Dynamics in Reinforcement Fine-Tuning of Large Language Models.” https://arxiv.org/abs/2602.03392