Reinforcement Learning Hub

Hub

Advanced LLM training techniques

Explore Reinforcement Learning Hub Hub
Technology

RL for LLMs: Full-Token vs Partial-Token Optimization

Compare full-token and partial-token optimization in RL for LLMs, from GRPO and DAPO to GSPO, SAPO, Beyond the 80/20 Rule, and STAPO.
Qing Ke Ai
41 min read
#RL for LLMs#RL4LLM#GRPO#DAPO#GSPO#STAPO

Source article by He Zeyu, Tsinghua University. Original Chinese article: WeChat.

RL for LLMs: From Full-Token to Partial-Token Optimization

In recent years, large language models such as OpenAI's o1 series, DeepSeek's R1, and the Qwen series have demonstrated remarkable gains in stability and reasoning, particularly in complex, long-chain tasks like mathematical proofs and code generation.

This progress is increasingly driven by Reinforcement Learning for Large Language Models (RL4LLM), a core technique in the post-training alignment stage.

Compared to earlier methods that relied on Supervised Fine-Tuning (SFT) or Direct Preference Optimization (DPO), RL4LLM provides training signals more directly correlated with final performance. It accomplishes this by constructing optimization objectives around the correctness of a result or the quality of task completion. This approach is also widely considered a key factor in developing the chain-of-thought reasoning capabilities of modern models.

Following foundational policy gradient methods, the field of reinforcement learning for large models has evolved along a distinct path. To systematically map this development, this article classifies these methods based on the 'scope of tokens involved in optimization':

One class of methods optimizes all tokens in a generated sequence, emphasizing the modeling of global consistency.

Another class uses optimization signals from only a select subset of tokens to reduce noise and improve training stability.

Based on this criterion, existing methods can be broadly categorized into two main directions: 'Full-Token Optimization' and 'Partial-Token Optimization.'

TL;DR

Evolutionary Path

To make the following formulas easier to follow, let's standardize the notation:

  • is the input question or prompt.
  • is the -th response sequence generated by the model, and is its length.
  • is the -th token in the -th response.
  • and represent the current policy and the reference (old) policy, respectively.
  • SVG Formula represents the probability ratio of generating a token between the new and old policies.
  • represents the sequence-level advantage function.
  • represents the number of candidate responses obtained through group sampling for the same input.
  • (and later in the text) represents the clipping threshold in the policy update.

Full-Token Optimization in RL4LLM: From GRPO to SAPO

The core idea behind full-token optimization is to use every single output token to update the model's policy. This path, starting with the influential GRPO algorithm, has gradually evolved through DAPO, GSPO, and SAPO. The overarching goal has been to continuously improve training efficiency and gradient stability without sacrificing optimization strength.

GRPO: A Lightweight Approach to Policy Optimization

GRPO Diagram

Paper: DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models

Introduction: GRPO marked a significant shift in the field, garnering substantial attention for its novel, resource-efficient approach.

Core Innovation: Previous reinforcement learning methods for LLM fine-tuning typically required separate networks to estimate value or rewards. GRPO cleverly sidesteps this by using a group sampling mechanism and a rule-based reward function. Its key innovation was eliminating the traditional critic model, which drastically reduces memory consumption.

SVG Formula

The trade-off, however, is that the diversity of positive and negative samples depends heavily on the intra-group sample size during the rollout phase. This can lead to lower sampling efficiency and potential training instability.

DAPO: Refining Policy with Asymmetric Clipping

DAPO Results

Paper: DAPO: An Open-Source LLM Reinforcement Learning System at Scale

Introduction: DAPO refines GRPO with several fine-grained improvements across three key areas.

Core Innovation: The algorithm introduces a global token coefficient to normalize and stabilize gradient fluctuations caused by varying sequence lengths within a group. It also adopts asymmetric clipping ( and ) to prevent entropy collapse during training. Finally, DAPO uses dynamic sampling (discarding invalid sample groups where rewards are all 0s or all 1s) to boost efficiency and incorporates a soft penalty for output length.

SVG Formula

GSPO: A Sequence-Level View for MoE Models

GSPO Results

Paper: Group Sequence Policy Optimization

Introduction: The GSPO algorithm was developed primarily to tackle the training-inference inconsistency found in Mixture-of-Experts (MoE) models.

Core Innovation: GSPO's main improvement lies in its treatment of the importance sampling coefficient. It swaps the token-level probability ratio for the sequence-level probability ratio of the entire sentence, helping to overcome the instability that can arise from focusing on individual tokens:

SVG Formula

Optimization Objective: Accordingly, GSPO's objective function replaces the original token-level ratio with the sequence-level ratio :

SVG Formula

This modification better aligns with the goal of decreasing perplexity (PPL) during pre-training and effectively mitigates the expert routing inconsistency between the Rollout Engine and the Model Engine in MoE architectures.

SAPO: Using a Soft Trust Region for Policy Updates

Diagram of Soft Gating Mechanism in SAPO

Paper: Soft Adaptive Policy Optimization

Introduction: SAPO strikes a balance between the model's exploitation and exploration by modifying the traditional hard clipping mechanism.

Core Innovation: SAPO ditches the rigid clip operation in favor of a token-level soft trust region (Soft Control):

SVG Formula

By designing asymmetric temperature parameters for positive and negative tokens (using when the advantage is positive, and otherwise), it achieves more fine-grained control over the policy updates.

SVG Formula

Partial-Token Optimization: A Targeted RL4LLM Strategy

While full-token algorithms were busy refining stability, the research community started asking a fundamental question: Are all tokens in a response equally important for policy updates? Which tokens truly drive the learning process?

The reality is, many tokens—like common connecting words—provide very little useful gradient information. Worse, some plausible-sounding but hallucinatory tokens can introduce destructive noise into the training process. This realization gave rise to the cutting-edge branch of Partial-Token Optimization, which filters tokens using an indicator function so that only a select subset contributes to the loss calculation.

Beyond the 80/20: High-Entropy Token Optimization

Beyond the 80/20 Diagram

Paper: Beyond the 80/20 Rule: High-Entropy Minority Tokens Drive Effective Reinforcement Learning for LLM Reasoning

Introduction: The core idea of Beyond the 80/20 (also known as 20-Entropy) is to allocate the update budget based on a token's information content, moving away from the traditional all-tokens-in approach.

Core Innovation: The key insight here is that not all tokens are created equal. Low-information tokens can dilute the gradient, weakening the learning signal. To address this, the method first calculates each token's entropy and then constructs a filter mask using a threshold , retaining only high-entropy tokens for the policy update.

SVG Formula

This 'filter-first, optimize-later' approach concentrates computational resources where they matter most, increasing the density of effective gradients and leading to a more stable optimization signal for the same computational budget.

STAPO: Silencing Spurious Tokens for Stable Training

STAPO Diagram

Paper: STAPO: Stabilizing Reinforcement Learning for LLMs by Silencing Rare Spurious Tokens

Introduction: Building on the principle of targeting high-value tokens, STAPO—a recent method from a team at Tsinghua University—adopts a more surgical approach. Its core principle is to identify and neutralize the small fraction of 'spurious' tokens that can destabilize the training process. By modifying as little as 0.01% of tokens, it has achieved state-of-the-art performance.

Core Innovation: STAPO is the first method to explicitly define 'spurious tokens': harmful tokens that appear in a sequence with a positive final reward but also exhibit a large gradient norm, low entropy, and low probability. In essence, they are confidently generated but incorrect tokens within an otherwise good response, and reinforcing them can be destructive. STAPO therefore proposes the Silencing Spurious Tokens (S2T) mechanism:

SVG Formula

Here, and are the probability and entropy thresholds. This criterion flags tokens that meet a trifecta of conditions: 'low probability + low entropy + positive advantage.' When an overall sequence is correct, but a specific token within it is both unlikely (low probability) and generated with high confidence (low entropy), S2T flags it as spurious. It then nullifies its contribution to the loss, preventing a single problematic token from corrupting the gradient update.

Optimization Objective:

SVG Formula

The key insight behind STAPO is its precision. Without disrupting the overall coherence of the generated text, it removes the influence of a tiny fraction of spurious tokens (as little as 0.01%). This leads to comprehensive improvements in policy entropy, stability, and convergence, ultimately achieving leading performance among similar algorithms.

Conclusion: The Future of RL4LLM Optimization

In summary, the evolution of RL4LLM methods reveals a clear trajectory from 'full-scale, uniform updates' to a more sophisticated phase of 'fine-grained updates based on token contribution.'

  • The first wave, represented by GRPO, DAPO, GSPO, and SAPO, focused on optimizing training stability and efficiency across all tokens.
  • The second wave, led by methods like 20-Entropy and STAPO, enhances training by concentrating on high-value tokens and suppressing spurious ones. This boosts effective gradient density, stabilizes policy entropy, and improves convergence.

This paradigm shift from 'full coverage' to 'fine-grained filtering' is proving to be a critical factor in improving LLM training efficiency and strengthening logical coherence. Future research will likely push this granularity further, perhaps by dynamically adjusting the optimization scope based on task complexity or even modeling the interdependencies between tokens to create more robust and efficient learning signals.

Further Reading

Explore More in Reinforcement Learning Hub

This article is part of our Reinforcement Learning series. Discover more insights and practical guides.

Visit Reinforcement Learning Hub

About This Article

Topic: Technology
Difficulty: Intermediate
Reading Time: 41 minutes
Last Updated: April 1, 2026

This article is part of our comprehensive guide to Large Language Models and AI technologies. Stay updated with the latest developments in the AI field.

All Articles
Share this article to spread LLM knowledge