What is the difference between full-token and partial-token optimization in RL for LLMs?

Full-token optimization updates the policy using every generated token in a response, while partial-token optimization filters the update set so only selected tokens contribute to the loss. The first emphasizes global stability, and the second tries to improve signal quality and reduce noisy gradients.

Why did RL4LLM methods move beyond GRPO?

GRPO established a lightweight, critic-free baseline, but later methods targeted concrete weaknesses such as length bias, entropy collapse, train-infer inconsistency, and noisy token-level gradients. DAPO, GSPO, SAPO, Beyond the 80/20 Rule, and STAPO each address a different part of that stability-efficiency tradeoff.

What does STAPO optimize differently from earlier RL4LLM methods?

STAPO identifies rare spurious tokens inside otherwise good responses and silences their contribution to the update. Instead of rewarding all tokens equally, it suppresses high-risk tokens that can destabilize training despite positive sequence-level rewards.

When should researchers care about partial-token optimization?

Partial-token optimization becomes especially relevant when token-level gradients are noisy, many tokens carry little learning signal, or the model suffers from entropy collapse and unstable convergence. In those cases, selective updates can improve efficiency and robustness.

Source article by He Zeyu, Tsinghua University. Original Chinese article: WeChat.

RL for LLMs: From Full-Token to Partial-Token Optimization

In recent years, large language models such as OpenAI's o1 series, DeepSeek's R1, and the Qwen series have demonstrated remarkable gains in stability and reasoning, particularly in complex, long-chain tasks like mathematical proofs and code generation.

This progress is increasingly driven by Reinforcement Learning for Large Language Models (RL4LLM), a core technique in the post-training alignment stage.

Compared to earlier methods that relied on Supervised Fine-Tuning (SFT) or Direct Preference Optimization (DPO), RL4LLM provides training signals more directly correlated with final performance. It accomplishes this by constructing optimization objectives around the correctness of a result or the quality of task completion. This approach is also widely considered a key factor in developing the chain-of-thought reasoning capabilities of modern models.

Following foundational policy gradient methods, the field of reinforcement learning for large models has evolved along a distinct path. To systematically map this development, this article classifies these methods based on the 'scope of tokens involved in optimization':

One class of methods optimizes all tokens in a generated sequence, emphasizing the modeling of global consistency.

Another class uses optimization signals from only a select subset of tokens to reduce noise and improve training stability.

Based on this criterion, existing methods can be broadly categorized into two main directions: 'Full-Token Optimization' and 'Partial-Token Optimization.'

TL;DR

Full-token optimization updates every generated token and evolves from GRPO to DAPO, GSPO, and SAPO.
Partial-token optimization filters for the tokens that matter most, as seen in Beyond the 80/20 Rule and STAPO.
The key tradeoff is straightforward: full-token methods prioritize global stability, while partial-token methods improve signal density and reduce noisy gradients.
For broader context, continue with LLM Reinforcement Learning (RL): REINFORCE, PPO, GRPO, and Production Engineering, Flexible Entropy Control in RLVR: Fixing Policy Entropy Collapse with Dynamic Clipping, the Token Calculator, and the Reinforcement Learning Hub.

Evolutionary Path

To make the following formulas easier to follow, let's standardize the notation:

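$q$ is the input question or prompt.
$o_i$ is the $i$-th response sequence generated by the model, and $|o_i|$ is its length.
$o_{i,t}$ is the $t$-th token in the $i$-th response.
$\pi_\theta$ and $\pi_{\theta_{\text{old}}}$ represent the current policy and the reference (old) policy, respectively.
$r_{i,t}(\theta) = \pi_\theta(o_{i,t} \mid q, o_{i,<t}) / \pi_{\theta_{\text{old}}}(o_{i,t} \mid q, o_{i,<t})$ represents the probability ratio of generating a token between the new and old policies.
$\hat{A}_i$ represents the sequence-level advantage function.
$G$ represents the number of candidate responses obtained through group sampling for the same input.
$\varepsilon$ (and $\varepsilon_{\text{low}}$, $\varepsilon_{\text{high}}$ later in the text) represents the clipping threshold in the policy update.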

Full-Token Optimization in RL4LLM: From GRPO to SAPO

The core idea behind full-token optimization is to use every single output token to update the model's policy. This path, starting with the influential GRPO algorithm, has gradually evolved through DAPO, GSPO, and SAPO. The overarching goal has been to continuously improve training efficiency and gradient stability without sacrificing optimization strength.

GRPO: A Lightweight Approach to Policy Optimization

GRPO Diagram

Paper: DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models

Introduction: GRPO marked a significant shift in the field, garnering substantial attention for its novel, resource-efficient approach.

Core Innovation: Previous reinforcement learning methods for LLM fine-tuning typically required separate networks to estimate value or rewards. GRPO sidesteps this by sampling a group of responses for each prompt, scoring them with a rule-based reward function, and using the group's own reward statistics as the baseline. Its key innovation was eliminating the traditional critic model, which drastically reduces memory consumption.

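$$
\mathcal{J}_{\text{GRPO}}(\theta) = \mathbb{E}_{q,\ \{o_i\}_{i=1}^{G} \sim \pi_{\theta_{\text{old}}}(\cdot \mid q)} \left[ \frac{1}{G} \sum_{i=1}^{G} \frac{1}{|o_i|} \sum_{t=1}^{|o_i|} \min\left( r_{i,t}(\theta)\, \hat{A}_i,\ \operatorname{clip}\left( r_{i,t}(\theta),\, 1-\varepsilon,\, 1+\varepsilon \right) \hat{A}_i \right) \right]
$$

$$
\hat{A}_i = \frac{R_i - \operatorname{mean}\left(\{R_j\}_{j=1}^{G}\right)}{\operatorname{std}\left(\{R_j\}_{j=1}^{G}\right)}
$$

Here $R_i$ is the reward assigned to response $o_i$. The KL regularization term of the original paper is omitted for brevity.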

The trade-off, however, is that the diversity of positive and negative samples depends heavily on the intra-group sample size during the rollout phase. This can lead to lower sampling efficiency and potential training instability.
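To make the group-relative baseline concrete, here is a minimal sketch of how sequence-level advantages could be computed from one group's rewards. The function name, the example rewards, the use of the population standard deviation, and the small stabilizing constant `eps` are illustrative assumptions rather than details taken from the paper.

```python
from statistics import mean, pstdev

def group_relative_advantages(rewards, eps=1e-6):
    """Normalize each reward against its own sampling group (no critic needed).

    rewards: one scalar reward per sampled response to the same prompt.
    eps:     assumed small constant to avoid division by zero.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std over the G responses (an assumption)
    return [(r - mu) / (sigma + eps) for r in rewards]

# Hypothetical rule-based rewards for G = 4 responses to one prompt.
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])

# A group where every response gets the same reward yields zero advantage
# everywhere, so it contributes no gradient signal at all.
degenerate = group_relative_advantages([1.0, 1.0, 1.0, 1.0])
```

The degenerate case shows why the quality of the signal depends on the group containing both good and bad samples.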

DAPO: Refining Policy with Asymmetric Clipping

DAPO Results

Paper: DAPO: An Open-Source LLM Reinforcement Learning System at Scale

Introduction: DAPO refines GRPO with fine-grained improvements in four areas.

Core Innovation: First, the algorithm normalizes the loss with a global token coefficient (the total number of tokens in the group), which stabilizes gradient fluctuations caused by varying sequence lengths within a group. Second, it adopts asymmetric clipping ($\varepsilon_{\text{low}}$ and $\varepsilon_{\text{high}}$, with a larger upper bound) to prevent entropy collapse during training. Third, DAPO uses dynamic sampling (discarding invalid sample groups where rewards are all 0s or all 1s) to boost efficiency. Finally, it incorporates a soft penalty for output length.

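$$
\mathcal{J}_{\text{DAPO}}(\theta) = \mathbb{E}_{q,\ \{o_i\}_{i=1}^{G} \sim \pi_{\theta_{\text{old}}}(\cdot \mid q)} \left[ \frac{1}{\sum_{i=1}^{G} |o_i|} \sum_{i=1}^{G} \sum_{t=1}^{|o_i|} \min\left( r_{i,t}(\theta)\, \hat{A}_i,\ \operatorname{clip}\left( r_{i,t}(\theta),\, 1-\varepsilon_{\text{low}},\, 1+\varepsilon_{\text{high}} \right) \hat{A}_i \right) \right]
$$

$$
\text{s.t.} \quad 0 < \left| \{ o_i : o_i \text{ is correct} \} \right| < G
$$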
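The sketch below illustrates three of these ideas in plain Python: asymmetric clipping, normalization by the total token count of the group, and the dynamic sampling filter. The function names and the default values for `eps_low` and `eps_high` are assumptions chosen for illustration, and the soft length penalty is left out.

```python
def clip(x, lo, hi):
    """Clamp x into the closed interval [lo, hi]."""
    return max(lo, min(hi, x))

def dapo_token_objective(ratios, advantages, eps_low=0.2, eps_high=0.28):
    """Clipped surrogate averaged over ALL tokens in the group.

    ratios:     list of per-response lists of token ratios r_{i,t}.
    advantages: one sequence-level advantage per response.
    eps_low / eps_high: assumed illustrative clipping thresholds.
    """
    total, n_tokens = 0.0, 0
    for seq_ratios, adv in zip(ratios, advantages):
        for r in seq_ratios:
            clipped = clip(r, 1.0 - eps_low, 1.0 + eps_high)
            total += min(r * adv, clipped * adv)
            n_tokens += 1
    # Dividing by the group's total token count (not per sequence) means
    # long and short responses weigh each token equally.
    return total / n_tokens

def keep_group(rewards):
    """Dynamic sampling: keep a group only if its rewards are not all identical."""
    return len(set(rewards)) > 1

# Hypothetical group: a 2-token response with positive advantage and a
# 1-token response with negative advantage.
value = dapo_token_objective([[1.5, 1.0], [0.5]], [1.0, -1.0])
```

With these inputs, the ratio 1.5 is capped at 1.28 and the ratio 0.5 is floored at 0.8 before the three token terms are averaged.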

GSPO: A Sequence-Level View for MoE Models

GSPO Results

Paper: Group Sequence Policy Optimization

Introduction: The GSPO algorithm was developed primarily to tackle the training-inference inconsistency found in Mixture-of-Experts (MoE) models.

Core Innovation: GSPO's main improvement lies in its treatment of the importance sampling coefficient. It swaps the token-level probability ratio for the sequence-level probability ratio of the entire sentence, helping to overcome the instability that can arise from focusing on individual tokens:

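$$
s_i(\theta) = \left( \frac{\pi_\theta(o_i \mid q)}{\pi_{\theta_{\text{old}}}(o_i \mid q)} \right)^{\frac{1}{|o_i|}} = \exp\left( \frac{1}{|o_i|} \sum_{t=1}^{|o_i|} \log \frac{\pi_\theta(o_{i,t} \mid q, o_{i,<t})}{\pi_{\theta_{\text{old}}}(o_{i,t} \mid q, o_{i,<t})} \right)
$$

Optimization Objective: Accordingly, GSPO's objective function replaces the original token-level ratio $r_{i,t}(\theta)$ with the sequence-level ratio $s_i(\theta)$:

$$
\mathcal{J}_{\text{GSPO}}(\theta) = \mathbb{E}_{q,\ \{o_i\}_{i=1}^{G} \sim \pi_{\theta_{\text{old}}}(\cdot \mid q)} \left[ \frac{1}{G} \sum_{i=1}^{G} \min\left( s_i(\theta)\, \hat{A}_i,\ \operatorname{clip}\left( s_i(\theta),\, 1-\varepsilon,\, 1+\varepsilon \right) \hat{A}_i \right) \right]
$$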

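Because GSPO normalizes the sequence-level ratio by response length, the ratio equals the old policy's perplexity on the response divided by the current policy's perplexity. The modification therefore aligns with the goal of decreasing perplexity (PPL) during pre-training, and it effectively mitigates the expert routing inconsistency between the Rollout Engine and the Model Engine in MoE architectures.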
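As a rough illustration, a length-normalized sequence-level ratio can be computed from per-token log-probabilities as the geometric mean of the token ratios. The function names and the default `eps` are assumptions made for this sketch; the clipping range in particular is a placeholder, not a recommended value.

```python
import math

def sequence_ratio(new_logps, old_logps):
    """Length-normalized sequence-level importance ratio.

    new_logps / old_logps: per-token log-probabilities of the same response
    under the current and old policies.
    """
    n = len(new_logps)
    mean_log_ratio = sum(a - b for a, b in zip(new_logps, old_logps)) / n
    return math.exp(mean_log_ratio)

def gspo_term(new_logps, old_logps, advantage, eps=0.2):
    """Clipped surrogate for ONE response, using the sequence-level ratio."""
    s = sequence_ratio(new_logps, old_logps)
    clipped = max(1.0 - eps, min(1.0 + eps, s))
    return min(s * advantage, clipped * advantage)

# Hypothetical response with two tokens whose token-level ratios are 2.0 and 0.5.
new_lp = [math.log(0.5), math.log(0.5)]
old_lp = [math.log(0.25), math.log(1.0)]
s = sequence_ratio(new_lp, old_lp)
```

In this example the two token-level ratios are individually far from 1, yet they cancel at the sequence level, which is the kind of per-token fluctuation the sequence view smooths out.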

SAPO: Using a Soft Trust Region for Policy Updates

Diagram of Soft Gating Mechanism in SAPO

Paper: Soft Adaptive Policy Optimization

Introduction: SAPO strikes a balance between the model's exploitation and exploration by modifying the traditional hard clipping mechanism.

Core Innovation: SAPO ditches the rigid clip operation in favor of a token-level soft trust region (Soft Control):

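$$
\mathcal{J}_{\text{SAPO}}(\theta) = \mathbb{E}_{q,\ \{o_i\}_{i=1}^{G} \sim \pi_{\theta_{\text{old}}}(\cdot \mid q)} \left[ \frac{1}{G} \sum_{i=1}^{G} \frac{1}{|o_i|} \sum_{t=1}^{|o_i|} f_{i,t}\left( r_{i,t}(\theta) \right) \hat{A}_i \right], \qquad f_{i,t}(x) = \frac{4}{\tau_{i,t}}\, \sigma\left( \tau_{i,t} (x - 1) \right)
$$

By designing asymmetric temperature parameters for positive and negative tokens (using $\tau_{\text{pos}}$ when the advantage is positive, and $\tau_{\text{neg}}$ otherwise), it achieves more fine-grained control over the policy updates.

$$
\tau_{i,t} = \begin{cases} \tau_{\text{pos}}, & \hat{A}_i > 0 \\ \tau_{\text{neg}}, & \text{otherwise} \end{cases}
$$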
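To show what a soft trust region looks like in practice, the sketch below assumes a sigmoid-shaped gate scaled so that its slope is exactly 1 when the ratio equals 1. That functional form, the function names, and the default temperatures are assumptions for illustration and should be checked against the paper before use.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def pick_tau(advantage, tau_pos, tau_neg):
    """Asymmetric temperature: one value for positive advantages, one otherwise."""
    return tau_pos if advantage > 0 else tau_neg

def soft_gate(ratio, advantage, tau_pos=1.0, tau_neg=1.05):
    """Assumed smooth replacement for clip(ratio, 1 - eps, 1 + eps)."""
    tau = pick_tau(advantage, tau_pos, tau_neg)
    return (4.0 / tau) * sigmoid(tau * (ratio - 1.0))

def gate_weight(ratio, advantage, tau_pos=1.0, tau_neg=1.05):
    """Derivative of soft_gate with respect to ratio.

    It equals 1 at ratio == 1 and decays smoothly as the ratio drifts away,
    instead of dropping abruptly to 0 as with hard clipping.
    """
    tau = pick_tau(advantage, tau_pos, tau_neg)
    s = sigmoid(tau * (ratio - 1.0))
    return 4.0 * s * (1.0 - s)

on_policy = gate_weight(1.0, advantage=1.0)
off_policy = gate_weight(2.0, advantage=1.0)
```

A larger temperature makes the weight fall off faster, so choosing a larger value for negative-advantage tokens damps their updates more aggressively than those of positive-advantage tokens.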

Partial-Token Optimization: A Targeted RL4LLM Strategy

While full-token algorithms were busy refining stability, the research community started asking a fundamental question: Are all tokens in a response equally important for policy updates? Which tokens truly drive the learning process?

The reality is, many tokens—like common connecting words—provide very little useful gradient information. Worse, some plausible-sounding but hallucinatory tokens can introduce destructive noise into the training process. This realization gave rise to the cutting-edge branch of Partial-Token Optimization, which filters tokens using an indicator function so that only a select subset contributes to the loss calculation.

Beyond the 80/20: High-Entropy Token Optimization

Beyond the 80/20 Diagram

Paper: Beyond the 80/20 Rule: High-Entropy Minority Tokens Drive Effective Reinforcement Learning for LLM Reasoning

Introduction: The core idea of Beyond the 80/20 (also known as 20-Entropy) is to allocate the update budget based on a token's information content, moving away from the traditional all-tokens-in approach.

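Core Innovation: The key insight here is that not all tokens are created equal. Low-information tokens can dilute the gradient, weakening the learning signal. To address this, the method first calculates each token's entropy $H_{i,t}$ and then constructs a filter mask using a threshold $\tau_\rho$, retaining only high-entropy tokens (roughly the top 20%, hence the method's name) for the policy update.

$$
H_{i,t} = -\sum_{v \in \mathcal{V}} \pi_\theta(v \mid q, o_{i,<t}) \log \pi_\theta(v \mid q, o_{i,<t}), \qquad \mathbb{I}_{i,t} = \mathbb{I}\left[ H_{i,t} \ge \tau_\rho \right]
$$

Here $\mathcal{V}$ is the vocabulary, and each token's term in the loss is multiplied by $\mathbb{I}_{i,t}$.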

This 'filter-first, optimize-later' approach concentrates computational resources where they matter most, increasing the density of effective gradients and leading to a more stable optimization signal for the same computational budget.
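The filtering step can be sketched in a few lines: compute an entropy per token position, pick the threshold that keeps the top fraction, and build a 0/1 mask. The function names, the example entropies, and the choice to compute the threshold over a single flat list are assumptions made for illustration.

```python
import math

def token_entropy(probs):
    """Shannon entropy (in nats) of one next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def high_entropy_mask(entropies, keep_fraction=0.2):
    """Return 1 for tokens whose entropy is in the top keep_fraction, else 0.

    Ties at the threshold are all kept, so slightly more than
    keep_fraction of the tokens may pass.
    """
    k = max(1, round(keep_fraction * len(entropies)))
    threshold = sorted(entropies, reverse=True)[k - 1]
    return [1 if h >= threshold else 0 for h in entropies]

# Hypothetical per-token entropies for a 10-token response.
entropies = [0.01, 0.02, 1.5, 0.03, 0.05, 0.9, 0.02, 0.01, 0.04, 0.03]
mask = high_entropy_mask(entropies)
```

Only the two positions where the model was genuinely uncertain survive the mask; the remaining eight tokens contribute nothing to the policy update.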

STAPO: Silencing Spurious Tokens for Stable Training

STAPO Diagram

Paper: STAPO: Stabilizing Reinforcement Learning for LLMs by Silencing Rare Spurious Tokens

Introduction: Building on the principle of targeting high-value tokens, STAPO—a recent method from a team at Tsinghua University—adopts a more surgical approach. Its core principle is to identify and neutralize the small fraction of 'spurious' tokens that can destabilize the training process. By modifying as little as 0.01% of tokens, it has achieved state-of-the-art performance.

Core Innovation: STAPO is the first method to explicitly define 'spurious tokens': harmful tokens that appear in a sequence with a positive final reward but also exhibit a large gradient norm, low entropy, and low probability. In essence, they are confidently generated but incorrect tokens within an otherwise good response, and reinforcing them can be destructive. STAPO therefore proposes the Silencing Spurious Tokens (S2T) mechanism:

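$$
\mathbb{I}^{\text{S2T}}_{i,t} = \begin{cases} 0, & \hat{A}_i > 0,\ \pi_{i,t} < \tau_p,\ H_{i,t} < \tau_h \\ 1, & \text{otherwise} \end{cases}
$$

Here, $\pi_{i,t}$ is the probability the policy assigns to the sampled token $o_{i,t}$, $H_{i,t}$ is the entropy of the policy's distribution at that position, and $\tau_p$ and $\tau_h$ are the probability and entropy thresholds. This criterion flags tokens that meet a trifecta of conditions: 'low probability + low entropy + positive advantage.' When an overall sequence is correct, but a specific token within it is both unlikely (low probability) and generated with high confidence (low entropy), S2T flags it as spurious. It then nullifies its contribution to the loss, preventing a single problematic token from corrupting the gradient update.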

Optimization Objective:

SVG Formula

The key insight behind STAPO is its precision. Without disrupting the overall coherence of the generated text, it removes the influence of a tiny fraction of spurious tokens (as little as 0.01%). This leads to comprehensive improvements in policy entropy, stability, and convergence, ultimately achieving leading performance among similar algorithms.
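The three-way criterion described above (low probability, low entropy, positive advantage) can be sketched as a simple mask. The function name and the threshold defaults `tau_p` and `tau_h` are hypothetical placeholders, not values from the paper.

```python
def s2t_mask(token_probs, token_entropies, advantage, tau_p=0.01, tau_h=0.1):
    """Return 1 to keep a token in the loss and 0 to silence it.

    token_probs:     probability the policy gave each sampled token.
    token_entropies: entropy of the policy's distribution at each position.
    advantage:       sequence-level advantage of the whole response.
    tau_p, tau_h:    assumed probability and entropy thresholds.
    """
    mask = []
    for p, h in zip(token_probs, token_entropies):
        spurious = advantage > 0 and p < tau_p and h < tau_h
        mask.append(0 if spurious else 1)
    return mask

# Hypothetical 3-token response. The second token was unlikely even though
# the model was confident at that position; the third was unlikely at an
# uncertain position, which counts as ordinary exploration.
probs = [0.9, 0.005, 0.005]
entropies = [0.05, 0.08, 2.0]

positive_case = s2t_mask(probs, entropies, advantage=1.0)
negative_case = s2t_mask(probs, entropies, advantage=-1.0)
```

Only the second token is silenced, and only when the response as a whole is rewarded; with a negative advantage every token keeps its contribution.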

Conclusion: The Future of RL4LLM Optimization

In summary, the evolution of RL4LLM methods reveals a clear trajectory from 'full-scale, uniform updates' to a more sophisticated phase of 'fine-grained updates based on token contribution.'

The first wave, represented by GRPO, DAPO, GSPO, and SAPO, focused on optimizing training stability and efficiency across all tokens.
The second wave, led by methods like 20-Entropy and STAPO, enhances training by concentrating on high-value tokens and suppressing spurious ones. This boosts effective gradient density, stabilizes policy entropy, and improves convergence.

This paradigm shift from 'full coverage' to 'fine-grained filtering' is proving to be a critical factor in improving LLM training efficiency and strengthening logical coherence. Future research will likely push this granularity further, perhaps by dynamically adjusting the optimization scope based on task complexity or even modeling the interdependencies between tokens to create more robust and efficient learning signals.


Further Reading

LLM Reinforcement Learning (RL): REINFORCE, PPO, GRPO, and Production Engineering

Flexible Entropy Control in RLVR: Fixing Policy Entropy Collapse with Dynamic Clipping

Token Calculator

Reinforcement Learning Hub
