LLM Reinforcement Learning (RL): REINFORCE, PPO, GRPO, and Production Engineering

A practical LLM Reinforcement Learning guide covering REINFORCE to PPO/GRPO derivations, plus production engineering patterns like async rollouts, importance sampling, and token-stream stability.
Chi Guo Dong Bu Tu Guo Dong Pi
Tags: LLM Reinforcement Learning · Reinforcement Learning for LLMs · GRPO · PPO · RLHF · LLM post-training

As a researcher in LLM Reinforcement Learning (RL), I have spent the past year immersed in its practical challenges and theoretical underpinnings.

Ever since the release of DeepSeek-R1 in early '25, LLM RL has become a major focus of research. In its wake, a family of PPO-style methods, building on improvements from GRPO, has proliferated. Researchers from all corners of the field are now trying to apply RL to push the boundaries of model performance.

I entered the LLM space in early '25 with limited reinforcement learning experience. Since then, I have revisited the theoretical derivations multiple times and gained hands-on experience with practical LLM Reinforcement Learning (RL), giving me a deep appreciation for the engineering hurdles involved. In this post, I will distill these learnings to guide beginners through both the theory of LLM RL and the key engineering practices from the ground up.


The Theory of Reinforcement Learning: From REINFORCE to PPO

Reinforcement learning is a paradigm of machine learning where an agent learns by doing. The core idea is simple: an agent continuously interacts with an environment, using the feedback it receives to adjust its behavior and improve its ability to solve a given problem.

This process mirrors how we humans learn new things, which gives RL powerful generalization capabilities in theory. The catch? Unlike traditional supervised learning, getting reinforcement learning models to converge can be a real challenge. Everything from the accuracy of the reward signal and the stability of parameter updates to the consistency of the environment can impact whether the training succeeds.

RL methods are typically split into two camps: Value-based and Policy-based, distinguished by what the model learns to predict. Given the massive action space of LLMs—essentially, all possible text—value-based methods are impractical. That's why we'll be focusing squarely on Policy-based approaches. (For those who want to learn RL from the very beginning, I highly recommend Professor Zhao Shiyu's course, "Mathematical Principles of Reinforcement Learning" [1]).

To make things easier to follow, let's walk through the derivation of the most classic Policy-based algorithm: REINFORCE. First, the objective function for REINFORCE is to maximize the expected reward of all possible trajectories under the current policy:

$$J(\theta) \;=\; \mathbb{E}_{\tau \sim \pi_\theta}\!\left[R(\tau)\right]$$

where $\tau$ is a trajectory sampled from the current policy $\pi_\theta$ and $R(\tau)$ is its total reward.

This aligns perfectly with our intuition for RL: the higher the expected reward of the trajectories a policy generates, the better the policy. So, when we use gradient ascent to update the model's parameters, our formula looks like this:

$$\theta \;\leftarrow\; \theta + \alpha\, \nabla_\theta J(\theta)$$

Next, let's solve for the gradient:

$$
\begin{aligned}
\nabla_\theta J(\theta) &= \nabla_\theta\, \mathbb{E}_{\tau \sim \pi_\theta}\!\left[R(\tau)\right] &&(1)\\
&= \nabla_\theta \sum_{\tau} P(\tau;\theta)\, R(\tau) &&(2)\\
&= \sum_{\tau} \nabla_\theta P(\tau;\theta)\, R(\tau) &&(3)\\
&= \sum_{\tau} P(\tau;\theta)\, \nabla_\theta \log P(\tau;\theta)\, R(\tau) &&(4)\\
&= \sum_{\tau} P(\tau;\theta) \left(\sum_{t} \nabla_\theta \log \pi_\theta(a_t \mid s_t)\right) R(\tau) &&(5)\\
&= \mathbb{E}_{\tau \sim \pi_\theta}\!\left[\sum_{t} \nabla_\theta \log \pi_\theta(a_t \mid s_t)\, R(\tau)\right] &&(6)
\end{aligned}
$$

Here's a breakdown of the steps:

  • (1) -> (2): We expand the expectation according to its definition.
  • (3) -> (4): We use the common "log-derivative trick" ($\nabla_\theta P = P\,\nabla_\theta \log P$) to introduce a logarithm.
  • (4) -> (5): The trajectory probability factorizes into policy terms and environment dynamics; the dynamics do not depend on $\theta$, so only the per-step policy log-probabilities survive the gradient.
  • (5) -> (6): We rewrite the expression back into the form of an expectation.
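The final expectation is what makes REINFORCE practical: it can be estimated by sampling trajectories. As a minimal Python sketch (toy numbers, no autograd framework; the function name is my own), the surrogate loss whose gradient matches this estimator is:

```python
import math

def reinforce_loss(log_probs, trajectory_reward):
    """Surrogate loss for one sampled trajectory:
    -R(tau) * sum_t log pi(a_t | s_t).
    Minimizing it with an autograd framework performs
    gradient ascent on the expected reward."""
    return -trajectory_reward * sum(log_probs)

# Toy trajectory: three sampled tokens with probabilities 0.5, 0.25, 0.8,
# and a scalar reward of 2.0 for the whole trajectory.
log_probs = [math.log(0.5), math.log(0.25), math.log(0.8)]
loss = reinforce_loss(log_probs, 2.0)
```

A positive reward makes the loss decrease as the sampled tokens become more likely, which is exactly the "reinforce what worked" intuition.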

The evolution of the REINFORCE algorithm eventually led to Actor-Critic, TRPO, and PPO. I won't detail the specific derivations here, but if you're interested, this blog post [2] offers a great explanation. Here is the PPO formula (with the clipping term omitted for simplicity; see the PPO paper [3] for full details):

$$J^{PPO}(\theta) \;=\; \mathbb{E}_{t}\!\left[\frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{\text{old}}}(a_t \mid s_t)}\,\hat{A}_t\right]$$
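For intuition, here is a minimal Python sketch of the per-token objective with the clipping term restored (the `eps` value and the toy numbers are illustrative):

```python
import math

def ppo_clipped_term(logp_new, logp_old, advantage, eps=0.2):
    """Per-token clipped PPO objective: min(r*A, clip(r, 1-eps, 1+eps)*A)."""
    ratio = math.exp(logp_new - logp_old)  # pi_theta / pi_theta_old
    clipped = max(1.0 - eps, min(1.0 + eps, ratio))
    return min(ratio * advantage, clipped * advantage)

# The new policy overshoots (ratio = 1.5) on a positive-advantage token,
# so the clip caps the incentive at (1 + eps) * A = 1.2.
term = ppo_clipped_term(logp_new=0.0, logp_old=-math.log(1.5), advantage=1.0)
```

The `min` with the clipped ratio is what keeps each update from moving the policy too far from the sampling policy.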

From Traditional RL to LLM Reinforcement Learning

Making the leap from traditional reinforcement learning to LLM RL doesn't require learning a whole new set of theories. It's more about mapping the familiar concepts onto the world of language models:

  • Agent (policy) → the LLM itself
  • State → the prompt plus the tokens generated so far
  • Action → emitting the next token
  • Trajectory → a complete generated response
  • Reward → a score assigned to the full response (by a reward model or a verifier)
  • Environment → the task that supplies prompts and feedback

The release of DeepSeek-R1 in early '25 really put the GRPO algorithm on the map. GRPO's key innovation is to calculate the advantage for each trajectory using relative rewards within a group of responses. This elegantly sidesteps the need for the separate Critic and Reward models that PPO relies on, which in turn dramatically reduces the GPU memory required for LLM-RL. This efficiency gain helped make RLVR (Reinforcement Learning with Verifiable Reward) one of the hottest topics of '25. The GRPO algorithm's formula is:

$$
\mathcal{J}_{\text{GRPO}}(\theta) = \mathbb{E}\!\left[\frac{1}{G}\sum_{i=1}^{G}\frac{1}{|o_i|}\sum_{t=1}^{|o_i|}\Big(\min\big(r_{i,t}(\theta)\,\hat{A}_{i,t},\ \mathrm{clip}\big(r_{i,t}(\theta),\,1-\varepsilon,\,1+\varepsilon\big)\,\hat{A}_{i,t}\big)-\beta\,\mathbb{D}_{\text{KL}}\big[\pi_\theta\,\|\,\pi_{\text{ref}}\big]\Big)\right]
$$

where $r_{i,t}(\theta)=\frac{\pi_\theta(o_{i,t}\mid q,\,o_{i,<t})}{\pi_{\theta_{\text{old}}}(o_{i,t}\mid q,\,o_{i,<t})}$ and the group-relative advantage is $\hat{A}_{i,t}=\frac{r_i-\mathrm{mean}(\{r_1,\dots,r_G\})}{\mathrm{std}(\{r_1,\dots,r_G\})}$.
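The group-relative advantage at the heart of GRPO is just a per-group z-score of the rewards. A minimal Python sketch (the function name and the `eps` stabilizer are my own):

```python
def group_relative_advantages(rewards, eps=1e-6):
    """GRPO advantage: z-score each reward within its group of G responses."""
    mean = sum(rewards) / len(rewards)
    var = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    std = var ** 0.5
    return [(r - mean) / (std + eps) for r in rewards]

# Four sampled answers to one prompt, scored by a verifier (1 = correct).
advs = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
```

Because the baseline is the group mean, no learned Critic is needed: correct answers get positive advantage, incorrect ones negative, by construction.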

Engineering Challenges in Production LLM RL

At a high level, ignoring GPU memory and runtime constraints, the LLM Reinforcement Learning loop appears straightforward: the model samples data and receives rewards, a forward pass calculates log probabilities, and a backward pass updates the model.

However, the reality is more complex: the sampling (rollout) step is exceptionally slow and resource-intensive, and it is the key bottleneck in applying RL to LLMs.

This bottleneck is why most production-grade RL frameworks adopt a decoupled architecture, separating the training and inference (sampling) pipelines. The training framework might use a distributed setup like FSDP or Megatron, while the inference framework leverages tools like vLLM or SGLang to maximize sampling throughput.

Common LLM RL Frameworks

The open-source community has produced many excellent RL frameworks, including verl, AReaL, slime, and AgentRL, to name a few.

As we've touched on, these frameworks almost universally use a decoupled architecture, though their specific implementations vary significantly. Many excellent posts already offer detailed comparisons (see framework comparison analyses), so I won't rehash them here. Instead, I'll focus on what I consider to be some of the most critical engineering improvements in modern RL.

Asynchronous Acceleration for Faster Rollouts

The rollout process in LLM RL is a significant time commitment. In a strictly on-policy setup, the model must wait for rollouts to complete before it can train. This sequential process leaves expensive GPUs idle and severely slows down the entire operation.

To circumvent this, most RL frameworks have introduced asynchronous rollout logic, often implemented using Python's asyncio to manage a buffer queue. The rollout engine continuously samples trajectories and pushes them into the buffer, while the actor and reference models pull batches of samples from this buffer to perform their forward and backward passes.

This pipelined approach ensures that with proper resource allocation, the GPUs remain utilized. Detailed explanations can be found in the papers for AgentRL [4] and AReal [5]. However, this introduces a trade-off: an asynchronously accelerated algorithm is no longer strictly on-policy. These off-policy elements can sometimes impede or destabilize convergence.

To mitigate this, frameworks employ a two-pronged strategy. First, they often include a staleness parameter or limit the buffer size to ensure the training data is not too many versions behind the current policy. Second, they apply a mathematical correction using importance sampling.
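As a rough illustration of this pattern (all names, queue sizes, and the staleness bound here are hypothetical sketches, not any specific framework's API), the producer/consumer loop might look like:

```python
import asyncio

MAX_STALENESS = 4  # train only on samples at most 4 policy versions old

async def rollout_worker(buffer, get_policy_version, n_samples=8):
    """Continuously sample trajectories and tag them with the policy version."""
    for i in range(n_samples):
        await asyncio.sleep(0)  # stand-in for slow autoregressive generation
        await buffer.put({"tokens": [i], "version": get_policy_version()})

async def trainer(buffer, policy_version, steps=4, batch_size=2):
    """Pull batches from the buffer, dropping samples that are too stale."""
    batches = []
    for _ in range(steps):
        batch = []
        while len(batch) < batch_size:
            sample = await buffer.get()
            if policy_version[0] - sample["version"] <= MAX_STALENESS:
                batch.append(sample)  # fresh enough to train on
        batches.append(batch)
        policy_version[0] += 1  # one optimizer step per batch
    return batches

async def main():
    buffer = asyncio.Queue(maxsize=16)  # bounded buffer also caps staleness
    version = [0]
    producer = asyncio.create_task(rollout_worker(buffer, lambda: version[0]))
    batches = await trainer(buffer, version)
    await producer  # rollout worker has produced everything by now
    return batches

batches = asyncio.run(main())
```

The bounded queue plus the version check is the whole trick: rollouts and optimizer steps overlap, but no batch is ever built from data the policy has drifted too far from.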

Correcting Drift with Importance Sampling

To maximize performance, modern RL frameworks use specialized inference engines like vLLM or SGLang for sampling. The problem is, these optimized engines can introduce subtle issues like precision loss. This means the log_probs calculated by the inference engine might not perfectly match those from the training framework, even for the exact same text. This seemingly small discrepancy can destabilize training and slow down convergence.

The solution, borrowed from the broader reinforcement learning world, is importance sampling. It provides a mathematical fix at the algorithm level to correct for this drift when you can't guarantee perfect consistency between training and inference. This is done by multiplying the objective function by a correction ratio.

$$\mathcal{L}_{\text{TIS}}(\theta)\;=\;-\,\mathbb{E}_{o_t \sim \pi_{\text{infer}}}\!\left[\,\min\!\left(\frac{\pi_{\theta}(o_t \mid q,\, o_{<t})}{\pi_{\text{infer}}(o_t \mid q,\, o_{<t})},\, C\right)\hat{A}_t\, \log \pi_\theta(o_t \mid q,\, o_{<t})\right]$$

A detailed derivation and explanation can be found in this TIS blog post [6].
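In code, the per-token correction amounts to exponentiating the log-prob gap and truncating it; a minimal sketch (the cap `c` and the probabilities are illustrative):

```python
import math

def tis_weight(logp_train, logp_infer, c=2.0):
    """Truncated importance-sampling weight between trainer and sampler.

    The raw ratio pi_train / pi_infer corrects for the train/inference
    mismatch; truncating it at c keeps a single outlier token from
    dominating the gradient of a batch."""
    ratio = math.exp(logp_train - logp_infer)
    return min(ratio, c)

# The inference engine slightly underestimated this token's probability:
w_small = tis_weight(logp_train=math.log(0.30), logp_infer=math.log(0.28))
# A large mismatch is truncated rather than blowing up the gradient:
w_big = tis_weight(logp_train=math.log(0.30), logp_infer=math.log(0.01))
```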

Building on this foundation, the community has also explored importance sampling for Mixture-of-Experts (MoE) models. MoE models introduce another variable: the expert router. If the set of activated experts differs between training and inference, you run into the same inconsistency problem. To stabilize RL training for MoE models, an improvement on the TIS correction was introduced that uses a per-token mask to clip the ratio. The formula is as follows:

$$k_{i,t}=\frac{\pi_\theta(o_{i,t}\mid q,\, o_{i,<t})}{\pi_{\text{infer}}(o_{i,t}\mid q,\, o_{i,<t})},\qquad \mathcal{M}(k_{i,t})=\begin{cases}k_{i,t}, & \alpha \le k_{i,t} \le \beta\\[2pt] 0, & \text{otherwise}\end{cases}$$

Tokens whose train/inference ratio falls outside the band $[\alpha,\beta]$ are masked out and contribute no gradient.

For a detailed derivation and more information, you can refer to IcePop [7] and MiniRL [8].
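A sketch of the per-token masking idea (the `low`/`high` bounds are illustrative, not the values used by IcePop):

```python
import math

def masked_ratio(logp_train, logp_infer, low=0.5, high=2.0):
    """Double-sided mask: keep the token's train/inference ratio only when
    the discrepancy is inside [low, high]; otherwise zero it out so the
    token contributes no gradient at all."""
    k = math.exp(logp_train - logp_infer)
    return k if low <= k <= high else 0.0

kept = masked_ratio(math.log(0.4), math.log(0.5))     # k = 0.8 -> kept
dropped = masked_ratio(math.log(0.4), math.log(0.1))  # k = 4.0 -> masked out
```

Unlike simple truncation, which still trains on a capped weight, masking discards the token entirely when the routing discrepancy makes its probability untrustworthy.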

Stable Training: Maintaining the Token Stream

In agentic LLM RL involving multi-turn conversations or tasks, a common pitfall is re-tokenizing the entire conversation history for each new generation. While seemingly intuitive, this practice can lead to significant training instability.

Why? Because the tokenizer might make different decisions on subsequent turns. A sequence of characters that was tokenized one way in turn 3 might be grouped differently when it's part of the history in turn 4. This can lead to bizarre, anomalous probabilities for certain tokens.

As the conversation lengthens, this tokenization instability can compound and ultimately derail the training process.

The solution is to manage the state at the token level. Instead of re-tokenizing text, the training process should maintain a persistent list of token IDs. When new feedback arrives from the environment, the corresponding new tokens are simply appended to this list. This incremental approach prevents re-tokenization issues and is crucial for stable training.
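A minimal sketch of this pattern (the `TokenStream` class and the toy whitespace tokenizer are my own illustrations, not a real tokenizer API):

```python
class ToyTokenizer:
    """Whitespace tokenizer standing in for a real BPE tokenizer."""
    def __init__(self):
        self.vocab = {}
    def encode(self, text):
        return [self.vocab.setdefault(w, len(self.vocab)) for w in text.split()]

class TokenStream:
    """Keep the conversation as an append-only list of token IDs, so the
    tokens used for training match the tokens used during generation."""
    def __init__(self, tokenizer):
        self.tokenizer = tokenizer
        self.token_ids = []  # canonical token history; never re-tokenized
    def append_text(self, text):
        new_ids = self.tokenizer.encode(text)  # tokenize only the new segment
        self.token_ids.extend(new_ids)
        return new_ids

stream = TokenStream(ToyTokenizer())
turn1 = stream.append_text("user: solve 2+2")
turn2 = stream.append_text("assistant: 4")
# stream.token_ids is exactly turn1 + turn2: the per-turn encodings line up
# token-for-token with what the model actually generated.
```

With a real BPE tokenizer, re-encoding the concatenated history is not guaranteed to equal the concatenation of per-turn encodings, which is precisely the mismatch the append-only stream avoids.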

References

[1] Zhao Shiyu, "Mathematical Principles of Reinforcement Learning" (course)
[2] Blog post on the derivations from REINFORCE through Actor-Critic and TRPO to PPO
[3] Schulman et al., "Proximal Policy Optimization Algorithms" (PPO)
[4] AgentRL
[5] AReaL
[6] TIS blog post on truncated importance sampling
[7] IcePop
[8] MiniRL


About This Article

Topic: Technology
Difficulty: Intermediate
Reading Time: 8 minutes
Last Updated: February 27, 2026
