LLM Reinforcement Learning (RL): REINFORCE, PPO, GRPO, and Production Engineering

A practical LLM Reinforcement Learning guide covering REINFORCE to PPO/GRPO derivations, plus production engineering patterns like async rollouts, importance sampling, and token-stream stability.
Chi Guo Dong Bu Tu Guo Dong Pi
Tags: LLM Reinforcement Learning · Reinforcement Learning for LLMs · GRPO · PPO · RLHF · LLM post-training

As a researcher in LLM Reinforcement Learning (RL), I have spent the past year immersed in its practical challenges and theoretical underpinnings.

Ever since the release of DeepSeek-R1 in early '25, LLM RL has become a major focus of research. In its wake, a family of PPO-style methods, building on improvements from GRPO, has proliferated. Researchers from all corners of the field are now trying to apply RL to push the boundaries of model performance.

I entered the LLM space in early '25 with limited reinforcement learning experience. Since then, I have revisited the theoretical derivations multiple times and gained hands-on experience with practical LLM Reinforcement Learning (RL), giving me a deep appreciation for the engineering hurdles involved. In this post, I will distill these learnings to guide beginners through both the theory of LLM RL and the key engineering practices from the ground up.


The Theory of Reinforcement Learning: From REINFORCE to PPO

Reinforcement learning is a paradigm of machine learning where an agent learns by doing. The core idea is simple: an agent continuously interacts with an environment, using the feedback it receives to adjust its behavior and improve its ability to solve a given problem.

This process mirrors how we humans learn new things, which gives RL powerful generalization capabilities in theory. The catch? Unlike traditional supervised learning, getting reinforcement learning models to converge can be a real challenge. Everything from the accuracy of the reward signal and the stability of parameter updates to the consistency of the environment can impact whether the training succeeds.

RL methods are typically split into two camps: Value-based and Policy-based, distinguished by what the model learns to predict. Given the massive action space of LLMs—essentially, all possible text—value-based methods are impractical. That's why we'll be focusing squarely on Policy-based approaches. (For those who want to learn RL from the very beginning, I highly recommend Professor Zhao Shiyu's course, "Mathematical Principles of Reinforcement Learning" [1]).

To make things easier to follow, let's walk through the derivation of the most classic Policy-based algorithm: REINFORCE. First, the objective function for REINFORCE is to maximize the expected reward of all possible trajectories under the current policy:

$$J(\theta) \;=\; \mathbb{E}_{\tau \sim \pi_\theta}\!\left[R(\tau)\right]$$

where $\tau$ is a trajectory sampled from the current policy $\pi_\theta$ and $R(\tau)$ is its total reward.

This aligns perfectly with our intuition for RL: the higher the expected reward of the trajectories a policy generates, the better the policy. So, when we use gradient ascent to update the model's parameters, our formula looks like this:

$$\theta \;\leftarrow\; \theta + \alpha\, \nabla_\theta J(\theta)$$

Next, let's solve for the gradient:

$$
\begin{aligned}
\nabla_\theta J(\theta) &= \nabla_\theta\, \mathbb{E}_{\tau \sim \pi_\theta}\!\left[R(\tau)\right] &&(1)\\
&= \nabla_\theta \sum_{\tau} P(\tau;\theta)\, R(\tau) &&(2)\\
&= \sum_{\tau} \nabla_\theta P(\tau;\theta)\, R(\tau) &&(3)\\
&= \sum_{\tau} P(\tau;\theta)\, \nabla_\theta \log P(\tau;\theta)\, R(\tau) &&(4)\\
&= \sum_{\tau} P(\tau;\theta) \left(\sum_{t} \nabla_\theta \log \pi_\theta(a_t \mid s_t)\right) R(\tau) &&(5)\\
&= \mathbb{E}_{\tau \sim \pi_\theta}\!\left[\sum_{t} \nabla_\theta \log \pi_\theta(a_t \mid s_t)\, R(\tau)\right] &&(6)
\end{aligned}
$$

Here's a breakdown of the steps:

  • (1) -> (2): We expand the expectation according to its definition.
  • (3) -> (4): We use the common "log-derivative trick" ($\nabla_\theta P = P\,\nabla_\theta \log P$) to introduce a logarithm.
  • (4) -> (5): The trajectory probability factorizes into policy terms and environment dynamics; the dynamics do not depend on $\theta$, so only the per-step policy log-probabilities survive the gradient.
  • (5) -> (6): We rewrite the expression back into the form of an expectation.
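The final expectation is what makes REINFORCE practical: it can be estimated by sampling trajectories. As a minimal Python sketch (toy numbers, no autograd framework; the function name is my own), the surrogate loss whose gradient matches this estimator is:

```python
import math

def reinforce_loss(log_probs, trajectory_reward):
    """Surrogate loss for one sampled trajectory:
    -R(tau) * sum_t log pi(a_t | s_t).
    Minimizing it with an autograd framework performs
    gradient ascent on the expected reward."""
    return -trajectory_reward * sum(log_probs)

# Toy trajectory: three sampled tokens with probabilities 0.5, 0.25, 0.8,
# and a scalar reward of 2.0 for the whole trajectory.
log_probs = [math.log(0.5), math.log(0.25), math.log(0.8)]
loss = reinforce_loss(log_probs, 2.0)
```

A positive reward makes the loss decrease as the sampled tokens become more likely, which is exactly the "reinforce what worked" intuition.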

The evolution of the REINFORCE algorithm eventually led to Actor-Critic, TRPO, and PPO. I won't detail the specific derivations here, but if you're interested, this blog post [2] offers a great explanation. Here is the PPO formula (with the clipping term omitted for simplicity; see the PPO paper [3] for full details):

$$J^{PPO}(\theta) \;=\; \mathbb{E}_{t}\!\left[\frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{\text{old}}}(a_t \mid s_t)}\,\hat{A}_t\right]$$
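For intuition, here is a minimal Python sketch of the per-token objective with the clipping term restored (the `eps` value and the toy numbers are illustrative):

```python
import math

def ppo_clipped_term(logp_new, logp_old, advantage, eps=0.2):
    """Per-token clipped PPO objective: min(r*A, clip(r, 1-eps, 1+eps)*A)."""
    ratio = math.exp(logp_new - logp_old)  # pi_theta / pi_theta_old
    clipped = max(1.0 - eps, min(1.0 + eps, ratio))
    return min(ratio * advantage, clipped * advantage)

# The new policy overshoots (ratio = 1.5) on a positive-advantage token,
# so the clip caps the incentive at (1 + eps) * A = 1.2.
term = ppo_clipped_term(logp_new=0.0, logp_old=-math.log(1.5), advantage=1.0)
```

The `min` with the clipped ratio is what keeps each update from moving the policy too far from the sampling policy.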

From Traditional RL to LLM Reinforcement Learning

Making the leap from traditional reinforcement learning to LLM RL doesn't require learning a whole new set of theories. It's more about mapping the familiar concepts onto the world of language models:

  • Agent (policy) → the LLM itself
  • State → the prompt plus the tokens generated so far
  • Action → emitting the next token
  • Trajectory → a complete generated response
  • Reward → a score assigned to the full response (by a reward model or a verifier)
  • Environment → the task that supplies prompts and feedback

The release of DeepSeek-R1 in early '25 really put the GRPO algorithm on the map. GRPO's key innovation is to calculate the advantage for each trajectory using relative rewards within a group of responses. This elegantly sidesteps the need for the separate Critic and Reward models that PPO relies on, which in turn dramatically reduces the GPU memory required for LLM-RL. This efficiency gain helped make RLVR (Reinforcement Learning with Verifiable Reward) one of the hottest topics of '25. The GRPO algorithm's formula is:

$$
\mathcal{J}_{\text{GRPO}}(\theta) = \mathbb{E}\!\left[\frac{1}{G}\sum_{i=1}^{G}\frac{1}{|o_i|}\sum_{t=1}^{|o_i|}\Big(\min\big(r_{i,t}(\theta)\,\hat{A}_{i,t},\ \mathrm{clip}\big(r_{i,t}(\theta),\,1-\varepsilon,\,1+\varepsilon\big)\,\hat{A}_{i,t}\big)-\beta\,\mathbb{D}_{\text{KL}}\big[\pi_\theta\,\|\,\pi_{\text{ref}}\big]\Big)\right]
$$

where $r_{i,t}(\theta)=\frac{\pi_\theta(o_{i,t}\mid q,\,o_{i,<t})}{\pi_{\theta_{\text{old}}}(o_{i,t}\mid q,\,o_{i,<t})}$ and the group-relative advantage is $\hat{A}_{i,t}=\frac{r_i-\mathrm{mean}(\{r_1,\dots,r_G\})}{\mathrm{std}(\{r_1,\dots,r_G\})}$.
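The group-relative advantage at the heart of GRPO is just a per-group z-score of the rewards. A minimal Python sketch (the function name and the `eps` stabilizer are my own):

```python
def group_relative_advantages(rewards, eps=1e-6):
    """GRPO advantage: z-score each reward within its group of G responses."""
    mean = sum(rewards) / len(rewards)
    var = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    std = var ** 0.5
    return [(r - mean) / (std + eps) for r in rewards]

# Four sampled answers to one prompt, scored by a verifier (1 = correct).
advs = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
```

Because the baseline is the group mean, no learned Critic is needed: correct answers get positive advantage, incorrect ones negative, by construction.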

Engineering Challenges in Production LLM RL

At a high level, ignoring GPU memory and runtime constraints, the LLM Reinforcement Learning loop appears straightforward: the model samples data and receives rewards, a forward pass calculates log probabilities, and a backward pass updates the model.

However, the reality is more complex: the sampling (rollout) step is exceptionally slow and resource-intensive, and it is the key bottleneck in applying RL to LLMs.

This bottleneck is why most production-grade RL frameworks adopt a decoupled architecture, separating the training and inference (sampling) pipelines. The training framework might use a distributed setup like FSDP or Megatron, while the inference framework leverages tools like vLLM or SGLang to maximize sampling throughput.

Common LLM RL Frameworks

The open-source community has produced many excellent RL frameworks, including verl, AReaL, slime, and AgentRL, to name a few.

As we've touched on, these frameworks almost universally use a decoupled architecture, though their specific implementations vary significantly. Many excellent posts already offer detailed comparisons (see framework comparison analyses), so I won't rehash them here. Instead, I'll focus on what I consider to be some of the most critical engineering improvements in modern RL.

Asynchronous Acceleration for Faster Rollouts

The rollout process in LLM RL is a significant time commitment. In a strictly on-policy setup, the model must wait for rollouts to complete before it can train. This sequential process leaves expensive GPUs idle and severely slows down the entire operation.

To circumvent this, most RL frameworks have introduced asynchronous rollout logic, often implemented using Python's asyncio to manage a buffer queue. The rollout engine continuously samples trajectories and pushes them into the buffer, while the actor and reference models pull batches of samples from this buffer to perform their forward and backward passes.

This pipelined approach ensures that with proper resource allocation, the GPUs remain utilized. Detailed explanations can be found in the papers for AgentRL [4] and AReal [5]. However, this introduces a trade-off: an asynchronously accelerated algorithm is no longer strictly on-policy. These off-policy elements can sometimes impede or destabilize convergence.

To mitigate this, frameworks employ a two-pronged strategy. First, they often include a staleness parameter or limit the buffer size to ensure the training data is not too many versions behind the current policy. Second, they apply a mathematical correction using importance sampling.
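As a rough illustration of this pattern (all names, queue sizes, and the staleness bound here are hypothetical sketches, not any specific framework's API), the producer/consumer loop might look like:

```python
import asyncio

MAX_STALENESS = 4  # train only on samples at most 4 policy versions old

async def rollout_worker(buffer, get_policy_version, n_samples=8):
    """Continuously sample trajectories and tag them with the policy version."""
    for i in range(n_samples):
        await asyncio.sleep(0)  # stand-in for slow autoregressive generation
        await buffer.put({"tokens": [i], "version": get_policy_version()})

async def trainer(buffer, policy_version, steps=4, batch_size=2):
    """Pull batches from the buffer, dropping samples that are too stale."""
    batches = []
    for _ in range(steps):
        batch = []
        while len(batch) < batch_size:
            sample = await buffer.get()
            if policy_version[0] - sample["version"] <= MAX_STALENESS:
                batch.append(sample)  # fresh enough to train on
        batches.append(batch)
        policy_version[0] += 1  # one optimizer step per batch
    return batches

async def main():
    buffer = asyncio.Queue(maxsize=16)  # bounded buffer also caps staleness
    version = [0]
    producer = asyncio.create_task(rollout_worker(buffer, lambda: version[0]))
    batches = await trainer(buffer, version)
    await producer  # rollout worker has produced everything by now
    return batches

batches = asyncio.run(main())
```

The bounded queue plus the version check is the whole trick: rollouts and optimizer steps overlap, but no batch is ever built from data the policy has drifted too far from.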

Correcting Drift with Importance Sampling

To maximize performance, modern RL frameworks use specialized inference engines like vLLM or SGLang for sampling. The problem is, these optimized engines can introduce subtle issues like precision loss. This means the log_probs calculated by the inference engine might not perfectly match those from the training framework, even for the exact same text. This seemingly small discrepancy can destabilize training and slow down convergence.

The solution, borrowed from the broader reinforcement learning world, is importance sampling. It provides a mathematical fix at the algorithm level to correct for this drift when you can't guarantee perfect consistency between training and inference. This is done by multiplying the objective function by a correction ratio.

$$\mathcal{L}_{\text{TIS}}(\theta)\;=\;-\,\mathbb{E}_{o_t \sim \pi_{\text{infer}}}\!\left[\,\min\!\left(\frac{\pi_{\theta}(o_t \mid q,\, o_{<t})}{\pi_{\text{infer}}(o_t \mid q,\, o_{<t})},\, C\right)\hat{A}_t\, \log \pi_\theta(o_t \mid q,\, o_{<t})\right]$$

A detailed derivation and explanation can be found in this TIS blog post [6].
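In code, the per-token correction amounts to exponentiating the log-prob gap and truncating it; a minimal sketch (the cap `c` and the probabilities are illustrative):

```python
import math

def tis_weight(logp_train, logp_infer, c=2.0):
    """Truncated importance-sampling weight between trainer and sampler.

    The raw ratio pi_train / pi_infer corrects for the train/inference
    mismatch; truncating it at c keeps a single outlier token from
    dominating the gradient of a batch."""
    ratio = math.exp(logp_train - logp_infer)
    return min(ratio, c)

# The inference engine slightly underestimated this token's probability:
w_small = tis_weight(logp_train=math.log(0.30), logp_infer=math.log(0.28))
# A large mismatch is truncated rather than blowing up the gradient:
w_big = tis_weight(logp_train=math.log(0.30), logp_infer=math.log(0.01))
```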

Building on this foundation, the community has also explored importance sampling for Mixture-of-Experts (MoE) models. MoE models introduce another variable: the expert router. If the set of activated experts differs between training and inference, you run into the same inconsistency problem. To stabilize RL training for MoE models, an improvement on the TIS correction was introduced that uses a per-token mask to clip the ratio. The formula is as follows:

$$k_{i,t}=\frac{\pi_\theta(o_{i,t}\mid q,\, o_{i,<t})}{\pi_{\text{infer}}(o_{i,t}\mid q,\, o_{i,<t})},\qquad \mathcal{M}(k_{i,t})=\begin{cases}k_{i,t}, & \alpha \le k_{i,t} \le \beta\\[2pt] 0, & \text{otherwise}\end{cases}$$

Tokens whose train/inference ratio falls outside the band $[\alpha,\beta]$ are masked out and contribute no gradient.

For a detailed derivation and more information, you can refer to IcePop [7] and MiniRL [8].
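A sketch of the per-token masking idea (the `low`/`high` bounds are illustrative, not the values used by IcePop):

```python
import math

def masked_ratio(logp_train, logp_infer, low=0.5, high=2.0):
    """Double-sided mask: keep the token's train/inference ratio only when
    the discrepancy is inside [low, high]; otherwise zero it out so the
    token contributes no gradient at all."""
    k = math.exp(logp_train - logp_infer)
    return k if low <= k <= high else 0.0

kept = masked_ratio(math.log(0.4), math.log(0.5))     # k = 0.8 -> kept
dropped = masked_ratio(math.log(0.4), math.log(0.1))  # k = 4.0 -> masked out
```

Unlike simple truncation, which still trains on a capped weight, masking discards the token entirely when the routing discrepancy makes its probability untrustworthy.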

Stable Training: Maintaining the Token Stream

In agentic LLM RL involving multi-turn conversations or tasks, a common pitfall is re-tokenizing the entire conversation history for each new generation. While seemingly intuitive, this practice can lead to significant training instability.

Why? Because the tokenizer might make different decisions on subsequent turns. A sequence of characters that was tokenized one way in turn 3 might be grouped differently when it's part of the history in turn 4. This can lead to bizarre, anomalous probabilities for certain tokens.

As the conversation lengthens, this tokenization instability can compound and ultimately derail the training process.

The solution is to manage the state at the token level. Instead of re-tokenizing text, the training process should maintain a persistent list of token IDs. When new feedback arrives from the environment, the corresponding new tokens are simply appended to this list. This incremental approach prevents re-tokenization issues and is crucial for stable training.
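A minimal sketch of this pattern (the `TokenStream` class and the toy whitespace tokenizer are my own illustrations, not a real tokenizer API):

```python
class ToyTokenizer:
    """Whitespace tokenizer standing in for a real BPE tokenizer."""
    def __init__(self):
        self.vocab = {}
    def encode(self, text):
        return [self.vocab.setdefault(w, len(self.vocab)) for w in text.split()]

class TokenStream:
    """Keep the conversation as an append-only list of token IDs, so the
    tokens used for training match the tokens used during generation."""
    def __init__(self, tokenizer):
        self.tokenizer = tokenizer
        self.token_ids = []  # canonical token history; never re-tokenized
    def append_text(self, text):
        new_ids = self.tokenizer.encode(text)  # tokenize only the new segment
        self.token_ids.extend(new_ids)
        return new_ids

stream = TokenStream(ToyTokenizer())
turn1 = stream.append_text("user: solve 2+2")
turn2 = stream.append_text("assistant: 4")
# stream.token_ids is exactly turn1 + turn2: the per-turn encodings line up
# token-for-token with what the model actually generated.
```

With a real BPE tokenizer, re-encoding the concatenated history is not guaranteed to equal the concatenation of per-turn encodings, which is precisely the mismatch the append-only stream avoids.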

References

[1] Zhao Shiyu, "Mathematical Principles of Reinforcement Learning" (course)
[2] Blog post on the derivations from REINFORCE through Actor-Critic and TRPO to PPO
[3] Schulman et al., "Proximal Policy Optimization Algorithms" (PPO)
[4] AgentRL
[5] AReaL
[6] TIS blog post on truncated importance sampling
[7] IcePop
[8] MiniRL


About This Article

Topic: Technology
Difficulty: Intermediate
Reading Time: 8 minutes
Last Updated: February 27, 2026
