Why Your LLM RL Training Keeps Crashing: 6 Months of Hard Lessons

After 6 months of LLM RL training failures and breakthroughs, I share battle-tested solutions for training collapse, GRPO instability, exploration bottlenecks, and why Thinking models need special handling. Practical fixes you can apply today.
Qing Ke Ai
10 min read
Tags: LLM reinforcement learning, GRPO training, RL training stability, post-training LLMs, PPO vs GRPO, Thinking models, LLM exploration efficiency

When post-training Large Language Models (LLMs) with Reinforcement Learning (RL), two critical challenges emerge: exploration efficiency and training stability. After six months of intensive work in this area, I have gathered numerous insights from both successes and failures. As I transition to a new field, I want to share these hard-won experiences with the community. These observations are based on my personal understanding and should be considered as such.

Challenges in LLM Exploration Efficiency with RL

The first major challenge is exploration efficiency. For many researchers, the RL pipeline for LLMs can seem incredibly heavyweight. It requires managing a suite of models: the current policy model to calculate log probabilities (logprobs), a reference model for baseline logprobs, a system to store old logprobs, a critic model for Proximal Policy Optimization (PPO), and potentially a separate reward model.

That's four models plus a logprob record, a complexity that can be daunting. This is before even considering training efficiency, where more models introduce more potential bottlenecks and pipeline bubbles.

A classic example is the synchronization interval between the rollout and training phases. With Sync=1, you explore for one step to sample data, then use that batch to train for one step. In this scenario, achieving even 50% machine utilization is optimistic.

Increasing the sync interval means training on off-policy data: you explore for multiple steps before each training run. The drawback is that this can introduce training instability. GRPO's importance sampling was designed to address this, but it is not a panacea; it does not guarantee that training will not collapse.
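To make the mechanism concrete, here is a minimal sketch of the clipped importance-sampling surrogate that GRPO inherits from PPO, assuming per-token log-probs and advantages have already been computed elsewhere. The function name and the eps default are illustrative, not taken from any particular framework.

```python
# Minimal sketch (not any specific library's implementation) of the clipped
# importance-sampling objective that limits how far off-policy updates can drift.
import torch

def clipped_surrogate_loss(logp_new, logp_old, advantages, eps=0.2):
    """logp_new / logp_old: per-token log-probs of the sampled tokens,
    shape (batch, seq_len); advantages: per-token advantages, same shape."""
    ratio = torch.exp(logp_new - logp_old)                 # importance weight per token
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1 - eps, 1 + eps) * advantages
    # Taking the element-wise minimum means tokens whose ratio has left the
    # trust region contribute no extra gradient in the favorable direction.
    return -torch.min(unclipped, clipped).mean()
```

The clipping bounds how much any single off-policy batch can move the policy, which is exactly why a larger sync interval is survivable at all; it is not, however, a guarantee against collapse.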

Exploration latency in agent-based environments presents another significant hurdle. In the Webshop environment, for instance, spinning up 32 runners to create 32 environments for parallel exploration can require around 1.7TB of memory. The Retrieve step is also a major CPU bottleneck with high latency.

Tasks like math problems and Alfworld are more manageable, but the exploration costs for Mobile and GUI Agents are astronomical—far beyond my available resources.

This naturally leads to a workaround: instead of using real phones for sampling when building a Mobile Agent, one can mock an environment that mimics a phone's interface and captures screenshots. This allows for running multiple runners at a low sampling cost. The downside, of course, is that the simulated environment often lacks fidelity, and maintaining and updating the simulation to cover all corner cases is a considerable challenge.
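For illustration only, here is what the skeleton of such a mocked environment might look like. Every name here (MockPhoneEnv, the action strings, the reward rule) is hypothetical; the hard part in practice is encoding enough app-specific transition rules to stay faithful to a real device.

```python
# Illustrative-only sketch of a mocked "phone" environment; not a real API.
from dataclasses import dataclass, field

@dataclass
class MockPhoneEnv:
    screen: str = "home"
    history: list = field(default_factory=list)

    def reset(self):
        self.screen, self.history = "home", []
        return self._observe()

    def step(self, action: str):
        # A real mock would encode app-specific transition rules here;
        # covering every corner case faithfully is the hard part.
        self.history.append(action)
        if action.startswith("open:"):
            self.screen = action.split(":", 1)[1]
        reward = 1.0 if self.screen == "settings" else 0.0
        done = self.screen == "settings"
        return self._observe(), reward, done

    def _observe(self):
        # Stands in for a rendered screenshot plus an accessibility tree.
        return {"screen": self.screen, "text": f"You are on the {self.screen} screen."}
```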

Ironically, covering those corner cases is precisely RL's biggest selling point. Unlike Supervised Fine-Tuning (SFT), with RL, you only need to set up the environment and the prompts. Then, you can hand the process over to GRPO and the LLM to explore exhaustively. It can automatically discover and handle out-of-distribution (OOD) scenarios and corner cases.

From another perspective, a core goal in RL is to gather as many diverse, positive samples as possible. However, a potential pitfall is that traditional data synthesis mindsets do not always translate directly to RL.

For example, if you synthesize data using environmental feedback and natural language reflection, the logprob of the resulting trajectory is calculated with that extra context. This creates a mismatch with the logprob from direct inference, which only uses the original question. Theoretically, this means you cannot directly apply importance sampling.

A clever technique to circumvent this is called context distillation. That said, in practice, it is sometimes possible to proceed without importance sampling. The bottom line is that using various data synthesis strategies to boost the count of positive samples is almost always a beneficial move.
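A rough sketch of the idea, under the assumption that you simply re-score the discovered answer under the plain question: the answer was found with feedback and reflection in context, but the log-probs used for training are computed against the original prompt only. The function name is mine, and the model/tokenizer loading is only indicated in comments.

```python
# Sketch of the "score under the clean prompt" idea behind context distillation.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# tok = AutoTokenizer.from_pretrained(model_name)            # any HF causal LM
# model = AutoModelForCausalLM.from_pretrained(model_name)

def logprobs_under_prompt(model, tok, prompt: str, answer: str):
    """Per-token log-probs of `answer` conditioned only on `prompt`."""
    prompt_ids = tok(prompt, return_tensors="pt").input_ids
    answer_ids = tok(answer, add_special_tokens=False, return_tensors="pt").input_ids
    input_ids = torch.cat([prompt_ids, answer_ids], dim=1)
    with torch.no_grad():
        logits = model(input_ids).logits
    # Logits at position t predict token t+1, so shift by one.
    logp = torch.log_softmax(logits[:, :-1], dim=-1)
    answer_logp = logp[:, prompt_ids.shape[1] - 1:, :].gather(
        2, answer_ids.unsqueeze(-1)).squeeze(-1)
    return answer_logp  # shape (1, answer_len)

# Usage: the answer was discovered with reflection in-context, but training uses
# logprobs_under_prompt(model, tok, original_question, discovered_answer).
```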

Ensuring Training Stability in LLM Reinforcement Learning

Next, we consider training stability. Many practitioners discover that RL for LLMs does not scale as gracefully as pre-training or SFT. A training run might proceed for several thousand steps and then suddenly collapse, with key metrics such as entropy, KL divergence, reward, PPO loss, and output length deviating sharply. Once this occurs, the model often fails to recover, regardless of additional data or compute.

I believe the technical report for either DeepSeek 3.2 or Qwen3 mentioned that the RL stage for training reasoning abilities used just 4,000 data samples, trained over multiple epochs. On one hand, the fact that reasoning can be trained with so little data demonstrates how incredibly data-efficient RL can be. On the other, it underscores the severity of the scaling challenges.

Another common stability issue is collapse during GRPO training. The sources of instability are numerous and often subtle. I will summarize a few I have encountered.

Infrastructure and Precision Issues

First, at the infrastructure level, many teams use vLLM or SGLang for exploration. However, due to floating-point precision differences and certain bugs, the sequence logprobs they generate are not always identical to those from a standard Hugging Face inference pipeline. (I believe there are GitHub issues discussing and addressing this).

The most common symptom: with Sync=1, all data should theoretically be on-policy, yet a significant fraction of tokens gets clipped by importance sampling, and this clipped ratio often worsens as training progresses. A practical workaround is to avoid using the vLLM-generated logprobs directly. Instead, recompute them with a prefill (forward) pass through the Hugging Face copy of the policy to obtain the 'true' logprobs for training.
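A minimal sketch of that workaround, assuming the prompts in the batch share the same length (or batch size is 1); the function name and shapes are illustrative rather than a specific framework's API.

```python
# Ignore the logprobs returned with each vLLM sample and recompute the "old"
# logprobs with one prefill pass through the HF copy of the same policy weights.
import torch

@torch.no_grad()
def recompute_old_logprobs(hf_policy, input_ids, prompt_len):
    """input_ids: prompt + sampled response token ids, shape (batch, seq_len).
    Returns per-token log-probs of the sampled response under hf_policy.
    Assumes all prompts in the batch have length prompt_len."""
    logits = hf_policy(input_ids).logits[:, :-1]            # predict tokens 1..T-1
    targets = input_ids[:, 1:]
    logp = torch.log_softmax(logits, dim=-1).gather(2, targets.unsqueeze(-1)).squeeze(-1)
    return logp[:, prompt_len - 1:]                         # keep only response positions

# With Sync=1 these should match the training model's fresh logprobs almost
# exactly, so importance ratios stay near 1 and nothing gets spuriously clipped.
```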

Choosing the Right Loss Function

Regarding the loss function, you have a choice between sequence-level and token-level loss. Two representative works here are GSPO and DAPO. In my experience, GSPO converges slightly slower for dense models but offers better stability. Furthermore, GSPO is optimized for Mixture-of-Experts (MoE) models, so if you're working with MoE models, GSPO is the clear choice.

In other scenarios, the differences between GRPO, DAPO, and Reinforce++ are less pronounced. Keep in mind that DAPO has limitations with long sequences, so it might struggle in multi-turn, non-mathematical dialogue tasks.
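To show where the two families differ mechanically, here is a simplified contrast between per-token ratios (the GRPO/DAPO style) and a length-normalized, per-sequence ratio in the spirit of GSPO. This is a sketch of the ratio computation only, not the exact published objectives.

```python
# Contrast sketch: token-level vs sequence-level (length-normalized) ratios.
import torch

def token_level_ratios(logp_new, logp_old):
    # One importance weight per token: a few outlier tokens can dominate the loss.
    return torch.exp(logp_new - logp_old)                   # (batch, seq_len)

def sequence_level_ratios(logp_new, logp_old, mask):
    # One weight per sequence: average the per-token log-ratio over valid tokens,
    # then exponentiate, i.e. a length-normalized sequence-level ratio.
    diff = (logp_new - logp_old) * mask
    mean_diff = diff.sum(dim=1) / mask.sum(dim=1).clamp(min=1)
    return torch.exp(mean_diff)                              # (batch,)
```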

The Impact of Output Length and Filtering

Setting the output token length improperly can also trigger a collapse. For instance, if a task only requires 200 tokens per turn but you set the generation limit to 8192, you risk destabilizing the training process.

This is because if the model generates an excessively long, collapsed, or repetitive output during a rollout, that part of the trajectory can have an outsized impact on the token-level loss. If a smaller generation limit is sufficient for the task, use it. If you genuinely need a large output length, you must be vigilant about filtering out extra-long trajectories and other outliers.
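A minimal sketch of that filtering step, applied before the loss; the thresholds and the trajectory field name are illustrative and should be tuned to the task.

```python
# Drop rollouts that hit the generation cap or blow far past a task-appropriate length.
def keep_trajectory(traj, max_reasonable_tokens=512, hard_limit=8192):
    n = len(traj["response_token_ids"])
    if n >= hard_limit:            # hit the cap: almost certainly truncated or degenerate
        return False
    if n > max_reasonable_tokens:  # far longer than the task needs: likely rambling
        return False
    return True

# Usage: filtered_batch = [t for t in rollout_batch if keep_trajectory(t)]
```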

Stabilizing Multi-turn Agents

When training multi-turn agents with smaller LLMs, their limited capabilities can cause them to lose track of the goal. A good practice is to restate the original objective and the last few actions in the prompt for each new turn.
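One way to do this, shown below as an illustrative prompt builder; the template wording, tag-free format, and the choice of k=3 recent actions are all just examples.

```python
# Restate the original objective and the last few actions at every turn so a
# small model does not lose track of the goal mid-episode.
def build_turn_prompt(goal: str, actions: list[str], observation: str, k: int = 3) -> str:
    recent = "\n".join(f"- {a}" for a in actions[-k:]) or "- (none yet)"
    return (
        f"Your overall goal: {goal}\n"
        f"Your last actions:\n{recent}\n"
        f"Current observation:\n{observation}\n"
        "Decide the next single action."
    )
```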

As for the Sync value, larger is not always worse. I have seen cases where Sync=10 actually outperformed Sync=1. However, be cautious with fully asynchronous training. If you go that route, it is wise to pair it with a Priority Buffer to ensure newer, more relevant data is prioritized, which helps with stability.
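For the fully asynchronous case, here is a hedged sketch of a recency-weighted priority buffer: newer rollouts are sampled more often, so the policy mostly trains on data close to its current behavior. The decay-based weighting and the class name are illustrative choices, not a standard implementation.

```python
# Recency-weighted buffer for asynchronous rollout/training loops.
import random
from collections import deque

class PriorityBuffer:
    def __init__(self, capacity=4096, decay=0.9):
        self.buf = deque(maxlen=capacity)   # holds (policy_version, trajectory) pairs
        self.decay = decay

    def add(self, policy_version: int, traj):
        self.buf.append((policy_version, traj))

    def sample(self, batch_size: int, current_version: int):
        # Weight each item by decay^(age in policy versions): newer data dominates.
        items = list(self.buf)
        weights = [self.decay ** (current_version - v) for v, _ in items]
        picks = random.choices(items, weights=weights, k=batch_size)
        return [traj for _, traj in picks]
```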

The Role of Positive Sample Amplification

If your model's success rate on a specific task is low, you cannot just apply GRPO and hope for the best. You need to find ways to amplify the signal from positive samples in your loss calculation. This could mean using token-level filtering, sample-level filtering, or simply increasing the advantage weight of positive samples. If you fail to do this, negative samples will dominate the gradient, and the training will likely collapse.

Why is this so important? In traditional RL, the action space is small. Suppressing a wrong action naturally shifts the probability mass toward the correct ones. But in LLM-based RL, the 'action space' is vast—the entire vocabulary size raised to the power of the sequence length. Suppressing the probability of one bad sequence does not guarantee that probability will flow to a good one. That probability mass gets redistributed across a chaotic, high-dimensional space. This is precisely why you must amplify the signal from positive samples.
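The simplest form of amplification is a reweighting of positive advantages before they enter the policy loss, sketched below; the weight factor is an illustrative knob, and token-level or sample-level filtering are alternative (or complementary) ways to achieve the same effect.

```python
# Up-weight positive advantages so they are not drowned out by negative samples.
import torch

def amplify_positive_advantages(advantages: torch.Tensor, pos_weight: float = 2.0):
    # advantages: per-token (or per-sample) advantage estimates.
    return torch.where(advantages > 0, advantages * pos_weight, advantages)
```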

A quick note on PPO: if you have a verifiable, objective reward signal, it is often better to avoid PPO. A good rule of thumb is to use PPO for subjective tasks and GRPO for objective ones. This is because the critic model's value predictions can be unreliable, especially when dealing with ambiguous or conflicting data.

Base Model Selection for LLM Fine-Tuning with RL

If your goal is to publish a paper, your best bet is to start with the Instruct or Base models from series like Qwen2.5 or Mistral. Always perform a supervised fine-tuning (SFT) phase as a 'cold start' before diving into RL.

A specific note on the Qwen model: its <think> tag is not a single, atomic token in its vocabulary, so the model might not generate it reliably. Do not feel locked into using <think>; you can easily substitute it with a more vocabulary-friendly token to represent the thinking step.
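Whether the tag is atomic varies by model and version, so it is worth checking the exact checkpoint you plan to train. A quick check with the Hugging Face tokenizer looks like this (the model name below is just an example):

```python
# Verify how the tokenizer splits the <think> tag for your specific checkpoint.
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-7B-Instruct")
print(tok.tokenize("<think>"))                       # multiple pieces => not an atomic token
print(tok.convert_tokens_to_ids(tok.tokenize("<think>")))
# If it is not atomic, consider marking the thinking step with a token that
# already maps to a single id in the vocabulary.
```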

My advice is to steer clear of the Llama series for this kind of RL fine-tuning work. Its inherent Chain-of-Thought (CoT) capabilities from pre-training are not as strong, and the results from RL fine-tuning can be highly unpredictable. The Qwen series, on the other hand, has already been post-trained to exhibit 'thinking' behaviors, which makes it a much more fertile ground for further RL training.

Post-training 'Thinking' Models: A Unique RL Challenge

The post-training of 'Thinking' or 'Reasoning' models deserves its own section because it presents a unique set of challenges. It is often more difficult to apply RL directly to an existing 'Thinking' model than it is to develop this capability in a standard Instruct model from scratch.

The reason is that their 'multi-turn' conversations are not true multi-turn dialogues. They are actually a series of concatenated single-turn dialogues where the context is modified at each step.

Take Qwen3, for example. To reduce context length, it deletes the thinking block from Turn 1 before generating Turn 2. This seemingly minor detail is incompatible with the assumptions of standard multi-turn GRPO. Standard GRPO is designed to work on a single, continuous multi-turn trajectory, masking out inputs and propagating gradients from the end back to the beginning. But Qwen3's context modification shatters this assumption. The training process becomes: train on Turn 1, then train on Turn 2 with a modified context, and so on.
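The practical consequence is that each turn has to be trained as its own sample against its own rebuilt context. The sketch below assumes a Qwen3-style format where thinking appears as <think>...</think> inside assistant turns; the <user>/<assistant> markup is placeholder formatting, not the real chat template, and the function names are mine.

```python
# Build one (context, target) training sample per assistant turn, with earlier
# turns' thinking blocks stripped from the context, mirroring the Qwen3-style
# context modification described above.
import re

THINK_RE = re.compile(r"<think>.*?</think>\s*", flags=re.DOTALL)

def strip_thinking(text: str) -> str:
    return THINK_RE.sub("", text)

def build_turn_samples(turns):
    """turns: list of (user_msg, assistant_msg_with_thinking) tuples."""
    samples = []
    for t in range(len(turns)):
        context_parts = []
        for u, a in turns[:t]:
            # Earlier assistant turns appear WITHOUT their thinking blocks.
            context_parts += [f"<user>{u}</user>",
                              f"<assistant>{strip_thinking(a)}</assistant>"]
        context_parts.append(f"<user>{turns[t][0]}</user>")
        target = turns[t][1]              # the current turn keeps its thinking
        samples.append(("\n".join(context_parts), target))
    return samples
```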

This means that any paper or algorithm designed for standard multi-turn GRPO needs to be re-evaluated for this paradigm. Attempting to apply a generic multi-turn training loop to a model like Qwen3 is unlikely to succeed.

Things get even more complex with multiple tool calls. Suppose Turn 1 contains three tool calls: the Kimi model, for instance, might retain the Thinking block before each of those calls, but when generating Turn 2 it deletes all of the Turn-1 Thinking blocks. The prompt templates for modern models are becoming astonishingly complex.

If you do not know the specific RL algorithm and context modification rules a 'Thinking' model was originally trained with, attempting direct post-training is extremely risky.

Another often-overlooked detail is the sampling temperature for these models. Major open-source models all specify recommended temperature and TopP settings for inference. But do you need to follow these during RL training? My recommendation is to sample at a temperature of 1.0 or the officially recommended setting during training, and then switch to the official guidelines for inference and evaluation.
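In vLLM terms, that split might look like the following; the evaluation values here are placeholders, so substitute the recommended settings from your model's card.

```python
# Separate sampling settings for RL rollouts vs evaluation/inference.
from vllm import SamplingParams

train_sampling = SamplingParams(temperature=1.0, top_p=1.0, max_tokens=2048)   # exploration
eval_sampling = SamplingParams(temperature=0.7, top_p=0.8, max_tokens=2048)    # per model card
```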

I hope these field notes prove useful to others navigating this challenging but exciting area of research.
