Over the past few months, I have been deeply involved in training Reinforcement Learning (RL) agents for scenarios ranging from search to data analysis. My work has spanned everything from small dense models to massive MoE architectures, in both single-source and mixed-data environments. This experience has provided a clear view of what succeeds and what fails in practical RL training. Here are some key lessons from the field:
- **RL Training Stability is Everything.** In any production-level scenario, stability is the absolute prerequisite for successful Reinforcement Learning. A solid RL pipeline should run like clockwork, training efficiently over long periods; that is the only way to achieve true scaling. Poor stability does not just slow you down; it burns through time and resources. I have spent a lot of time exploring this, running comparative experiments on training-inference mismatch, ppo-ewma, and other factors. Many of these findings have since become the default configurations in our agentic RL training pipelines.
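
  For illustration, one stability metric worth logging continuously is the gap between the log-probabilities returned by the rollout/inference engine and those recomputed by the training forward pass. The sketch below is a minimal, hypothetical example of how such a check might look; the input arrays and the k3-style KL estimate are assumptions, not a description of our exact pipeline.

  ```python
  import numpy as np

  def mismatch_stats(inference_logprobs: np.ndarray,
                     trainer_logprobs: np.ndarray) -> dict:
      """Compare per-token logprobs of the sampled tokens as seen by the
      inference engine vs. the trainer. A growing gap is an early warning
      of training-inference mismatch."""
      diff = trainer_logprobs - inference_logprobs
      ratio = np.exp(diff)  # per-token importance ratio
      return {
          "mean_abs_logprob_diff": float(np.abs(diff).mean()),
          "approx_kl": float((ratio - 1.0 - diff).mean()),  # k3 estimator, >= 0
          "max_ratio": float(ratio.max()),
      }

  # Log these every rollout batch; alert if they drift upward over time.
  print(mismatch_stats(np.array([-1.2, -0.7, -2.3]),
                       np.array([-1.1, -0.9, -2.3])))
  ```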
- **Agentic RL is Fundamentally Traditional RL.** While related to Reasoning RL, agentic RL shares more DNA with classic reinforcement learning. If we consider RL's three core components (environment, reward, and algorithm), their typical priority in Reasoning RL is algorithm > reward > environment. In practical agent applications, however, this hierarchy is inverted: environment > reward > algorithm.
- **Robust Tool-Handling in the RL Environment is Non-Negotiable.** An RL training environment that cannot handle tool calls reliably will undermine the entire process. It leads to significant experimental costs and limits the model's ultimate potential. Our RL training pipelines now constantly monitor the failure rates of all tool calls. Before beginning RL training for a new scenario, our first priority is to resolve any tool-calling issues to ensure the environment can manage large-scale, concurrent invocations.
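
  As a concrete illustration, a thin wrapper around tool execution that tracks per-tool failure rates is cheap to add. The class and method names below are hypothetical, not a specific framework's API.

  ```python
  import time
  from collections import defaultdict

  class ToolCallMonitor:
      """Track success/failure counts and total latency for every tool."""

      def __init__(self):
          self.calls = defaultdict(lambda: {"ok": 0, "failed": 0, "latency_s": 0.0})

      def run_tool(self, name, fn, *args, **kwargs):
          start = time.monotonic()
          try:
              result = fn(*args, **kwargs)
              self.calls[name]["ok"] += 1
              return result
          except Exception:
              self.calls[name]["failed"] += 1
              raise
          finally:
              self.calls[name]["latency_s"] += time.monotonic() - start

      def failure_rates(self):
          return {name: c["failed"] / max(1, c["ok"] + c["failed"])
                  for name, c in self.calls.items()}
  ```

  Pushing these failure rates to the same dashboard as the reward curves makes a flaky tool backend visible before it silently corrupts a run.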
- **LLM-as-Judge Can Cause Reward Hacking.** In agent scenarios that lack easily verifiable rewards (unlike math or code generation), using an LLM-as-judge creates a significant risk of reward hacking. For instance, I recall a month in which our test-set scores skyrocketed on three separate occasions. Each time, we later discovered the agent had not improved its core capability but had instead found a new exploit in the reward system.
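
  One mundane safeguard that helps here: keep a small frozen audit set with trusted labels, and whenever the judge-based score jumps abruptly, check the judge's agreement on that set before believing the new number. The threshold and helper names below are purely illustrative.

  ```python
  def suspicious_jump(score_history, threshold=0.05):
      """Flag a new score that beats every previous score by a wide margin."""
      if len(score_history) < 2:
          return False
      return score_history[-1] - max(score_history[:-1]) > threshold

  def judge_agreement(judge_labels, trusted_labels):
      """Agreement between the LLM judge and trusted labels on the audit set."""
      hits = sum(int(j == t) for j, t in zip(judge_labels, trusted_labels))
      return hits / len(trusted_labels)

  if suspicious_jump([0.41, 0.44, 0.43, 0.57]):
      # Low agreement usually means the policy found a judge exploit,
      # not a genuine capability gain.
      print("judge agreement:", judge_agreement([1, 1, 0, 1], [1, 0, 0, 1]))
  ```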
- **Be Wary of Manual Reward Engineering.** Keep manual reward engineering to an absolute minimum. If you cannot avoid it, be prepared to iterate relentlessly, because reward hacking is always lurking around the corner.
- **Align Training and Evaluation Environments.** Mismatched RL environments can be highly deceptive. If perfect alignment is not possible (e.g., due to different tool requirements), you must verify that the evaluation environment's scores behave as expected. Otherwise, you risk misdiagnosing the problem, attempting to fix a training issue when the root cause is a discrepancy in the evaluation setup, such as a tool that truncates outputs and artificially depresses scores.
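
  One low-effort safeguard is a parity probe: replay a fixed set of tool calls through both environments and diff the outputs. Everything below (the `probe_parity` helper, the `call` interface, the truncation heuristic) is an illustrative assumption rather than a description of an actual harness.

  ```python
  def probe_parity(train_env, eval_env, probes):
      """Replay identical tool calls in both environments and report mismatches.

      `probes` is a list of dicts like {"tool": "search", "args": {"query": "..."}};
      both env objects are assumed to expose a `call(tool, **args)` method.
      """
      issues = []
      for p in probes:
          out_train = str(train_env.call(p["tool"], **p["args"]))
          out_eval = str(eval_env.call(p["tool"], **p["args"]))
          if out_train != out_eval:
              note = f"{p['tool']}: outputs differ"
              # A much shorter eval output often means silent truncation.
              if len(out_eval) < 0.5 * len(out_train):
                  note += " (eval output much shorter; possible truncation)"
              issues.append(note)
      return issues
  ```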
- **Build a Solid Foundation First.** Avoid running ablation studies on data and algorithms until the environment and reward systems are stable. These foundational RL components must be thoroughly monitored and stress-tested to confirm their readiness for rigorous experimentation.
- **When You Have the Compute, Use It for RL Scaling.** With sufficient GPU resources, ppo-ewma can be a better choice than on-policy methods for pushing the performance envelope. To get the most out of it, scale up your batch size and group size to expand your RL compute, which will boost both learning efficiency and exploration.
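
  For reference, the core idea behind ppo-ewma is to keep an exponential moving average of the policy weights and use it as the proximal (anchor) policy in the importance ratio, which makes the update far less sensitive to how rollouts are batched. Below is a minimal PyTorch-style sketch under that assumption; the decay value is illustrative only.

  ```python
  import copy
  import torch

  def make_ewma_policy(policy: torch.nn.Module) -> torch.nn.Module:
      """Frozen copy of the policy that will hold the EWMA of its weights."""
      ewma = copy.deepcopy(policy)
      for p in ewma.parameters():
          p.requires_grad_(False)
      return ewma

  @torch.no_grad()
  def update_ewma(ewma, policy, beta=0.99):
      """After each optimizer step: ewma <- beta * ewma + (1 - beta) * policy."""
      for p_ewma, p in zip(ewma.parameters(), policy.parameters()):
          p_ewma.mul_(beta).add_(p, alpha=1.0 - beta)

  def importance_ratio(logp_new: torch.Tensor, logp_ewma: torch.Tensor) -> torch.Tensor:
      """PPO-style ratio computed against the EWMA policy instead of a fixed
      snapshot of the pre-update policy."""
      return torch.exp(logp_new - logp_ewma)
  ```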
- **The RL Grokking Phenomenon is Real.** RL grokking, where a model suddenly generalizes after a long period of stagnation, deserves more attention. Over the past few months, I have seen it happen numerous times, and it seems particularly pronounced in on-policy experiments.
- **Monitor Tool-Level Exploration.** If your environment provides the agent with multiple tools or files, you have to track how they are being used. If an agent consistently underutilizes or ignores a critical tool, it is effectively tying one hand behind its back, capping its potential performance.
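
  A simple way to keep an eye on this is to log the distribution of tool invocations per rollout batch. The sketch below is hypothetical, but it captures the kind of metric meant here: per-tool usage share plus the entropy of the tool distribution.

  ```python
  import math
  from collections import Counter

  def tool_usage_report(rollouts):
      """Summarize which tools the agent actually used across a batch.

      Each rollout is the ordered list of tool names it invoked.
      """
      counts = Counter(name for rollout in rollouts for name in rollout)
      total = sum(counts.values()) or 1
      shares = {name: n / total for name, n in counts.items()}
      # Entropy near zero means the agent has collapsed onto one tool.
      entropy = -sum(s * math.log(s) for s in shares.values() if s > 0)
      return {"shares": shares, "tool_entropy": entropy}

  print(tool_usage_report([["search", "read_file"], ["search"], ["search"]]))
  ```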
- **Larger Models Learn Faster in RL Training.** Avoid forcing RL to work on small models with clever workarounds. These techniques often fail to scale and become dead ends when transitioning to larger architectures. In my experience, RL training generalizes faster and more effectively on larger models.
- **Trust the Process in Continual RL.** In continual RL, do not be alarmed by low initial entropy. It is not always necessary to use techniques like clip-higher to force exploration. Provided the model receives high-quality data, it will often relax its own probability distribution, naturally increasing exploration and entropy over time.
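
  A cheap way to verify that this relaxation is actually happening is to track mean token entropy over training rather than intervening immediately. A minimal sketch, assuming per-token logits from the training forward pass and a response-token mask.

  ```python
  import torch

  def mean_token_entropy(logits: torch.Tensor, mask: torch.Tensor) -> float:
      """Average policy entropy over response tokens.

      logits: [batch, seq_len, vocab]; mask: [batch, seq_len], 1 on response
      tokens. A slow, steady rise from a low starting value is the healthy
      pattern described above.
      """
      logp = torch.log_softmax(logits.float(), dim=-1)
      entropy = -(logp.exp() * logp).sum(dim=-1)        # [batch, seq_len]
      return float((entropy * mask).sum() / mask.sum().clamp(min=1))
  ```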