12 Practical Lessons for RL Training: Hard-Won Insights from Production

Discover 12 battle-tested lessons from months of production RL training. Learn why stability trumps everything, how agentic RL differs from reasoning RL, and practical strategies to avoid reward hacking in LLM training pipelines.
Chi Guo Dong Bu Tu Guo Dong Pi
4 min read
Tags: Reinforcement Learning, RL training, agentic RL, reward hacking, PPO, RLHF, LLM training, RL scaling

Over the past few months, I have been deeply involved in training Reinforcement Learning (RL) agents for scenarios ranging from search to data analysis. My work has spanned everything from small dense models to massive MoE architectures, in both single-source and mixed-data environments. This experience has provided a clear view of what succeeds and what fails in practical RL training. Here are some key lessons from the field:

  1. RL Training Stability is Everything. In any production-level scenario, stability is the absolute prerequisite for successful Reinforcement Learning. A solid RL pipeline should run like clockwork, training efficiently over long periods; that is the only way to achieve true scaling. Poor stability does not just slow you down; it burns through time and resources. I have spent a lot of time exploring this, running comparative experiments on training-inference mismatch, PPO-EWMA, and other factors (a minimal mismatch-monitoring sketch follows this list). Many of these findings have since become the default configurations in our agentic RL training pipelines.

  2. Agentic RL is Fundamentally Traditional RL. While related to Reasoning RL, agentic RL shares more DNA with classic reinforcement learning. If we consider RL's three core components—environment, reward, and algorithm—their typical priority in Reasoning RL is algorithm > reward > environment. In practical agent applications, however, this hierarchy is inverted: environment > reward > algorithm.

  3. Robust Tool-Handling in the RL Environment is Non-Negotiable. An RL training environment that cannot handle tool calls reliably will undermine the entire process. It leads to significant experimental costs and limits the model's ultimate potential. Our RL training pipelines now constantly monitor the failure rates of all tool calls (see the tool-call tracker sketched after this list). Before beginning RL training for a new scenario, our first priority is to resolve any tool-calling issues to ensure the environment can manage large-scale, concurrent invocations.

  4. LLM-as-Judge Can Cause Reward Hacking. In agent scenarios that lack easily verifiable rewards (unlike math or code generation), using an LLM-as-judge creates a significant risk of reward hacking. For instance, I recall a month where our test-set scores skyrocketed on three separate occasions. Each time, we later discovered the agent had not improved its core capability but had instead found a new exploit in the reward system.

  5. Be Wary of Manual Reward Engineering. Keep manual reward engineering to an absolute minimum. If you cannot avoid it, be prepared to iterate relentlessly, because reward hacking is always lurking around the corner.

  6. Align Training and Evaluation Environments. Mismatched RL environments can be highly deceptive. If perfect alignment is not possible (e.g., due to different tool requirements), you must verify that the evaluation environment's scores behave as expected. Otherwise, you risk misdiagnosing the problem, attempting to fix a training issue when the root cause is a discrepancy in the evaluation setup, such as a tool that truncates outputs and artificially depresses scores.

  7. Build a Solid Foundation First. Avoid running ablation studies on data and algorithms until the environment and reward systems are stable. These foundational RL components must be thoroughly monitored and stress-tested to confirm their readiness for rigorous experimentation.

  8. When You Have the Compute, Use It for RL Scaling. With sufficient GPU resources, PPO-EWMA can be a better choice than strictly on-policy methods for pushing the performance envelope. To get the most out of it, scale up your batch size and group size to expand your RL compute, which will boost both learning efficiency and exploration (a sketch of the EWMA policy update follows this list).

  9. The RL Grokking Phenomenon is Real. The RL grokking phenomenon, where a model suddenly generalizes after a long period of stagnation, is real and deserves more attention. Over the past few months, I have seen it happen numerous times, and it seems particularly pronounced in on-policy experiments.

  10. Monitor Tool-Level Exploration. If your environment provides the agent with multiple tools or files, you have to track how they are being used; the tool-call tracker sketched after this list reports a usage share for exactly this reason. If an agent consistently underutilizes or ignores a critical tool, it is effectively tying one hand behind its back, capping its potential performance.

  11. Larger Models Learn Faster in RL Training. Avoid forcing RL to work on small models with clever workarounds. These techniques often fail to scale and become dead ends when transitioning to larger architectures. In my experience, RL training generalizes faster and more effectively on larger models.

  12. Trust the Process in Continual RL. In continual RL, do not be alarmed by low initial entropy. It is not always necessary to use techniques like clip-higher to force exploration. Provided the model receives high-quality data, it will often relax its own probability distribution, naturally increasing exploration and entropy over time (an entropy-monitoring sketch follows this list).
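
A Few Minimal Sketches

The sketches below are illustrations only, not the author's pipeline code: every function, class, and hyperparameter name is hypothetical, and PyTorch is assumed purely for convenience.

For lesson 1, one concrete way to watch training-inference mismatch is to compare the log-probabilities the inference engine reported for the sampled tokens against the log-probabilities the trainer recomputes for those same tokens. A persistent gap usually points to drift between the two stacks (precision, kernels, stale weights) rather than a modeling problem.

```python
import torch


def mismatch_metrics(trainer_logprobs: torch.Tensor,
                     sampler_logprobs: torch.Tensor,
                     mask: torch.Tensor) -> dict:
    """Compare per-token log-probs of the sampled tokens as computed by the
    trainer vs. the inference engine. `mask` marks generated (non-padding) tokens."""
    gap = (sampler_logprobs - trainer_logprobs) * mask
    n = mask.sum().clamp(min=1)
    # Importance ratio pi_trainer / pi_sampler on the sampled tokens; values far
    # from 1 are what destabilize PPO-style clipped updates.
    ratio = torch.exp(trainer_logprobs - sampler_logprobs)[mask.bool()]
    return {
        # Mean gap over sampled tokens: a crude estimator of KL(sampler || trainer).
        "kl_sampler_trainer_est": (gap.sum() / n).item(),
        "logprob_gap_max_abs": gap.abs().max().item(),
        "importance_ratio_mean": ratio.mean().item(),
        "importance_ratio_max": ratio.max().item(),
    }
```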
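
For lessons 3 and 10, a per-tool counter aggregated across rollouts covers both concerns at once: failure rate (environment health) and usage share (whether the agent is actually exploring its tools). A minimal sketch, assuming the rollout worker can report each call's tool name and a success flag:

```python
from collections import defaultdict


class ToolCallTracker:
    """Per-tool invocation and failure counters, aggregated across rollouts."""

    def __init__(self) -> None:
        self.calls: dict[str, int] = defaultdict(int)
        self.failures: dict[str, int] = defaultdict(int)

    def record(self, tool_name: str, ok: bool) -> None:
        self.calls[tool_name] += 1
        if not ok:
            self.failures[tool_name] += 1

    def report(self) -> dict:
        total = sum(self.calls.values()) or 1
        return {
            name: {
                # Lesson 10: a near-zero usage share for a critical tool means
                # part of the action space is not being explored.
                "usage_share": count / total,
                # Lesson 3: a rising failure rate is an environment bug,
                # not a modeling problem.
                "failure_rate": self.failures[name] / count,
            }
            for name, count in self.calls.items()
        }


# Hypothetical usage inside a rollout worker:
#   tracker.record("web_search", ok=response.status_code == 200)
```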
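
For lesson 8, the core of PPO-EWMA is to keep an exponential moving average of the policy weights and compute the clipped importance ratios against that slowly moving proximal policy, rather than against the snapshot from the last rollout. The decay value and the loop structure below are assumptions for illustration:

```python
import torch


@torch.no_grad()
def update_ewma_policy(policy: torch.nn.Module,
                       ewma_policy: torch.nn.Module,
                       beta: float = 0.99) -> None:
    """Move the EWMA (proximal) policy a small step toward the current policy."""
    for p_ewma, p in zip(ewma_policy.parameters(), policy.parameters()):
        p_ewma.mul_(beta).add_(p, alpha=1.0 - beta)


# Hypothetical training-loop usage:
#   ewma_policy = copy.deepcopy(policy)   # frozen copy used only for ratios
#   for update in range(num_updates):
#       ...optimize `policy` with clipped ratios pi_theta / pi_ewma...
#       update_ewma_policy(policy, ewma_policy, beta=0.99)
```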
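
For lesson 12, rather than reaching for a clip-higher style intervention immediately, it is often enough to log mean token entropy over continual-RL training and confirm that it recovers on its own. A minimal sketch, assuming access to the policy logits for the generated tokens:

```python
import torch
import torch.nn.functional as F


def mean_token_entropy(logits: torch.Tensor, mask: torch.Tensor) -> float:
    """Average per-token policy entropy over generated (non-padding) tokens.

    logits: [batch, seq, vocab]; mask: [batch, seq], 1.0 for generated tokens.
    """
    logp = F.log_softmax(logits.float(), dim=-1)
    entropy = -(logp.exp() * logp).sum(dim=-1)  # [batch, seq]
    return ((entropy * mask).sum() / mask.sum().clamp(min=1)).item()
```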


About This Article

Topic: Technology
Difficulty: Intermediate
Reading Time: 4 minutes
Last Updated: December 9, 2025
