Over the past few months, I have been deeply involved in training Reinforcement Learning (RL) agents for scenarios ranging from search to data analysis. My work has spanned everything from small dense models to massive MoE architectures, in both single-source and mixed-data environments. This experience has provided a clear view of what succeeds and what fails in practical RL training. Here are some key lessons from the field:
- **RL Training Stability is Everything.** In any production-level scenario, stability is the absolute prerequisite for successful Reinforcement Learning. A solid RL pipeline should run like clockwork, training efficiently over long periods; that is the only way to achieve true scaling. Poor stability does not just slow you down; it burns through time and resources. I have spent a lot of time exploring this, running comparative experiments on training-inference mismatch, ppo-ewma, and other factors. Many of these findings have since become the default configurations in our agentic RL training pipelines.
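
  For illustration, one stability metric worth logging continuously is the gap between the log-probabilities returned by the rollout/inference engine and those recomputed by the training forward pass. The sketch below is a minimal, hypothetical example of how such a check might look; the input arrays and the k3-style KL estimate are assumptions, not a description of our exact pipeline.

  ```python
  import numpy as np

  def mismatch_stats(inference_logprobs: np.ndarray,
                     trainer_logprobs: np.ndarray) -> dict:
      """Compare per-token logprobs of the sampled tokens as seen by the
      inference engine vs. the trainer. A growing gap is an early warning
      of training-inference mismatch."""
      diff = trainer_logprobs - inference_logprobs
      ratio = np.exp(diff)  # per-token importance ratio
      return {
          "mean_abs_logprob_diff": float(np.abs(diff).mean()),
          "approx_kl": float((ratio - 1.0 - diff).mean()),  # k3 estimator, >= 0
          "max_ratio": float(ratio.max()),
      }

  # Log these every rollout batch; alert if they drift upward over time.
  print(mismatch_stats(np.array([-1.2, -0.7, -2.3]),
                       np.array([-1.1, -0.9, -2.3])))
  ```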
- **Agentic RL is Fundamentally Traditional RL.** While related to Reasoning RL, agentic RL shares more DNA with classic reinforcement learning. If we consider RL's three core components (environment, reward, and algorithm), their typical priority in Reasoning RL is algorithm > reward > environment. In practical agent applications, however, this hierarchy is inverted: environment > reward > algorithm.
- **Robust Tool-Handling in the RL Environment is Non-Negotiable.** An RL training environment that cannot handle tool calls reliably will undermine the entire process. It leads to significant experimental costs and limits the model's ultimate potential. Our RL training pipelines now constantly monitor the failure rates of all tool calls. Before beginning RL training for a new scenario, our first priority is to resolve any tool-calling issues to ensure the environment can manage large-scale, concurrent invocations.
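
  As a concrete illustration, a thin wrapper around tool execution that tracks per-tool failure rates is cheap to add. The class and method names below are hypothetical, not a specific framework's API.

  ```python
  import time
  from collections import defaultdict

  class ToolCallMonitor:
      """Track success/failure counts and total latency for every tool."""

      def __init__(self):
          self.calls = defaultdict(lambda: {"ok": 0, "failed": 0, "latency_s": 0.0})

      def run_tool(self, name, fn, *args, **kwargs):
          start = time.monotonic()
          try:
              result = fn(*args, **kwargs)
              self.calls[name]["ok"] += 1
              return result
          except Exception:
              self.calls[name]["failed"] += 1
              raise
          finally:
              self.calls[name]["latency_s"] += time.monotonic() - start

      def failure_rates(self):
          return {name: c["failed"] / max(1, c["ok"] + c["failed"])
                  for name, c in self.calls.items()}
  ```

  Pushing these failure rates to the same dashboard as the reward curves makes a flaky tool backend visible before it silently corrupts a run.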
- **LLM-as-Judge Can Cause Reward Hacking.** In agent scenarios that lack easily verifiable rewards (unlike math or code generation), using an LLM-as-judge creates a significant risk of reward hacking. For instance, I recall a month in which our test-set scores skyrocketed on three separate occasions. Each time, we later discovered the agent had not improved its core capability but had instead found a new exploit in the reward system.
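
  One mundane safeguard that helps here: keep a small frozen audit set with trusted labels, and whenever the judge-based score jumps abruptly, check the judge's agreement on that set before believing the new number. The threshold and helper names below are purely illustrative.

  ```python
  def suspicious_jump(score_history, threshold=0.05):
      """Flag a new score that beats every previous score by a wide margin."""
      if len(score_history) < 2:
          return False
      return score_history[-1] - max(score_history[:-1]) > threshold

  def judge_agreement(judge_labels, trusted_labels):
      """Agreement between the LLM judge and trusted labels on the audit set."""
      hits = sum(int(j == t) for j, t in zip(judge_labels, trusted_labels))
      return hits / len(trusted_labels)

  if suspicious_jump([0.41, 0.44, 0.43, 0.57]):
      # Low agreement usually means the policy found a judge exploit,
      # not a genuine capability gain.
      print("judge agreement:", judge_agreement([1, 1, 0, 1], [1, 0, 0, 1]))
  ```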
- **Be Wary of Manual Reward Engineering.** Keep manual reward engineering to an absolute minimum. If you cannot avoid it, be prepared to iterate relentlessly, because reward hacking is always lurking around the corner.
- **Align Training and Evaluation Environments.** Mismatched RL environments can be highly deceptive. If perfect alignment is not possible (e.g., due to different tool requirements), you must verify that the evaluation environment's scores behave as expected. Otherwise, you risk misdiagnosing the problem, attempting to fix a training issue when the root cause is a discrepancy in the evaluation setup, such as a tool that truncates outputs and artificially depresses scores.
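
  One low-effort safeguard is a parity probe: replay a fixed set of tool calls through both environments and diff the outputs. Everything below (the `probe_parity` helper, the `call` interface, the truncation heuristic) is an illustrative assumption rather than a description of an actual harness.

  ```python
  def probe_parity(train_env, eval_env, probes):
      """Replay identical tool calls in both environments and report mismatches.

      `probes` is a list of dicts like {"tool": "search", "args": {"query": "..."}};
      both env objects are assumed to expose a `call(tool, **args)` method.
      """
      issues = []
      for p in probes:
          out_train = str(train_env.call(p["tool"], **p["args"]))
          out_eval = str(eval_env.call(p["tool"], **p["args"]))
          if out_train != out_eval:
              note = f"{p['tool']}: outputs differ"
              # A much shorter eval output often means silent truncation.
              if len(out_eval) < 0.5 * len(out_train):
                  note += " (eval output much shorter; possible truncation)"
              issues.append(note)
      return issues
  ```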
- **Build a Solid Foundation First.** Avoid running ablation studies on data and algorithms until the environment and reward systems are stable. These foundational RL components must be thoroughly monitored and stress-tested to confirm their readiness for rigorous experimentation.
- **When You Have the Compute, Use It for RL Scaling.** With sufficient GPU resources, ppo-ewma can be a better choice than on-policy methods for pushing the performance envelope. To get the most out of it, scale up your batch size and group size to expand your RL compute, which will boost both learning efficiency and exploration.
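
  For reference, the core idea behind ppo-ewma is to keep an exponential moving average of the policy weights and use it as the proximal (anchor) policy in the importance ratio, which makes the update far less sensitive to how rollouts are batched. Below is a minimal PyTorch-style sketch under that assumption; the decay value is illustrative only.

  ```python
  import copy
  import torch

  def make_ewma_policy(policy: torch.nn.Module) -> torch.nn.Module:
      """Frozen copy of the policy that will hold the EWMA of its weights."""
      ewma = copy.deepcopy(policy)
      for p in ewma.parameters():
          p.requires_grad_(False)
      return ewma

  @torch.no_grad()
  def update_ewma(ewma, policy, beta=0.99):
      """After each optimizer step: ewma <- beta * ewma + (1 - beta) * policy."""
      for p_ewma, p in zip(ewma.parameters(), policy.parameters()):
          p_ewma.mul_(beta).add_(p, alpha=1.0 - beta)

  def importance_ratio(logp_new: torch.Tensor, logp_ewma: torch.Tensor) -> torch.Tensor:
      """PPO-style ratio computed against the EWMA policy instead of a fixed
      snapshot of the pre-update policy."""
      return torch.exp(logp_new - logp_ewma)
  ```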
- **The RL Grokking Phenomenon is Real.** RL grokking, where a model suddenly generalizes after a long period of stagnation, deserves more attention. Over the past few months, I have seen it happen numerous times, and it seems particularly pronounced in on-policy experiments.
- **Monitor Tool-Level Exploration.** If your environment provides the agent with multiple tools or files, you have to track how they are being used. If an agent consistently underutilizes or ignores a critical tool, it is effectively tying one hand behind its back, capping its potential performance.
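
  A simple way to keep an eye on this is to log the distribution of tool invocations per rollout batch. The sketch below is hypothetical, but it captures the kind of metric meant here: per-tool usage share plus the entropy of the tool distribution.

  ```python
  import math
  from collections import Counter

  def tool_usage_report(rollouts):
      """Summarize which tools the agent actually used across a batch.

      Each rollout is the ordered list of tool names it invoked.
      """
      counts = Counter(name for rollout in rollouts for name in rollout)
      total = sum(counts.values()) or 1
      shares = {name: n / total for name, n in counts.items()}
      # Entropy near zero means the agent has collapsed onto one tool.
      entropy = -sum(s * math.log(s) for s in shares.values() if s > 0)
      return {"shares": shares, "tool_entropy": entropy}

  print(tool_usage_report([["search", "read_file"], ["search"], ["search"]]))
  ```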
- **Larger Models Learn Faster in RL Training.** Avoid forcing RL to work on small models with clever workarounds. These techniques often fail to scale and become dead ends when transitioning to larger architectures. In my experience, RL training generalizes faster and more effectively on larger models.
- **Trust the Process in Continual RL.** In continual RL, do not be alarmed by low initial entropy. It is not always necessary to use techniques like clip-higher to force exploration. Provided the model receives high-quality data, it will often relax its own probability distribution, naturally increasing exploration and entropy over time.
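
  A cheap way to verify that this relaxation is actually happening is to track mean token entropy over training rather than intervening immediately. A minimal sketch, assuming per-token logits from the training forward pass and a response-token mask.

  ```python
  import torch

  def mean_token_entropy(logits: torch.Tensor, mask: torch.Tensor) -> float:
      """Average policy entropy over response tokens.

      logits: [batch, seq_len, vocab]; mask: [batch, seq_len], 1 on response
      tokens. A slow, steady rise from a low starting value is the healthy
      pattern described above.
      """
      logp = torch.log_softmax(logits.float(), dim=-1)
      entropy = -(logp.exp() * logp).sum(dim=-1)        # [batch, seq_len]
      return float((entropy * mask).sum() / mask.sum().clamp(min=1))
  ```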