MoE Post-Training Guide: Load Balancing, Routing Replay, and Expert Parallelism

A practical guide to MoE post-training, covering the tradeoff between load balancing and task quality, why RL becomes unstable when routing changes across engines or policy versions, and how to choose EP versus ETP in large-scale deployments.
Qing Ke Ai
10 min read
#Mixture of Experts (MoE)#LLM post-training#Routing Replay#load balancing#expert parallelism#RL training

This article rebuilds the technical substance of the original Chinese post into a publish-ready English version, focusing on the MoE training mechanics themselves.

Original Chinese discussion: Zhihu
Source post: Qing Ke Ai WeChat article

MoE Post-Training: Why Sparse Models Become Harder After Pre-Training

Mixture-of-Experts architectures are now the mainstream choice for frontier large language models. They deliver high parameter capacity without paying dense-model compute on every token, which is why both open-source and commercial systems keep moving toward MoE designs.

But MoE gains do not make post-training easier. In supervised fine-tuning and reinforcement learning, the model is no longer just learning better outputs. It is also changing how tokens are routed across experts. That extra degree of freedom creates several distinct failure modes:

  • expert collapse when routing concentrates on a small subset of experts,
  • degraded task quality when load-balancing pressure becomes too strong,
  • RL instability when tiny numerical differences change the chosen experts,
  • deployment bottlenecks when expert placement no longer fits the available hardware topology.

If you want PPO and GRPO background first, start with LLM Reinforcement Learning (RL): REINFORCE, PPO, GRPO, and Production Engineering. If you want a deeper follow-up on one RL instability pattern, continue with Flexible Entropy Control in RLVR: Fixing Policy Entropy Collapse with Dynamic Clipping.

TL;DR

  • MoE post-training has to optimize both task quality and router behavior.
  • Load balancing is essential, but overly strong balancing can hurt the main objective.
  • aux_loss and loss-free balancing solve the same utilization problem with different tradeoffs.
  • In RL, MoE models amplify train-inference mismatch and policy lag because small score changes can switch experts completely.
  • Routing Replay is a practical stabilization tool: R2 mainly reduces policy lag, while R3 more aggressively reduces train-inference mismatch.
  • At the systems level, EP distributes experts across devices and ETP shards one expert across devices when an individual expert is too large.

1. Why MoE Post-Training Differs from Dense Models

An MoE layer combines two moving parts:

  • a set of experts, each with its own parameters,
  • a router or gating network that decides which experts a token should use.

That router implements conditional computation. Different tokens activate different experts, so only a subset of the full model participates in each forward pass.

This is exactly where post-training becomes tricky. In a dense model, optimization mainly changes how the same parameters respond. In an MoE model, optimization also changes which parameters get selected in the first place. If routing becomes skewed, some experts accumulate training opportunities while others become effectively idle.

The result is a familiar rich-get-richer loop:

  1. A few experts get selected more often.
  2. Those experts improve faster because they see more tokens.
  3. The router becomes even more likely to select them again.
  4. Less-used experts drift toward becoming "zombie experts" with little useful specialization.

That is why load balancing is not a cosmetic metric in MoE systems. It is a core training constraint.

The routing math behind that behavior is still the standard MoE formulation. For a token-level MoE layer, the module output is a weighted sum over the experts selected by the router:

$$y = \sum_{i=1}^{N} g_i(x)\, E_i(x)$$

where $E_i(x)$ is the output of expert $i$ and $g_i(x)$ is its gating weight, derived from router scores $s_i(x) = \operatorname{softmax}_i(W_g x)$.

In practice, the router keeps only the top-k experts for each token:

$$g_i(x) = \begin{cases} s_i(x), & s_i(x) \in \operatorname{TopK}\big(\{s_j(x)\}_{j=1}^{N},\, k\big) \\ 0, & \text{otherwise} \end{cases}$$

and renormalizes the surviving scores so the sparse gates sum to one:

$$\tilde{g}_i(x) = \frac{g_i(x)}{\sum_{j=1}^{N} g_j(x)}$$
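As a concrete illustration of top-k routing with renormalized gates, here is a minimal NumPy sketch. Everything in it (the function name `topk_moe_forward`, the toy linear experts, the shapes) is invented for the example; it mirrors the math above, not any production router:

```python
import numpy as np

def topk_moe_forward(x, W_gate, experts, k=2):
    """Sparse MoE forward for one token: softmax router scores,
    keep the top-k experts, renormalize their scores, mix outputs."""
    logits = x @ W_gate                        # one score per expert
    scores = np.exp(logits - logits.max())
    scores = scores / scores.sum()             # softmax router scores s_i
    topk = np.argsort(scores)[-k:]             # indices of the k largest scores
    gates = scores[topk] / scores[topk].sum()  # renormalized sparse gates
    # Weighted sum over only the selected experts
    y = sum(g * experts[i](x) for g, i in zip(gates, topk))
    return y, topk

rng = np.random.default_rng(0)
d, n_experts = 8, 4
x = rng.normal(size=d)
W_gate = rng.normal(size=(d, n_experts))
# Toy experts: independent linear maps standing in for expert FFNs
Ws = [rng.normal(size=(d, d)) for _ in range(n_experts)]
experts = [lambda t, W=W: t @ W for W in Ws]
y, chosen = topk_moe_forward(x, W_gate, experts, k=2)
print(y.shape, sorted(chosen.tolist()))
```

Only `k` of the `n_experts` expert functions ever run for this token, which is the conditional computation the section describes.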

2. The First Core Problem: Load Balancing Without Damaging the Main Objective

The source article highlights two balancing families that show up repeatedly in production-scale MoE work.

Aux-Loss Balancing

The classic approach, associated with Switch Transformer style training, adds an auxiliary loss that encourages the actual token allocation and the router's average preference to stay closer to a uniform distribution.

The upside is straightforward: the optimization target explicitly pushes tokens toward underused experts, which reduces collapse early in training.

The downside is just as important: the auxiliary term is not the main task. If you turn it up too aggressively, its gradient can distort the model's real learning objective. The source article's main practical point is that this is not a theoretical corner case. It is a day-to-day hyperparameter tradeoff.

In Switch-style training, that tradeoff is usually written explicitly as an auxiliary balancing objective:

$$\mathcal{L}_{\text{aux}} = \alpha \cdot N \cdot \sum_{i=1}^{N} f_i \, P_i$$

where one term tracks realized token assignment:

$$f_i = \frac{1}{T} \sum_{t=1}^{T} \mathbb{1}\{\text{token } x_t \text{ is routed to expert } i\}$$

and the other tracks the router's average preference:

$$P_i = \frac{1}{T} \sum_{t=1}^{T} s_i(x_t)$$

Here $N$ is the number of experts, $T$ is the number of tokens in the batch, and $\alpha$ is the balancing weight under discussion.

In plain terms:

  • smaller aux-loss weights often improve the main task loss,
  • smaller aux-loss weights often worsen expert imbalance,
  • larger aux-loss weights improve utilization but can hurt convergence quality.

The cited experiment suggests that a value around 0.001 gave a usable compromise in that setup. That should be treated as a starting point, not a universal constant. The defensible engineering takeaway is simpler: tune aux loss by comparing both quality metrics and expert-load metrics, not by optimizing one side in isolation.
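To make the two sides of the tradeoff concrete, here is a small NumPy sketch of the Switch-style auxiliary loss, computed from the realized token assignment and the router's average preference. The function name and the toy data are made up for illustration; the $\alpha = 0.001$ default echoes the value discussed above but is not a recommendation:

```python
import numpy as np

def switch_aux_loss(router_probs, expert_index, num_experts, alpha=0.001):
    """Switch-style auxiliary balancing loss.
    router_probs: (T, N) softmax router probabilities per token.
    expert_index: (T,) expert actually chosen for each token (top-1)."""
    T = router_probs.shape[0]
    # f_i: realized fraction of tokens dispatched to each expert
    f = np.bincount(expert_index, minlength=num_experts) / T
    # P_i: router's average predicted probability per expert
    P = router_probs.mean(axis=0)
    return alpha * num_experts * np.sum(f * P)

rng = np.random.default_rng(0)
T, N = 1024, 8
logits = rng.normal(size=(T, N))
probs = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)
balanced = switch_aux_loss(probs, probs.argmax(axis=1), N)

# A collapsed router: scores heavily concentrated on expert 0
logits_c = logits.copy()
logits_c[:, 0] += 5.0
probs_c = np.exp(logits_c) / np.exp(logits_c).sum(axis=1, keepdims=True)
collapsed = switch_aux_loss(probs_c, probs_c.argmax(axis=1), N)
print(balanced < collapsed)
```

The collapsed router pays a much larger penalty, which is exactly the pressure the auxiliary term applies, and why cranking $\alpha$ up lets that pressure compete with the main task gradient.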

Loss-Free Balancing

The second family, associated with newer DeepSeek-style systems, avoids adding a balancing term directly to the loss. Instead, it changes expert selection behavior using a dynamic bias or score adjustment.

That design matters because it tries to improve balance without injecting a competing gradient into the main learning objective. Conceptually:

  • expert selection for top-k routing uses a bias-adjusted score,
  • expert weighting for the actual output can still use the original gating score,
  • the bias is updated from recent load statistics.

The key DeepSeek-style loss-free balancing equation captures that separation between selection and weighting:

$$g_i(x) = \begin{cases} s_i(x), & s_i(x) + b_i \in \operatorname{TopK}\big(\{s_j(x) + b_j\}_{j=1}^{N},\, k\big) \\ 0, & \text{otherwise} \end{cases}$$

where $b_i$ is a per-expert bias updated from recent load statistics: overloaded experts have their bias nudged down, underloaded experts have it nudged up, so selection shifts without adding any gradient to the main loss.

This does not mean loss-free balancing is automatically superior. It means the tradeoff surface changes. You are no longer balancing utilization by paying for an extra optimization objective. You are balancing it through routing control.
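A toy NumPy sketch shows the mechanism, assuming a simple sign-based bias update (the function names, step size, and skewed-router setup are all invented for the example; real systems tune the update rule and schedule):

```python
import numpy as np

def biased_topk_select(scores, bias, k=2):
    """Loss-free balancing: select top-k by bias-adjusted score,
    but weight expert outputs by the original (renormalized) score."""
    selected = np.argsort(scores + bias)[-k:]
    gates = scores[selected] / scores[selected].sum()
    return selected, gates

def update_bias(bias, load_frac, target_frac, step=0.01):
    """Nudge each expert's bias opposite to its recent over/under-load."""
    return bias - step * np.sign(load_frac - target_frac)

rng = np.random.default_rng(0)
N, k = 8, 2
bias = np.zeros(N)
# Skewed router: expert 0 is systematically favored by the raw scores
for _ in range(20):
    load = np.zeros(N)
    for _ in range(100):
        raw = np.exp(rng.normal(size=N) + np.array([1.5] + [0.0] * (N - 1)))
        scores = raw / raw.sum()
        sel, _ = biased_topk_select(scores, bias, k)
        load[sel] += 1
    bias = update_bias(bias, load / load.sum(), 1.0 / N)
print(bias[0] < 0)  # the over-used expert accumulates a negative bias
```

Note that the bias only affects which experts are selected; the output weights still come from the original scores, so no balancing gradient flows into the task loss.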

3. The Second Core Problem: RL Instability Becomes More Severe in MoE Models

The article's strongest systems point is that reinforcement learning is unusually fragile once routing itself can change.

Even in dense-model RL, two instability sources already exist:

  1. Train-inference mismatch. Training and serving stacks do not always use identical kernels, precisions, or batching assumptions.
  2. Policy lag. Rollouts are generated with one policy snapshot, but optimization may update that policy several times before all mini-batches are consumed.

MoE models amplify both.

Why MoE Magnifies Numerical Drift

In a dense model, a small numerical difference may slightly perturb the logits. In an MoE model, a small score difference can flip the top-k routing decision and send a token to different experts entirely.

That means tiny engine differences can create disproportionately large behavioral differences:

  • different kernels may produce slightly different router scores,
  • those score changes can switch the chosen experts,
  • different experts then produce different hidden states,
  • the mismatch compounds across layers.

This is a stronger form of train-inference mismatch than ordinary floating-point noise.
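The flip is easy to demonstrate. In this contrived NumPy example, two engines produce router scores that agree to roughly 1e-4, yet the top-2 expert set changes because two scores are nearly tied (the numbers are fabricated purely to show the effect):

```python
import numpy as np

# "Training engine" scores vs. "inference engine" scores: a ~2e-4
# numerical difference on two near-tied experts.
scores_train = np.array([0.30, 0.2501, 0.25, 0.15, 0.0499])
scores_infer = scores_train + np.array([0.0, -0.0002, 0.0, 0.0, 0.0002])

top2_train = set(np.argsort(scores_train)[-2:].tolist())
top2_infer = set(np.argsort(scores_infer)[-2:].tolist())
print(top2_train, top2_infer, top2_train == top2_infer)
```

A dense model with the same perturbation would see its logits move by 2e-4; the MoE model instead runs a different expert, producing a wholly different hidden state that then compounds layer by layer.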

Why MoE Magnifies Policy Lag

Policy lag is also more damaging because a policy update changes more than the token distribution. It can also change which experts participate in generation and training. The article describes this as expert drift: parameters change, routing changes, and the two effects reinforce each other.

This is the reason MoE RL stabilization cannot be borrowed mechanically from dense-model recipes.

4. Routing Replay: The Practical Fix for MoE RL Stability

The article recommends Routing Replay as a concrete stabilization tool, with two variants.

R2: Vanilla Routing Replay

R2 replays the experts selected during rollout when gradients are computed later in training.

Its main target is policy lag. If the update step reuses the same expert choices that were active when the rollout was collected, the policy sees a more consistent training target and suffers less drift across mini-batch updates.

Use R2 when:

  • you are close to on-policy training,
  • off-policy distance is still relatively small,
  • you want lower bias without introducing the strongest replay constraint.

R3: Rollout Routing Replay

R3 goes further by replaying the routing selected by the inference or rollout engine inside training.

Its main target is train-inference mismatch. This matters when throughput-driven serving optimizations, kernel differences, or precision changes make rollout routing diverge from training routing.

Use R3 when:

  • the setup is more strongly off-policy,
  • stability matters more than training simplicity,
  • routing mismatch between engines is already visible in experiments.

The clean rule of thumb from the source article is:

  • smaller off-policy settings can often get away with MiniRL + R2,
  • larger off-policy settings more often need MiniRL + R3.
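Mechanically, both variants amount to overriding the router's fresh top-k with recorded expert indices at gradient time. This NumPy sketch is a guess at the minimal plumbing, not the published implementation; `moe_forward`, the `replay_experts` argument, and the toy routers are all invented for illustration:

```python
import numpy as np

def moe_forward(x, W_gate, experts, k=2, replay_experts=None):
    """MoE forward with optional routing replay: if replay_experts is
    given, reuse those expert indices instead of the router's own top-k."""
    logits = x @ W_gate
    scores = np.exp(logits - logits.max())
    scores /= scores.sum()
    if replay_experts is None:
        chosen = np.argsort(scores)[-k:]           # fresh routing
    else:
        chosen = np.asarray(replay_experts)        # replayed routing (R2/R3)
    gates = scores[chosen] / scores[chosen].sum()  # gates from current router
    y = sum(g * experts[i](x) for g, i in zip(gates, chosen))
    return y, chosen

rng = np.random.default_rng(0)
d, N = 8, 4
W_rollout = rng.normal(size=(d, N))                   # router at rollout time
W_train = W_rollout + 0.3 * rng.normal(size=(d, N))   # router after updates
Ws = [rng.normal(size=(d, d)) for _ in range(N)]
experts = [lambda t, W=W: t @ W for W in Ws]
x = rng.normal(size=d)

# Record routing during rollout, then reuse it during the training forward
_, rollout_routing = moe_forward(x, W_rollout, experts)
_, replayed = moe_forward(x, W_train, experts, replay_experts=rollout_routing)
print(sorted(replayed.tolist()) == sorted(rollout_routing.tolist()))
```

The difference between R2 and R3 in this framing is only where `replay_experts` comes from: the training-side snapshot that generated the rollout (R2) or the inference engine's own recorded routing (R3).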

5. The Third Core Problem: Expert Parallelism Is a Hardware Decision, Not Just a Model Decision

Algorithmic stability is only half the problem. Once an MoE model reaches deployment-scale training, hardware layout becomes another bottleneck.

The article points to two parallelism knobs that matter in Megatron-style stacks:

  • Expert Parallelism (EP) distributes different experts across different GPUs.
  • Expert Tensor Parallelism (ETP) shards the weights of one expert across multiple GPUs.

These knobs solve different problems.

Increase EP When You Have Too Many Experts

If the number of experts is large, the main issue is distribution. You need to place more experts across more devices so the system can host the full expert set without overloading each worker.

That is an EP problem.

Increase ETP When One Expert Is Too Large

If a single expert's hidden size is already too large for one device, distributing experts alone does not help. The issue is not the count of experts. The issue is the size of one expert's weights and activations.

That is an ETP problem.

A simple practical heuristic is:

  • if one expert does not fit comfortably, raise ETP;
  • if the model has too many experts overall, raise EP.
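That heuristic can be written down as a toy sizing function. This is purely illustrative arithmetic under stated assumptions (a 50% memory budget per device, a made-up cap of 8 experts per EP rank, power-of-two degrees); it is not a Megatron API and real sizing also has to account for activations, optimizer state, and communication cost:

```python
def choose_parallelism(num_experts, expert_params, bytes_per_param,
                       device_mem_bytes, mem_budget=0.5):
    """Toy EP/ETP sizing: raise ETP until one expert shard fits the
    device memory budget, then raise EP until the per-rank expert
    count is manageable (assumed cap: 8 experts per EP rank)."""
    budget = device_mem_bytes * mem_budget
    etp = 1
    while expert_params * bytes_per_param / etp > budget:
        etp *= 2                  # shard one expert across more devices
    ep = 1
    while num_experts / ep > 8:
        ep *= 2                   # spread experts across more devices
    return ep, etp

# Many small experts: 256 experts of 2B params each, bf16, 80 GB devices
ep_a, etp_a = choose_parallelism(256, 2e9, 2, 80e9)
# Few huge experts: 64 experts of 50B params each on the same hardware
ep_b, etp_b = choose_parallelism(64, 50e9, 2, 80e9)
print(ep_a, etp_a, ep_b, etp_b)
```

The first configuration is an EP problem (many experts, each fits a device), the second needs ETP because a single expert's weights exceed the per-device budget.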

This aligns with the broader systems trend in Separated Architectures for LLM RL Post-Training, where algorithm design and cluster topology have to be planned together rather than independently.

6. A Practical MoE Post-Training Checklist

If you are building or debugging an MoE post-training stack, the source article compresses into a straightforward operating checklist:

  1. Measure both task quality and expert-utilization quality. Do not tune only one.
  2. Compare aux_loss settings experimentally instead of inheriting them blindly from another model family.
  3. Consider loss-free balancing if auxiliary gradients are clearly damaging convergence.
  4. In RL, treat routing mismatch as a first-class failure mode rather than generic instability.
  5. Use R2 for lighter off-policy settings and R3 when replay stability needs to dominate.
  6. Size EP and ETP from actual hardware bottlenecks, not from architecture diagrams alone.

The article also points to MOE-Patch, a non-invasive monitoring tool for routing distributions, expert load, and token-drop behavior. That kind of instrumentation is useful because MoE failures are often hidden behind decent-looking aggregate loss curves.

Conclusion

MoE post-training is hard for a structural reason: routing is part of the learning problem. That creates three linked engineering challenges.

First, the system must keep experts sufficiently utilized without over-optimizing the balance objective. Second, RL must remain stable even when tiny numerical changes can switch the chosen experts. Third, the hardware layout has to match how experts are actually distributed and sharded.

That is why a good MoE training recipe looks less like a single algorithmic trick and more like a coordinated stack:

  • balanced routing,
  • controlled RL replay,
  • visibility into expert utilization,
  • hardware-aware expert placement.

For practitioners, that is the real message. MoE post-training is not just dense-model post-training with more parameters. It is a routing-and-systems problem from the first tuning run to the final cluster layout.

References

  1. Noam Shazeer et al. Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer
  2. William Fedus, Barret Zoph, and Noam Shazeer. Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity
  3. Jinhyuk Lee et al. When RLHF Meets MoE: A Case for Routing Replay
  4. NVIDIA. Megatron Bridge Parallelisms Documentation
  5. ModelScope Swift. Megatron-SWIFT Command Line Parameters
  6. direction-yxf. MOE-Patch
  7. Qing Ke Ai. WeChat source post
  8. Chiguodongbutuguodongpi. Original Chinese discussion on Zhihu

Last Updated: March 25, 2026