Large-model architecture design is ultimately an efficiency problem: use fewer theoretical FLOPs and fewer GPU hours to unlock better scaling behavior.
For Ling 2.5, the team focused on a practical question:
How do you keep quality while removing the long-context inference bottleneck created by Full Attention?
At normal context sizes (4K or 8K), compute is often dominated by MoE blocks.
But once context grows to 32K, 128K, or 256K+, attention quickly becomes the limiting factor.
In that regime, improving sparse MoE alone is not enough; attention itself must change.
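A back-of-the-envelope FLOPs comparison makes the crossover concrete. The sketch below uses hypothetical dimensions (model width, FFN width, active expert count), not Ling's actual configuration:

```python
# Rough per-layer FLOPs, illustrating why attention dominates at long context.
# All sizes are hypothetical, not Ling 2.5's actual dimensions.

def attention_flops(seq_len: int, d_model: int) -> int:
    # Score matrix (QK^T) plus value aggregation: quadratic in sequence length.
    return 4 * seq_len * seq_len * d_model

def moe_flops(seq_len: int, d_model: int, d_ff: int, active_experts: int) -> int:
    # Each token passes through `active_experts` FFNs: linear in sequence length.
    return 4 * seq_len * d_model * d_ff * active_experts

d_model, d_ff, k = 4096, 8192, 2
for L in (4096, 32768, 262144):
    att, moe = attention_flops(L, d_model), moe_flops(L, d_model, d_ff, k)
    print(L, round(att / moe, 2))  # ratio grows linearly with context length
```

At 4K context the MoE blocks dominate (ratio below 1), but by 32K the attention term has already overtaken them, and at 256K it dwarfs them.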
Why Ling 2.5 Moves to Hybrid Linear Attention
There are two common routes to improve attention efficiency:
- Sparsification: keeps Full Attention semantics; safer, but with a lower theoretical ceiling.
- Linearization: changes the compute characteristics entirely; far larger upside, but harder engineering.
The Ling team explored both and invested more in linearization after prior work (Ling 2.0 + Ling Linear V2.0) showed four key findings:
- Pure linear attention underperforms Full Attention in scaling trends.
- A hybrid of linear and Full Attention can match or exceed Full Attention in normal windows.
- Hybrid linear attention does not inherently break long-context handling or multi-step reasoning.
- With careful implementation and post-training, hybrid attention can retain or improve quality.
Core Architecture Changes in Ling 2.5
1) Introduce Lightning Attention for Long-Sequence Throughput
Some of the original GQA layers are replaced with Lightning Attention.
To support the switch, GQA dimensions are expanded toward MHA-style parameterization, the new parameters are initialized, and a warmup phase stabilizes the transition.
Result: significantly better throughput in very long-context prefill/decode workloads.
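The throughput gain comes from the linear-attention recurrence, which replaces a KV cache that grows with sequence length by a fixed-size state. A minimal decode-time sketch follows; this is generic linear attention with a decay factor, not Lightning Attention's actual tiled kernel, and the head dimension is made up:

```python
import numpy as np

# Minimal sketch of the linear-attention recurrence that makes per-token
# decode cost O(1) instead of O(L). Generic linear attention, not the exact
# Lightning Attention kernel; `d` is a hypothetical head dimension.

d = 8
rng = np.random.default_rng(0)

state = np.zeros((d, d))  # running sum of outer(k_t, v_t); replaces the KV cache
norm = np.zeros(d)        # running sum of k_t, used for normalization

def decode_step(q, k, v, state, norm, decay=0.99):
    # `decay` stands in for the fixed forgetting factor used by such kernels.
    state = decay * state + np.outer(k, v)
    norm = decay * norm + k
    out = q @ state / (q @ norm + 1e-6)
    return out, state, norm

for _ in range(16):
    q, k, v = rng.random(d), rng.random(d), rng.random(d)
    out, state, norm = decode_step(q, k, v, state, norm)

print(out.shape, state.shape)  # state size is independent of sequence length
```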
2) Integrate MLA for Aggressive KV Cache Compression
Inference performance is not just arithmetic throughput; memory pressure from the KV cache matters equally.
Compared with GQA, MLA provides much stronger KV compression.
Ablations on Ling 2.0 mini/flash scales showed that after conversion and continued training, performance recovers quickly and can surpass GQA baselines, so Ling 2.5 adopts MLA in the hybrid design.
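A rough per-token, per-layer memory estimate illustrates the compression gap. The dimensions below are hypothetical, not Ling 2.5's published configuration:

```python
# Rough KV-cache bytes per token per layer, to illustrate why MLA compresses
# harder than GQA. All dimensions are illustrative, not Ling 2.5's configs.

def gqa_kv_per_token(n_kv_heads: int, head_dim: int, bytes_per_elem: int = 2) -> int:
    # GQA caches full K and V vectors for every KV head.
    return 2 * n_kv_heads * head_dim * bytes_per_elem

def mla_kv_per_token(latent_dim: int, rope_dim: int, bytes_per_elem: int = 2) -> int:
    # MLA caches one shared compressed latent plus a small decoupled-RoPE key.
    return (latent_dim + rope_dim) * bytes_per_elem

gqa = gqa_kv_per_token(n_kv_heads=8, head_dim=128)   # 4096 bytes/token/layer
mla = mla_kv_per_token(latent_dim=512, rope_dim=64)  # 1152 bytes/token/layer
print(gqa, mla, round(gqa / mla, 2))
```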
3) Resolve Compatibility: QK Norm + Partial RoPE
Converting GQA/MHA to MLA in this codebase faced two concrete incompatibilities:
- QK Norm nonlinearity blocks efficient KV absorption during inference.
- Partial RoPE in Ling 2.0 differs from Full RoPE assumptions in prior conversion methods.
The team handled this with:
- Parameter fusion of QK Norm into projection weights through calibration.
- A partial-RoPE-aware decomposition pipeline: operate only on RoPE-related dimensions, then recombine.
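The fusion step can be sketched as follows. Because RMSNorm divides by a per-sample statistic, exact absorption into the projection weights is impossible; a calibration-based approximation replaces each sample's RMS with its average over sampled activations. All names, shapes, and values here are illustrative:

```python
import numpy as np

# Sketch of folding a QK RMSNorm into the preceding projection via calibration.
# The per-sample RMS is replaced by its average over a calibration set, turning
# the nonlinear norm into a fixed column rescaling of the weight matrix.

rng = np.random.default_rng(0)
d = 16
W_q = rng.standard_normal((d, d)) * 0.1   # hypothetical q projection
gamma = rng.random(d) + 0.5               # hypothetical RMSNorm scale

def rmsnorm(x, gamma, eps=1e-6):
    return x / np.sqrt(np.mean(x**2, axis=-1, keepdims=True) + eps) * gamma

# Calibration: estimate the average RMS of q = x @ W_q over sampled activations.
calib_x = rng.standard_normal((1024, d))
avg_rms = np.sqrt(np.mean((calib_x @ W_q) ** 2))

# Fused weight: rescale columns so x @ W_fused ~= rmsnorm(x @ W_q, gamma).
W_fused = W_q * (gamma / avg_rms)

test_x = rng.standard_normal((4, d))
exact = rmsnorm(test_x @ W_q, gamma)
fused = test_x @ W_fused
print(np.abs(exact - fused).max())  # small but nonzero approximation error
```

The residual error is exactly the per-sample deviation of the RMS from its calibration average, which is why a short warmup after conversion is still needed.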
Smooth Migration Training Strategy
To minimize quality loss during structural conversion, Ling 2.5 follows a staged migration:
Stage A: GQA -> Lightning Attention + GQA Hybrid
- Expand linear_qkv by head dimension.
- Initialize newly introduced gating parameters.
- Keep QK Norm and Partial RoPE during the early transition for stability.
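One common way to expand GQA toward MHA-style parameterization is to replicate each KV head across its query group, which gives every query head its own KV projection while preserving the layer's function at initialization. Whether Ling 2.5 initializes exactly this way is an assumption; the sketch below illustrates the idea with toy dimensions:

```python
import numpy as np

# Sketch of widening a GQA key projection toward MHA-style parameterization
# by replicating each KV head across its query group. Toy dimensions only.

n_q_heads, n_kv_heads, head_dim, d_model = 8, 2, 4, 16
group = n_q_heads // n_kv_heads  # query heads sharing each KV head

rng = np.random.default_rng(0)
W_k = rng.standard_normal((d_model, n_kv_heads * head_dim))

# Replicate each KV head `group` times so there is one KV head per query head;
# replicated heads compute identical keys, so outputs are unchanged at init.
W_k_heads = W_k.reshape(d_model, n_kv_heads, head_dim)
W_k_mha = np.repeat(W_k_heads, group, axis=1).reshape(d_model, n_q_heads * head_dim)

print(W_k.shape, W_k_mha.shape)  # (16, 8) -> (16, 32)
```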
Stage B: Linear Warmup
- Freeze most parameters except converted attention-critical parts.
- Use LR warmup + limited continued training to quickly restore pre-conversion loss levels.
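A minimal sketch of the Stage B recipe, with hypothetical parameter names and made-up schedule values:

```python
# Sketch of Stage B: selective freezing plus linear LR warmup. The parameter
# name patterns, step counts, and peak LR are all illustrative assumptions.

ATTENTION_KEYS = ("lightning", "q_proj", "k_proj", "v_proj")  # hypothetical names

def trainable(param_name: str) -> bool:
    # Freeze everything except the converted attention-critical parts.
    return any(key in param_name for key in ATTENTION_KEYS)

def warmup_lr(step: int, warmup_steps: int = 500, peak_lr: float = 2e-5) -> float:
    # Linear ramp to the peak LR, then hold constant.
    return peak_lr * min(1.0, (step + 1) / warmup_steps)

params = ["layer0.lightning.gate", "layer0.moe.expert3.w1", "layer1.q_proj.weight"]
print([p for p in params if trainable(p)])
```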
Stage C: GQA -> MLA Conversion
- Remove QK Norm by absorbing it into q_proj/k_proj through sampled calibration.
- Apply Partial-RoPE-compatible conversion.
- Continue a short warmup to recover from the small temporary PPL increase.
Stage D: Full-Parameter Training
After stability is confirmed, unfreeze all parameters and continue full training at target scale.
Scaling Law Result: Why 1:7 Wins
Under equal-FLOPs constraints, the team compared multiple hybrid ratios of Full Attention to Linear Attention layers.
The observed trade-off:
- 1:7 (one Full Attention layer per seven Linear Attention layers, group size M=8) gave the best quality/efficiency balance.
- Smaller M (e.g., 2 or 4) had similar quality but much higher inference cost.
- Larger M (e.g., 16) reduced inference cost further but degraded loss too much.
So Ling 2.5 settles on 1:7 as the operating point.
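Assuming the 1:7 ratio means one Full Attention layer per group of eight (seven Linear Attention layers, then one Full Attention layer), the layer layout can be sketched as:

```python
# Sketch of a hybrid stack layout at group size M=8: seven linear-attention
# layers followed by one full-attention layer per group. The exact placement
# within each group and the total layer count are illustrative assumptions.

def hybrid_layer_pattern(n_layers: int, group_size: int = 8) -> list[str]:
    # The last layer of every group keeps full (softmax) attention.
    return [
        "full" if (i + 1) % group_size == 0 else "linear"
        for i in range(n_layers)
    ]

pattern = hybrid_layer_pattern(32)
print(pattern.count("linear"), pattern.count("full"))  # 28 4
```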
Data Switch Strategy During Continued Pretraining
Architecture migration and the data refresh happened together, so the timing of the switch to new data mattered.
Two plans were tested:
- Conservative: recover on old data first, then switch.
- Aggressive: switch to higher-quality new data earlier in full-parameter training.
The aggressive plan achieved better late-stage ceiling and faster recovery with fewer tokens, so it was selected.
Practical Takeaway
Ling 2.5 is an engineering-first architecture upgrade:
- higher long-context throughput,
- lower KV cache overhead,
- preserved model quality after conversion.
It is especially relevant to agent-style workloads where deep reasoning, tool calls, and long execution chains inflate context length rapidly.