Large-model architecture design is ultimately an efficiency problem: use fewer theoretical FLOPs and fewer GPU hours to unlock better scaling behavior.
For Ling 2.5, the team focused on a practical question:
How do you keep quality while removing the long-context inference bottleneck created by Full Attention?
At normal context sizes (4K or 8K), compute is often dominated by MoE blocks.
But once context grows to 32K, 128K, or 256K+, attention quickly becomes the limiting factor.
In that regime, improving sparse MoE alone is not enough; attention itself must change.
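A back-of-the-envelope FLOPs comparison makes the crossover concrete. The sketch below uses hypothetical dimensions (model width, FFN width, active expert count), not Ling's actual configuration:

```python
# Rough per-layer FLOPs, illustrating why attention dominates at long context.
# All sizes are hypothetical, not Ling 2.5's actual dimensions.

def attention_flops(seq_len: int, d_model: int) -> int:
    # Score matrix (QK^T) plus value aggregation: quadratic in sequence length.
    return 4 * seq_len * seq_len * d_model

def moe_flops(seq_len: int, d_model: int, d_ff: int, active_experts: int) -> int:
    # Each token passes through `active_experts` FFNs: linear in sequence length.
    return 4 * seq_len * d_model * d_ff * active_experts

d_model, d_ff, k = 4096, 8192, 2
for L in (4096, 32768, 262144):
    att, moe = attention_flops(L, d_model), moe_flops(L, d_model, d_ff, k)
    print(L, round(att / moe, 2))  # ratio grows linearly with context length
```

At 4K context the MoE blocks dominate (ratio below 1), but by 32K the attention term has already overtaken them, and at 256K it dwarfs them.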
Why Ling 2.5 Moves to Hybrid Linear Attention
There are two common routes to improve attention efficiency:
- Sparsification: keeps Full Attention semantics; safer, but with a lower theoretical ceiling.
- Linearization: changes the compute characteristics entirely; far larger upside, but harder engineering.
The Ling team explored both and invested more in linearization after prior work (Ling 2.0 + Ling Linear V2.0) showed four key findings:
- Pure linear attention underperforms Full Attention in scaling trends.
- A hybrid of linear and Full Attention can match or exceed Full Attention in normal windows.
- Hybrid linear attention does not inherently break long-context handling or multi-step reasoning.
- With careful implementation and post-training, hybrid attention can retain or improve quality.
Core Architecture Changes in Ling 2.5
1) Introduce Lightning Attention for Long-Sequence Throughput
Some of the original GQA layers are replaced with Lightning Attention.
To support the switch, GQA dimensions are expanded toward MHA-style parameterization, the new parameters are initialized, and a warmup phase stabilizes the transition.
Result: significantly better throughput in very long-context prefill/decode workloads.
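The throughput gain comes from the linear-attention recurrence, which replaces a KV cache that grows with sequence length by a fixed-size state. A minimal decode-time sketch follows; this is generic linear attention with a decay factor, not Lightning Attention's actual tiled kernel, and the head dimension is made up:

```python
import numpy as np

# Minimal sketch of the linear-attention recurrence that makes per-token
# decode cost O(1) instead of O(L). Generic linear attention, not the exact
# Lightning Attention kernel; `d` is a hypothetical head dimension.

d = 8
rng = np.random.default_rng(0)

state = np.zeros((d, d))  # running sum of outer(k_t, v_t); replaces the KV cache
norm = np.zeros(d)        # running sum of k_t, used for normalization

def decode_step(q, k, v, state, norm, decay=0.99):
    # `decay` stands in for the fixed forgetting factor used by such kernels.
    state = decay * state + np.outer(k, v)
    norm = decay * norm + k
    out = q @ state / (q @ norm + 1e-6)
    return out, state, norm

for _ in range(16):
    q, k, v = rng.random(d), rng.random(d), rng.random(d)
    out, state, norm = decode_step(q, k, v, state, norm)

print(out.shape, state.shape)  # state size is independent of sequence length
```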
2) Integrate MLA for Aggressive KV Cache Compression
Inference performance is not just arithmetic throughput; memory pressure from the KV cache matters equally.
Compared with GQA, MLA provides much stronger KV compression.
Ablations on Ling 2.0 mini/flash scales showed that after conversion and continued training, performance recovers quickly and can surpass GQA baselines, so Ling 2.5 adopts MLA in the hybrid design.
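A rough per-token, per-layer memory estimate illustrates the compression gap. The dimensions below are hypothetical, not Ling 2.5's published configuration:

```python
# Rough KV-cache bytes per token per layer, to illustrate why MLA compresses
# harder than GQA. All dimensions are illustrative, not Ling 2.5's configs.

def gqa_kv_per_token(n_kv_heads: int, head_dim: int, bytes_per_elem: int = 2) -> int:
    # GQA caches full K and V vectors for every KV head.
    return 2 * n_kv_heads * head_dim * bytes_per_elem

def mla_kv_per_token(latent_dim: int, rope_dim: int, bytes_per_elem: int = 2) -> int:
    # MLA caches one shared compressed latent plus a small decoupled-RoPE key.
    return (latent_dim + rope_dim) * bytes_per_elem

gqa = gqa_kv_per_token(n_kv_heads=8, head_dim=128)   # 4096 bytes/token/layer
mla = mla_kv_per_token(latent_dim=512, rope_dim=64)  # 1152 bytes/token/layer
print(gqa, mla, round(gqa / mla, 2))
```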
3) Resolve Compatibility: QK Norm + Partial RoPE
Converting GQA/MHA to MLA in this codebase faced two concrete incompatibilities:
- QK Norm nonlinearity blocks efficient KV absorption during inference.
- Partial RoPE in Ling 2.0 differs from Full RoPE assumptions in prior conversion methods.
The team handled this with:
- Parameter fusion of QK Norm into projection weights through calibration.
- A partial-RoPE-aware decomposition pipeline: operate only on RoPE-related dimensions, then recombine.
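The fusion step can be sketched as follows. Because RMSNorm divides by a per-sample statistic, exact absorption into the projection weights is impossible; a calibration-based approximation replaces each sample's RMS with its average over sampled activations. All names, shapes, and values here are illustrative:

```python
import numpy as np

# Sketch of folding a QK RMSNorm into the preceding projection via calibration.
# The per-sample RMS is replaced by its average over a calibration set, turning
# the nonlinear norm into a fixed column rescaling of the weight matrix.

rng = np.random.default_rng(0)
d = 16
W_q = rng.standard_normal((d, d)) * 0.1   # hypothetical q projection
gamma = rng.random(d) + 0.5               # hypothetical RMSNorm scale

def rmsnorm(x, gamma, eps=1e-6):
    return x / np.sqrt(np.mean(x**2, axis=-1, keepdims=True) + eps) * gamma

# Calibration: estimate the average RMS of q = x @ W_q over sampled activations.
calib_x = rng.standard_normal((1024, d))
avg_rms = np.sqrt(np.mean((calib_x @ W_q) ** 2))

# Fused weight: rescale columns so x @ W_fused ~= rmsnorm(x @ W_q, gamma).
W_fused = W_q * (gamma / avg_rms)

test_x = rng.standard_normal((4, d))
exact = rmsnorm(test_x @ W_q, gamma)
fused = test_x @ W_fused
print(np.abs(exact - fused).max())  # small but nonzero approximation error
```

The residual error is exactly the per-sample deviation of the RMS from its calibration average, which is why a short warmup after conversion is still needed.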
Smooth Migration Training Strategy
To minimize quality loss during structural conversion, Ling 2.5 follows a staged migration:
Stage A: GQA -> Lightning Attention + GQA Hybrid
- Expand linear_qkv by head dimension.
- Initialize newly introduced gating parameters.
- Keep QK Norm and Partial RoPE during the early transition for stability.
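One common way to expand GQA toward MHA-style parameterization is to replicate each KV head across its query group, which gives every query head its own KV projection while preserving the layer's function at initialization. Whether Ling 2.5 initializes exactly this way is an assumption; the sketch below illustrates the idea with toy dimensions:

```python
import numpy as np

# Sketch of widening a GQA key projection toward MHA-style parameterization
# by replicating each KV head across its query group. Toy dimensions only.

n_q_heads, n_kv_heads, head_dim, d_model = 8, 2, 4, 16
group = n_q_heads // n_kv_heads  # query heads sharing each KV head

rng = np.random.default_rng(0)
W_k = rng.standard_normal((d_model, n_kv_heads * head_dim))

# Replicate each KV head `group` times so there is one KV head per query head;
# replicated heads compute identical keys, so outputs are unchanged at init.
W_k_heads = W_k.reshape(d_model, n_kv_heads, head_dim)
W_k_mha = np.repeat(W_k_heads, group, axis=1).reshape(d_model, n_q_heads * head_dim)

print(W_k.shape, W_k_mha.shape)  # (16, 8) -> (16, 32)
```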
Stage B: Linear Warmup
- Freeze most parameters except converted attention-critical parts.
- Use LR warmup + limited continued training to quickly restore pre-conversion loss levels.
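A minimal sketch of the Stage B recipe, with hypothetical parameter names and made-up schedule values:

```python
# Sketch of Stage B: selective freezing plus linear LR warmup. The parameter
# name patterns, step counts, and peak LR are all illustrative assumptions.

ATTENTION_KEYS = ("lightning", "q_proj", "k_proj", "v_proj")  # hypothetical names

def trainable(param_name: str) -> bool:
    # Freeze everything except the converted attention-critical parts.
    return any(key in param_name for key in ATTENTION_KEYS)

def warmup_lr(step: int, warmup_steps: int = 500, peak_lr: float = 2e-5) -> float:
    # Linear ramp to the peak LR, then hold constant.
    return peak_lr * min(1.0, (step + 1) / warmup_steps)

params = ["layer0.lightning.gate", "layer0.moe.expert3.w1", "layer1.q_proj.weight"]
print([p for p in params if trainable(p)])
```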
Stage C: GQA -> MLA Conversion
- Remove QK Norm by absorbing it into q_proj/k_proj through sampled calibration.
- Apply Partial-RoPE-compatible conversion.
- Continue a short warmup to recover from the small temporary PPL increase.
Stage D: Full-Parameter Training
After stability is confirmed, unfreeze all parameters and continue full training at target scale.
Scaling Law Result: Why 1:7 Wins
Under equal-FLOPs constraints, the team compared multiple hybrid ratios of Full Attention to Linear Attention layers.
The observed trade-off:
- 1:7 (one Full Attention layer per seven Linear Attention layers, group size M=8) gave the best quality/efficiency balance.
- Smaller M (e.g., 2 or 4) had similar quality but much higher inference cost.
- Larger M (e.g., 16) reduced inference cost further but degraded loss too much.
So Ling 2.5 settles on 1:7 as the operating point.
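Assuming the 1:7 ratio means one Full Attention layer per group of eight (seven Linear Attention layers, then one Full Attention layer), the layer layout can be sketched as:

```python
# Sketch of a hybrid stack layout at group size M=8: seven linear-attention
# layers followed by one full-attention layer per group. The exact placement
# within each group and the total layer count are illustrative assumptions.

def hybrid_layer_pattern(n_layers: int, group_size: int = 8) -> list[str]:
    # The last layer of every group keeps full (softmax) attention.
    return [
        "full" if (i + 1) % group_size == 0 else "linear"
        for i in range(n_layers)
    ]

pattern = hybrid_layer_pattern(32)
print(pattern.count("linear"), pattern.count("full"))  # 28 4
```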
Data Switch Strategy During Continued Pretraining
Architecture migration and the data refresh happened together, so the timing of the switch to new data mattered.
Two plans were tested:
- Conservative: recover on old data first, then switch.
- Aggressive: switch to higher-quality new data earlier in full-parameter training.
The aggressive plan achieved better late-stage ceiling and faster recovery with fewer tokens, so it was selected.
Practical Takeaway
Ling 2.5 is an engineering-first architecture upgrade:
- higher long-context throughput,
- lower KV cache overhead,
- preserved model quality after conversion.
It is especially relevant to agent-style workloads where deep reasoning, tool calls, and long execution chains inflate context length rapidly.