AI Architecture

Inside Ant Ling 2.5: Rebuilding Attention With MLA + Lightning Attention

How Ling 2.5 replaces part of GQA with a 1:7 MLA + Lightning Attention design to improve long-context throughput, reduce KV cache cost, and keep training quality stable.
Qingke AI
4 min read
#Ling 2.5#Linear Attention#MLA#Long Context#LLM Training

Large-model architecture design is ultimately an efficiency problem: use fewer theoretical FLOPs and fewer GPU hours to unlock better scaling behavior.
For Ling 2.5, the team focused on a practical question:

How do you keep quality while removing the long-context inference bottleneck created by Full Attention?

At normal context sizes (4K or 8K), compute is often dominated by MoE blocks.
But once context grows to 32K, 128K, or 256K+, attention quickly becomes the limiting factor.
In that regime, improving sparse MoE alone is not enough; attention itself must change.
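A back-of-the-envelope FLOPs comparison shows why: the attention score term grows quadratically in sequence length while the FFN term grows only linearly. The hidden and FFN sizes below are illustrative placeholders, not Ling's actual configuration.

```python
def layer_flops(seq_len, d_model=4096, d_ff_active=11008):
    """Rough multiply-add counts for one dense transformer layer."""
    attn_proj = 8 * seq_len * d_model**2        # Q, K, V, O projections
    attn_scores = 4 * seq_len**2 * d_model      # QK^T and scores @ V
    ffn = 4 * seq_len * d_model * d_ff_active   # up- and down-projection
    return attn_proj + attn_scores, ffn

for n in (8_192, 131_072, 262_144):
    attn, ffn = layer_flops(n)
    print(f"{n:>7} tokens: attention / FFN FLOPs = {attn / ffn:.1f}x")
```

At 8K the two terms are of the same order; by 128K and beyond, attention dominates by an order of magnitude, which is exactly the regime the article describes.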

Why Ling 2.5 Moves to Hybrid Linear Attention

There are two common routes to improve attention efficiency:

  • Sparsification: keeps Full Attention semantics, safer but with lower theoretical ceiling.
  • Linearization: changes compute characteristics completely, with much larger upside but harder engineering.

The Ling team explored both and invested more in linearization after prior work (Ling 2.0 + Ling Linear V2.0) showed four key findings:

  1. Pure linear attention underperforms Full Attention in scaling trends.
  2. A hybrid of linear and Full Attention can match or exceed Full Attention in normal windows.
  3. Hybrid linear attention does not inherently break long-context handling or multi-step reasoning.
  4. With careful implementation and post-training, hybrid attention can retain or improve quality.

Core Architecture Changes in Ling 2.5

1) Introduce Lightning Attention for Long-Sequence Throughput

A subset of the original GQA layers is replaced with Lightning Attention.
To support the switch, GQA dimensions are expanded toward MHA-style parameterization, the new parameters are initialized, and a warmup phase is used to stabilize the transition.

Result: significantly better throughput in very long-context prefill/decode workloads.
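As a rough sketch of why this helps: Lightning Attention belongs to the linear-attention family, whose causal form can be computed as a recurrence over a fixed-size state instead of an L×L score matrix. The toy version below uses a scalar decay; the real kernel is a fused, block-wise implementation, not this Python loop.

```python
import numpy as np

def linear_attention(q, k, v, decay=0.99):
    """Causal linear attention with exponential decay, step by step."""
    state = np.zeros((q.shape[1], v.shape[1]))     # running sum of k_i v_i^T
    out = np.empty_like(v)
    for t in range(q.shape[0]):
        state = decay * state + np.outer(k[t], v[t])  # O(d^2) state update
        out[t] = q[t] @ state                         # no L x L score matrix
    return out

rng = np.random.default_rng(0)
q, k, v = (rng.standard_normal((16, 8)) for _ in range(3))
print(linear_attention(q, k, v).shape)  # (16, 8)
```

The state has a fixed size, so compute is O(L) and memory does not grow with context length, which is where the prefill/decode throughput gain comes from.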

2) Integrate MLA for Aggressive KV Cache Compression

Inference performance is not just a matter of arithmetic throughput; memory pressure from the KV cache matters equally.
Compared with GQA, MLA provides much stronger KV compression.

Ablations at Ling 2.0 mini/flash scales showed that, after conversion and continued training, performance recovers quickly and can surpass the GQA baselines, so Ling 2.5 adopts MLA in the hybrid design.
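A quick cache-size comparison makes the motivation concrete. The head counts and the MLA latent width below are illustrative (DeepSeek-V2-style numbers, not Ling 2.5's published configuration):

```python
def kv_cache_gib(seq_len, elems_per_token, num_layers=32, dtype_bytes=2):
    """Total KV cache size in GiB for one sequence, fp16/bf16 storage."""
    return seq_len * num_layers * elems_per_token * dtype_bytes / 2**30

gqa_elems = 2 * 8 * 128   # K and V for 8 KV heads of head_dim 128
mla_elems = 512 + 64      # compressed KV latent + decoupled RoPE key dims

for n in (32_768, 262_144):
    print(f"{n:>7} tokens: GQA {kv_cache_gib(n, gqa_elems):.1f} GiB, "
          f"MLA {kv_cache_gib(n, mla_elems):.2f} GiB")
```

With these (assumed) sizes MLA caches roughly 3-4x fewer elements per token than GQA, and the gap compounds linearly with context length and batch size.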

3) Resolve Compatibility: QK Norm + Partial RoPE

Converting GQA/MHA to MLA in this codebase faced two concrete incompatibilities:

  • QK Norm nonlinearity blocks efficient KV absorption during inference.
  • Partial RoPE in Ling 2.0 differs from Full RoPE assumptions in prior conversion methods.

The team handled this with:

  • Parameter fusion of QK Norm into projection weights through calibration.
  • A partial-RoPE-aware decomposition pipeline: operate only on RoPE-related dimensions, then recombine.
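A minimal sketch of the first point, assuming QK Norm is an RMSNorm with a learned per-dimension gain: the gain folds into the projection weight exactly, while the data-dependent 1/RMS factor is replaced by its average over a calibration batch. `absorb_qk_norm` is a hypothetical helper name, not code from the Ling repository.

```python
import numpy as np

def absorb_qk_norm(W, gamma, calib_x, eps=1e-6):
    """Fold a post-projection RMSNorm into the projection weight W.

    gamma folds in exactly; the per-token 1/RMS factor is approximated
    by its mean over the calibration batch (the nonlinear part).
    """
    pre = calib_x @ W.T                            # pre-norm activations
    rms = np.sqrt((pre ** 2).mean(axis=-1) + eps)  # per-token RMS
    scale = (1.0 / rms).mean()                     # calibrated average
    return (gamma * scale)[:, None] * W            # diag(gamma * scale) @ W

rng = np.random.default_rng(0)
W = rng.standard_normal((256, 64)) / 8.0           # (d_out, d_in)
gamma = np.ones(256)                               # learned RMSNorm gain
calib_x = rng.standard_normal((128, 64))           # sampled calibration batch
W_fused = absorb_qk_norm(W, gamma, calib_x)        # norm-free projection
```

After fusion, `x @ W_fused.T` approximates `RMSNorm(x @ W.T) * gamma` with no normalization at inference time, which is what removes the obstacle to efficient KV absorption.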

Smooth Migration Training Strategy

To minimize quality loss during structural conversion, Ling 2.5 follows a staged migration:

Stage A: GQA -> Lightning Attention + GQA Hybrid

  • Expand linear_qkv by head dimension.
  • Initialize newly introduced gating parameters.
  • Keep QK Norm and Partial RoPE during the early transition for stability.

Stage B: Linear Warmup

  • Freeze most parameters except converted attention-critical parts.
  • Use LR warmup + limited continued training to quickly restore pre-conversion loss levels.
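In miniature, Stage B amounts to a trainability mask plus a linear LR warmup. The parameter names, step count, and peak LR below are illustrative, not Ling's actual settings:

```python
def stage_b_plan(param_names, trainable_prefixes=("attn.",)):
    """Map each parameter name to whether it stays trainable in Stage B."""
    return {name: name.startswith(trainable_prefixes) for name in param_names}

def warmup_lr(step, warmup_steps=500, peak_lr=2e-4):
    """Linear LR warmup toward a constant peak."""
    return peak_lr * min(1.0, step / warmup_steps)

plan = stage_b_plan(["attn.q_proj", "attn.kv_latent",
                     "moe.expert_0", "embed.tokens"])
print(plan)  # only the converted attention parts remain trainable
```

Restricting updates to the converted attention parameters keeps the rest of the network anchored while the new layers re-learn their role, which is why pre-conversion loss levels recover quickly.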

Stage C: GQA -> MLA Conversion

  • Remove QK Norm by absorbing it into q_proj / k_proj through sampled calibration.
  • Apply Partial-RoPE-compatible conversion.
  • Continue short warmup to recover the small temporary PPL increase.

The two key conversion equations from the original practice notes are shown below:

QK norm absorption equation 1

QK norm absorption equation 2
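The two equation images are not reproduced in this text-only copy. Under the assumption that QK Norm is an RMSNorm with learned gains and that calibration estimates an average pre-norm RMS, a plausible shape for the absorbed weights is the following (the notation is a reconstruction, not the original notes'):

```latex
% Hedged reconstruction: the gains \gamma fold in exactly; \bar{r} is the
% average pre-norm RMS estimated on the sampled calibration set.
W_q' = \operatorname{diag}(\gamma_q)\,\bar{r}_q^{-1}\,W_q
\qquad
W_k' = \operatorname{diag}(\gamma_k)\,\bar{r}_k^{-1}\,W_k
```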

Stage D: Full-Parameter Training

After stability is confirmed, unfreeze all parameters and continue full training at target scale.

Scaling Law Result: Why 1:7 Wins

With equal FLOPs constraints, the team compared multiple hybrid ratios of Linear Attention to Full Attention.
The observed trade-off:

  • 1:7 (group size M=8) gave the best quality/efficiency balance.
  • Smaller M (e.g., 2 or 4) had similar quality but much higher inference cost.
  • Larger M (e.g., 16) reduced inference cost further but degraded loss too much.

So Ling 2.5 settles on 1:7 as the operating point.
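Constructing such a schedule is straightforward. The sketch below places one MLA full-attention layer per group of eight; the layer-type names are made up here, and whether the full layer leads or trails each group is not specified in the source.

```python
def hybrid_schedule(num_layers, group_size=8):
    """One full-attention (MLA) layer per group; the rest are linear."""
    return ["mla_full" if (i + 1) % group_size == 0 else "lightning"
            for i in range(num_layers)]

print(hybrid_schedule(16))  # 14 lightning layers, 2 mla_full layers
```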

Data Switch Strategy During Continued Pretraining

Architecture migration and the data refresh happened together, so the timing of the switch to new data mattered.

Two plans were tested:

  • Conservative: recover on old data first, then switch.
  • Aggressive: switch to higher-quality new data earlier in full-parameter training.

The aggressive plan achieved better late-stage ceiling and faster recovery with fewer tokens, so it was selected.

Practical Takeaway

Ling 2.5 is an engineering-first architecture upgrade:

  • higher long-context throughput,
  • lower KV cache overhead,
  • preserved model quality after conversion.

It is especially relevant to agent-style workloads where deep reasoning, tool calls, and long execution chains inflate context length rapidly.

About This Article

Topic: AI Architecture
Difficulty: Intermediate
Reading Time: 4 minutes
Last Updated: March 2, 2026

