
Kimi K2: A Trillion-Parameter Open-Source LLM

Explore Kimi K2, the 1.04T parameter open-source MoE model. Our deep dive covers its MuonClip optimizer, agentic AI training, and benchmark performance.
Ji Zhi Liu
37 min read
Tags: Kimi K2, MoE, LLM architecture, MuonClip optimizer, Agentic Intelligence, open-source LLM, reinforcement learning


The release of Kimi K2, a 1.04 trillion-parameter open-source Mixture-of-Experts (MoE) large language model, marks a significant milestone in the AI landscape. While its scale is notable, the true innovation lies in the engineering detailed in its accompanying technical report. The Kimi team has provided the AI community with a comprehensive look at the model's architecture and training methodology, offering a potential blueprint for future large-scale AI development and agentic intelligence.

Kimi K2's development showcases several advances in AI engineering. These include the MuonClip optimizer, which enabled a stable 15.5 trillion token training run without loss spikes; a large-scale synthetic data pipeline for training AI agents; and a sophisticated reinforcement learning framework that combines verifiable rewards with self-critique. This review will analyze the Kimi K2 technical report to unpack the innovations that position it as a noteworthy model in the field.

How powerful is this trillion-parameter open-source model? How might it reshape the AI landscape? And what does the future of the 'Agentic Intelligence' it champions look like? Let's examine the technical details.

Read the full technical report: https://github.com/MoonshotAI/Kimi-K2/blob/main/tech_report.pdf

What is Kimi K2? A New Architecture for Agentic AI

Artificial intelligence is shifting from static imitation learning toward Agentic Intelligence, a key concept for the next generation of AI. Agentic Intelligence is the ability for an AI to autonomously perceive, plan, reason, and interact with complex, dynamic environments. We are at a turning point where AI is evolving from a system that mimics human data into an active learner—an AI agent that can acquire new skills through interaction and, ultimately, may surpass human capabilities on the path to Artificial General Intelligence (AGI).

However, the road to agentic intelligence presents significant hurdles. During pre-training, high-quality data is becoming a scarce resource, making it critical to maximize the learning efficiency of every token. In the post-training phase, teaching the model complex skills like multi-step reasoning, long-term planning, and tool use—abilities rarely found in natural data—is even more challenging.

This is the challenge Kimi K2 was built to solve. It is a Mixture-of-Experts (MoE) large language model with 1.04 trillion total parameters and 32 billion activated parameters per token. Its core design goal is to tackle the central challenges of building agentic AI and redefine what is possible.

Kimi K2's Core Innovations: MuonClip, Data Synthesis, and RL

Kimi K2's breakthroughs span the entire model development lifecycle, from pre-training to post-training:

  • Innovative MuonClip Optimizer: The Kimi team developed a novel optimizer called MuonClip. It fuses the token-efficient Muon algorithm with a stability-enhancing technique called QK-Clip, solving the instability issues that can plague ultra-large-scale training. As a result, Kimi K2 achieved a stable training curve over 15.5 trillion tokens with zero loss spikes—a remarkable feat of engineering.
  • Large-Scale Agent Data Synthesis Pipeline: To teach the model how to use tools, the team built a powerful data synthesis system. This system simulates real-world environments to systematically generate massive, diverse, and high-quality data showing how to use tools to complete tasks. This pipeline is a key component behind Kimi K2's agent skills.
  • Universal Reinforcement Learning Framework: Kimi K2's post-training uses a joint reinforcement learning (RL) framework. It learns from tasks with definitive right-or-wrong answers (like code compilation and math) while also using a self-critique mechanism to improve on open-ended, subjective tasks (like creative writing). This allows for a holistic alignment and refinement of the model's capabilities.

Kimi K2 Performance: Benchmarks vs. GPT-4.1 & Claude 4

While qualitative descriptions are useful, benchmarks provide a clear performance narrative. A single chart from the technical report illustrates Kimi K2's lead across numerous evaluations.

Figure: Kimi K2 benchmark performance comparison vs GPT-4.1 and Claude Sonnet 4 across coding, agent, and reasoning tasks.

Across key benchmarks for agent, coding, and reasoning capabilities, Kimi K2 not only surpasses existing open-source models but, in many zero-shot evaluations, its performance approaches or exceeds top-tier closed-source models like Claude Sonnet 4.

Specifically, Kimi K2 has set new records for open-source models in multiple domains:

  • Agents & Tool Use: On Tau2-Bench and ACEBench, which test complex, multi-turn tool interactions, it scored 66.1 and 76.5, respectively, outperforming all competitors.
  • Software Engineering & Code: On SWE-Bench Verified, a challenging coding benchmark, Kimi K2 achieved 65.8. On the multilingual version, SWE-Bench Multilingual, it reached 47.3, closing the gap with the top-performing closed-source model, Claude Sonnet 4.
  • Math & Reasoning: In challenging reasoning tasks like AIME 2024 (69.6), AIME 2025 (49.5), and GPQA-Diamond (75.1), Kimi K2 demonstrated top-tier capabilities.
  • User Preference: On the LMSYS Arena, a blind chatbot benchmark driven by global user votes, Kimi K2 ranked first among all open-source models and fifth overall as of July 17, 2025, with over 3,000 user votes.

To accelerate progress in agentic AI, the Kimi team has open-sourced the full weights for both the Kimi K2 base model and the instruction-tuned model, a significant contribution to AI developers and researchers worldwide.

Kimi K2 Pre-training: Achieving Trillion-Parameter Stability

A large model's pre-training is its foundation. The story of Kimi K2's pre-training is one of innovation, tackling two of the biggest hurdles in large-scale training: data efficiency and training stability.

MuonClip Optimizer: The Key to Stable LLM Training

As model and data sizes increase, maximizing the value of every token has become paramount. The Kimi team opted for the Muon optimizer, known for its superior token efficiency compared to the traditional AdamW optimizer.

However, at Kimi K2's scale, Muon became prone to causing attention logits to explode, leading to loss spikes or training collapse. To address this, the Kimi team developed the QK-Clip technique.

What is QK-Clip? QK-Clip functions as an intelligent regulator. During each training step, it monitors the dot product values (logits) between the Query and Key vectors in the attention mechanism. If a value exceeds a predefined safety threshold (e.g., 100), QK-Clip scales down the weights (Wq, Wk) of the specific attention head causing the issue, pulling the logits back into a safe range.

Its design is notable for its:

  • Precision: It intervenes only with the few problematic attention heads, avoiding a broad approach that could disrupt learning.
  • Dynamic Activation: It is most active in the early stages of training when logits are unstable and automatically becomes dormant as training stabilizes.
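As a rough illustration, the per-head clipping rule can be sketched in a few lines of numpy. The interface and the batch-based logit estimate below are simplifications for exposition, not the report's actual implementation; only the threshold value (100) comes from the text above.

```python
import numpy as np

def qk_clip(Wq, Wk, X, tau=100.0):
    """Simplified per-head QK-Clip sketch (not the report's exact code).

    Wq, Wk: per-head query/key projection matrices (lists of d x d_h arrays).
    X: a batch of token activations (n x d) used to estimate logits.
    tau: safety threshold on the maximum attention logit.
    """
    for h in range(len(Wq)):
        q = X @ Wq[h]                    # queries for head h
        k = X @ Wk[h]                    # keys for head h
        s_max = np.abs(q @ k.T).max()    # largest attention logit for this head
        if s_max > tau:
            # logits are bilinear in (Wq, Wk), so scaling both by
            # sqrt(tau / s_max) pulls the max logit back down to tau
            gamma = np.sqrt(tau / s_max)
            Wq[h] = Wq[h] * gamma
            Wk[h] = Wk[h] * gamma
    return Wq, Wk
```

Note how the scaling touches only the offending head, matching the "precision" property above, and becomes a no-op once all logits sit below the threshold.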

The Kimi team combined the Muon optimizer with QK-Clip and other techniques to create the new and robust MuonClip optimizer.

Figure: Attention logits with and without QK-Clip, showing how MuonClip prevents training collapse.

Ultimately, with MuonClip, Kimi K2's entire pre-training process was exceptionally stable. The raw, unsmoothed training loss curve below shows no loss spikes—a rarity in trillion-parameter model training.

Figure: Kimi K2 training loss curve over 15.5 trillion tokens, with zero loss spikes.

Data Rephrasing: A Strategy for High-Quality Data Augmentation

High-quality human data is a finite resource. Simply repeating data (multi-epoch training) can lead to overfitting. The Kimi K2 team employed a more sophisticated approach: data rephrasing.

The core idea is to use a powerful teacher model to rewrite high-quality text from different perspectives while preserving its core meaning. This generates multiple new data points that are semantically consistent but stylistically diverse, reinforcing knowledge without encouraging rote memorization.

  • Knowledge Data Rephrasing: For knowledge-dense documents like Wikipedia articles, the team designed a chunked autoregressive rephrasing pipeline. It slices a long text, rephrases each chunk sequentially while maintaining context, and then reassembles them into a new article.

Figure: The autoregressive chunked rephrasing pipeline, which generates diverse training data while preserving semantic coherence.

Experiments showed that training once on 10 different rephrased versions of the data produced better results than training 10 times on the original data.

  • Math Data Rephrasing: To boost mathematical reasoning, the team rewrote high-quality math documents into a "study note" style and translated valuable math materials from other languages, significantly enriching the diversity of the math training data.

Thanks to this strategy, Kimi K2's 15.5 trillion token pre-training dataset—spanning web text, code, math, and knowledge—ensured that every token was used effectively.
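The chunked autoregressive rephrasing loop described above can be sketched as follows. Here `rephrase` is a hypothetical stand-in for a call to the teacher model; the report does not publish this code.

```python
def rephrase_document(chunks, rephrase, context_chunks=1):
    """Sketch of chunked autoregressive rephrasing.

    chunks: the source text split into pieces.
    rephrase(context, chunk) -> str: a stand-in for the teacher model,
    which rewrites `chunk` while staying coherent with `context`.
    """
    rewritten = []
    for chunk in chunks:
        # condition each rewrite on the most recently rewritten chunks
        # so the reassembled article stays globally coherent
        context = " ".join(rewritten[-context_chunks:])
        rewritten.append(rephrase(context, chunk))
    return " ".join(rewritten)
```

Running this N times with a sufficiently varied teacher yields N stylistically distinct versions of the same document, which is the augmentation compared against multi-epoch training above.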

Kimi K2 Architecture: A Sparse Mixture-of-Experts (MoE) Model

Kimi K2's scale is 1.04 trillion total parameters, yet it only activates 32.6 billion parameters during inference. This efficiency is achieved through its Mixture-of-Experts (MoE) architecture.
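In concrete terms, only about 3% of the parameters participate in any given forward pass. The figures are from the report; the arithmetic below is just illustration.

```python
total_params = 1.04e12    # 1.04T total parameters
active_params = 32.6e9    # 32.6B parameters activated per token
ratio = active_params / total_params
print(f"{ratio:.1%} of parameters active per token")  # prints "3.1% ..."
```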

Through extensive experiments, the Kimi team identified a Sparsity Scaling Law: while keeping the number of activated parameters (and thus computational cost) constant, model performance improves as the total number of experts increases (i.e., as the model becomes sparser).

Figure: The Sparsity Scaling Law: performance improves as the total expert count grows from 64 to 384 while the number of active experts is held at 8.

Guided by this finding, Kimi K2 was designed with an ultra-high sparsity of 384 experts, activating 8 of them during each forward pass. This strikes a balance between performance and engineering cost.
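A minimal sketch of this kind of top-k expert routing, using standard MoE gating (this is the generic technique, not Kimi K2's actual router code):

```python
import numpy as np

def route_top_k(router_logits, k=8):
    """Pick the top-k experts for one token and normalize their gates."""
    top = np.argpartition(router_logits, -k)[-k:]   # indices of the k largest scores
    g = np.exp(router_logits[top] - router_logits[top].max())
    return top, g / g.sum()                         # softmax over the selected experts

rng = np.random.default_rng(0)
logits = rng.normal(size=384)          # one token's scores over 384 experts
experts, gates = route_top_k(logits)   # only 8 expert FFNs run for this token
```

Because the other 376 experts are skipped entirely, compute per token stays fixed while total capacity grows with the expert count, which is exactly the trade-off the Sparsity Scaling Law exploits.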

Table 2: Kimi K2 vs. DeepSeek-V3 Architecture Comparison

| | DeepSeek-V3 | Kimi K2 | Change |
| --- | --- | --- | --- |
| # Layers | 61 | 61 | = |
| Total Parameters | 671B | 1.04T | ↑ 54% |
| Activated Parameters | 37B | 32.6B | ↓ 13% |
| Experts (total) | 256 | 384 | ↑ 50% |
| Experts Active per Token | 8 | 8 | = |
| Shared Experts | 1 | 1 | = |
| Attention Heads | 128 | 64 | ↓ 50% |
| Dense Layers | 3 | 1 | ↓ 67% |
| Expert Grouping | Yes | No | |

Kimi K2 significantly increases total parameters and the number of experts while optimizing inference overhead by reducing attention heads and other components.

Furthermore, to boost inference efficiency in long-context scenarios, Kimi K2 reduced the number of attention heads from DeepSeek-V3's 128 to 64. Experiments confirmed this change had a negligible impact on performance but significantly cut the computational cost of long-context inference—a critical optimization for agentic applications.

Training Infrastructure for a Trillion-Parameter Model

The Kimi team built a highly flexible and efficient training system on a cluster of NVIDIA H800 GPUs.

  • Flexible Parallelism Strategy: By combining Pipeline Parallelism (PP), Expert Parallelism (EP), and ZeRO Data Parallelism (DP), Kimi K2 can be trained on any number of nodes that is a multiple of 32, improving R&D agility.
  • Extreme Memory Optimization: To fit the massive model into limited GPU memory, the team used a suite of advanced techniques, including selective recomputation, FP8 activation storage, and CPU memory offloading.

Figure: Kimi K2 training system architecture, overlapping computation, communication, and CPU offloading for high GPU utilization.

This holistic innovation across the model, algorithms, data, and systems forged Kimi K2's powerful and stable base model.

Post-Training Kimi K2 for Advanced Agentic Intelligence

A powerful base model has raw potential. The post-training process refines that potential into real-world problem-solving proficiency. For Kimi K2, this process was focused on building a world-class AI agent.

Large-Scale Data Synthesis for AI Agent Training

A core capability of modern LLM agents is using unfamiliar tools to interact with the world. Generating massive amounts of high-quality training data for this is a challenge, as real-world experimentation is expensive and risky.

The Kimi team's solution was to build a simulated world that can generate high-quality training data at scale.

Figure: The tool-use data synthesis pipeline, generating agentic training data from 20,000+ real and synthetic tools.

This data synthesis pipeline has several key features:

  • Massive and Diverse Tool Library: The team collected over 3,000 real MCP (Model Context Protocol) tools from GitHub and used "domain evolution" techniques to synthesize over 20,000 virtual tools across fields like finance, software, and robotics.

    Figure: t-SNE visualizations of real MCP tools and synthetic tools, showing diverse clustering across domains. The dimensionality-reduction plots show that real and synthetic tools cover complementary areas, together forming a comprehensive and diverse tool space, ensuring the model can learn a wide range of tool-use capabilities.

  • Diverse Agents and Tasks: By creating different "personas" (system prompts) and assigning various tool combinations, the team generated thousands of agents with distinct capabilities. They then created tasks of varying complexity for each agent, complete with clear success criteria (Rubrics).

  • High-Fidelity Trajectory Generation: The pipeline includes a user simulator to mimic multi-turn conversations, a tool execution environment to simulate real feedback (including success, failure, and errors), and an LLM referee to evaluate trajectory quality.

  • Hybrid of Simulation and Reality: For high-fidelity tasks like software engineering, the team combined the simulated environment with real execution sandboxes. Code is run in a real development environment, and objective metrics like unit test pass rates provide direct feedback.
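Putting the pieces together, one rollout of the pipeline can be sketched like this. All three callables are hypothetical stand-ins for the agent, the simulated tool environment, and the LLM referee; none of these interfaces appear in the report.

```python
def generate_trajectory(agent_step, run_tool, judge, max_turns=8):
    """Sketch of one synthetic tool-use rollout.

    agent_step(history) -> ("tool", call) or ("final", answer)
    run_tool(call) -> simulated tool output (success, failure, or error)
    judge(history) -> bool, an LLM-referee stand-in that filters quality
    """
    history = []
    for _ in range(max_turns):
        kind, payload = agent_step(history)
        history.append((kind, payload))
        if kind == "final":
            break
        # feed the simulated tool result back to the agent
        history.append(("tool_result", run_tool(payload)))
    # only trajectories the referee accepts become training data
    return history if judge(history) else None
```

The referee filter at the end is what keeps the synthetic corpus high-quality: rejected rollouts are simply discarded rather than fine-tuned on.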

This hybrid data pipeline gave Kimi K2 a solid and generalizable foundation in tool use during the Supervised Fine-Tuning (SFT) phase.

Joint Reinforcement Learning (RL) with Self-Critique

If SFT is learning from a teacher, then reinforcement learning (RL) is practicing independently. Kimi K2's RL framework combines two types of tasks for unified training.

A. Verifiable Reward Tasks (RLVR)

For tasks with clear right or wrong answers, Kimi K2 trains in a "Verifiable Rewards Gym." This virtual environment is equipped with specialized modules:

  • Math, STEM, and Logic Problems: A massive collection of math competition problems and logic puzzles. The system can automatically verify the model's answer and provide a reward.
  • Complex Instruction Following: Tasks with complex constraints (e.g., "Write a four-line poem about the moon that includes the word 'frost' but not 'sadness'"). The system automatically checks if all constraints are met.
  • Code & Software Engineering: In a sandbox containing real GitHub issues, the model's generated code is run against unit tests. The pass rate serves as a direct reward signal.
  • Safety: An automated red-teaming pipeline continuously generates "jailbreak" prompts to challenge the model. If the model maintains its safety protocols, it receives a reward.
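For the poem example above, a verifiable reward can be as simple as a deterministic check. The function below is a hypothetical illustration of the idea, not the report's actual checker:

```python
def check_poem(poem: str) -> float:
    """Return a binary reward for the example constraints:
    exactly four lines, mentions 'frost', avoids 'sadness'."""
    lines = [ln for ln in poem.strip().splitlines() if ln.strip()]
    text = poem.lower()
    ok = len(lines) == 4 and "frost" in text and "sadness" not in text
    return 1.0 if ok else 0.0
```

Because the reward is computed mechanically, it cannot be gamed the way a learned reward model can, which is what makes these tasks "verifiable".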

B. Self-Critique Reward Tasks

For subjective tasks without a standard answer, such as creative writing, Kimi K2 uses a self-critique mechanism for reinforcement learning.

This process is called the Self-Critique Rubric Reward mechanism.

  1. K2 Actor Generates Answers: For an open-ended prompt, the K2 model (the "actor") generates several different responses.
  2. K2 Critic Scores the Answers: The K2 model (now in the "critic" role) uses a complex set of internal scoring criteria (a Rubric) to compare the responses and select the best one. This rubric includes general principles like clarity and relevance, as well as specific rules designed by human experts.
  3. Feedback Loop: The "better answer" selected by the critic is used as a positive signal to optimize the actor model.
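The three steps can be condensed into one training-loop iteration. The `actor` and `critic` callables below are hypothetical stand-ins for the two roles of the K2 model:

```python
def self_critique_step(prompt, actor, critic, k=4):
    """One iteration of the self-critique rubric reward loop (sketch).

    actor(prompt) -> str: generates one candidate response.
    critic(prompt, responses) -> int: index of the best response
    under the internal rubric (clarity, relevance, expert rules).
    """
    candidates = [actor(prompt) for _ in range(k)]
    best = critic(prompt, candidates)
    # the preferred response becomes the positive signal that
    # the actor's policy is optimized toward
    return candidates[best]
```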

The critic's ability is not static; it transfers the objective judgment skills learned from verifiable reward tasks to the evaluation of subjective tasks, making its evaluations increasingly reliable over time.

By combining these two task types and adding RL techniques like budget control (to prevent verbosity) and PTX loss (to avoid forgetting SFT data), Kimi K2 dramatically improved its problem-solving abilities in complex domains while maintaining its general capabilities.

Kimi K2 Benchmark Analysis: An In-Depth Performance Review

In the technical report, Kimi K2 was evaluated against today's strongest open-source and closed-source models. All evaluations were conducted using zero-shot prompting to test the models' raw, out-of-the-box intelligence.

Kimi-K2-Instruct vs. Closed-Source Models

This is the user-facing version of the model, so its performance is what matters most to end-users.

Table 3: Performance Comparison of Kimi-K2-Instruct with Top-Tier Models (Bold indicates the best overall, _underlined_ indicates the best among open-source models)

| Benchmark | Kimi-K2-Instruct | DeepSeek-V3-0324 | Qwen3-235B-A22B | Claude Sonnet 4 | Claude Opus 4 | GPT-4.1 | Gemini 2.5 Flash |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Coding Tasks | | | | | | | |
| LiveCodeBench v6 (Pass@1) | 53.7 | 46.9 | 37.0 | 48.5 | 47.4 | 44.7 | 44.7 |
| OJBench (Pass@1) | 27.1 | 24.0 | 11.3 | 15.3 | 19.6 | 19.5 | 19.5 |
| MultiPL-E (Pass@1) | 85.7 | 83.1 | 78.2 | 88.6 | 89.6 | 86.7 | 85.6 |
| SWE-bench Verified (Agentic Single Attempt) | 65.8 | 38.8 | 34.4 | 72.7* | 72.5* | 54.6 | — |
| SWE-bench Multilingual (Pass@1) | 47.3 | 25.8 | 20.9 | 51.0 | — | 31.5 | 14.0 |
| Tool Use Tasks | | | | | | | |
| Tau2-Bench (Overall Avg@4) | 66.1 | 48.8 | 37.3 | 61.2 | 66.3 | 56.0 | 41.2 |
| AceBench (Acc.) | 76.5 | 72.7 | 70.5 | 76.2 | 75.6 | 80.1 | 74.5 |
| Math & STEM Tasks | | | | | | | |
| AIME 2024 (Avg@64) | 69.6 | 59.4* | 40.1* | 43.4 | 48.2 | 46.5 | 61.3 |
| AIME 2025 (Avg@64) | 49.5 | 46.7 | 24.7* | 33.1* | 33.9* | 37.0 | 46.6 |
| GPQA-Diamond (Avg@8) | 75.1 | 68.4* | 62.9* | 70.0* | 74.9* | 66.3 | 68.2 |
| General Tasks | | | | | | | |
| MMLU (EM) | 89.5 | 89.4 | 87.0 | 91.5 | 92.9 | 90.4 | 90.1 |
| MMLU-Redux (EM) | 92.7 | 90.5 | 89.2* | 93.6 | 94.2 | 92.4 | 90.6 |
| IFEval (Prompt Strict) | 89.8 | 81.1 | 83.2* | 87.6 | 87.4 | 88.0 | 84.3 |

The key takeaways from this table are clear:

  • Unrivaled Code and Agent Capabilities: In tests for coding and tool use, Kimi K2 decisively outperforms all other open-source models. On real-world software engineering tasks like SWE-bench, it is competitive with the strongest closed-source models, showcasing its potential as an AI agent.
  • Leads the Open-Source Pack in Math and Reasoning: On highly challenging math and science benchmarks like AIME and GPQA-Diamond, Kimi K2 also achieves the best scores among open-source models, even surpassing some closed-source competitors.
  • Comprehensive Lead in General Capabilities: In general knowledge tests like MMLU, Kimi K2 is firmly in the top tier of open-source models. On IFEval, which measures instruction-following ability, it achieved the highest score of any model tested.

Kimi-K2-Base: SOTA Performance for Open-Source LLMs

The performance of the base model is a direct reflection of its pre-training quality. Here too, Kimi K2's base model demonstrates impressive raw capabilities.

Table 4: Performance Comparison of Kimi-K2-Base with Mainstream Open-Source Base Models

| Benchmark (Metric) | Kimi-K2-Base (32B / 1.04T) | DeepSeek-V3-Base (37B / 671B) | Llama 4 Maverick Base (17B / 400B) | Qwen2.5-72B-Base (Dense 72B) |
| --- | --- | --- | --- | --- |
| English General | | | | |
| MMLU | 87.79 | 87.10 | 84.87 | 86.08 |
| MMLU-pro | 69.17 | 60.59 | 63.47 | 62.80 |
| GPQA-Diamond | 50.51 | 49.43 | 48.11 | 40.78 |
| SimpleQA | 35.25 | 26.49 | 23.74 | 10.31 |
| Code | | | | |
| CRUXEval-I-cot | 74.00 | 62.75 | 67.13 | 61.12 |
| LiveCodeBench (v6) | 26.29 | 24.57 | 25.14 | 22.29 |
| EvalPlus | 80.33 | 65.61 | 65.48 | 66.04 |
| Math | | | | |
| MATH | 70.22 | 61.70 | 63.02 | 62.68 |
| GSM8k | 92.12 | 91.66 | 86.35 | 90.37 |
| Chinese | | | | |
| C-Eval | 92.50 | 90.04 | 80.91 | 90.86 |
| CMMLU | 90.90 | 88.84 | 81.24 | 90.55 |

The takeaway is clear: Kimi-K2-Base achieves State-of-the-Art (SOTA) performance on the vast majority of benchmarks across four major domains—English general knowledge, code, math, and Chinese. This is a powerful validation of its pre-training methodology and provides an unparalleled foundation for downstream fine-tuning and agent development.

Safety and Robustness Evaluation

The Kimi team also conducted rigorous red-teaming to evaluate the model's robustness against harmful content, privacy violations, and security threats. The results show that Kimi K2 exhibits good safety in most scenarios, with a particularly high pass rate against basic attacks. Like all large models, it still has room for improvement against complex, iterative "jailbreak" attacks.

Current Limitations and Future Outlook for Kimi K2

The technical report is also transparent about Kimi K2's current limitations:

  • The model can be overly verbose when tackling very difficult reasoning tasks or when tool definitions are unclear.
  • It sometimes attempts to use tools when it is not necessary, which can degrade performance.
  • When building complete software projects from scratch, its one-shot success rate is lower than when operating within an agentic framework.

The Kimi team stated they are actively working to address these issues and welcome community feedback to help the model continue to improve.

Conclusion: Kimi K2's Blueprint for Open-Source Agentic AI

The release of Kimi K2 represents a significant contribution to the open-source AI community. Its novelty lies not just in its scale but in its meticulously documented engineering, which provides a transparent blueprint for developing 'Agentic Intelligence.' From the MuonClip optimizer that ensures training stability to a sophisticated data synthesis and reinforcement learning framework, Kimi K2's architecture addresses key challenges in creating advanced AI agents.

Its benchmark performance, particularly in software engineering and tool use, validates this technical approach and demonstrates that open-source models can achieve parity with leading closed-source systems on the agentic frontier. By open-sourcing the model and its methodology, the Kimi team has provided a powerful foundation for researchers and developers, accelerating the exploration and deployment of agentic AI applications.

Kimi K2 is not an endpoint but a new baseline. It marks a step forward for open-source development, accelerating the collective journey toward more capable and general artificial intelligence.

Key Takeaways

• Kimi K2 features 1.04 trillion parameters, enhancing large language model capabilities.
• The model utilizes the MuonClip optimizer for improved training efficiency and performance.
• Kimi K2's open-source nature promotes collaboration and innovation within the AI community.

About This Article

Topic: Large Language Models
Difficulty: Intermediate
Reading Time: 37 minutes
Last Updated: November 15, 2025
