The Engine Swap: Inside TensorRT-LLM’s Pivot to a PyTorch-First Architecture
In late September 2025, we marked a significant milestone: the official 1.0 release of TensorRT-LLM. This version formalized our PyTorch-based backend as the default, production-ready workflow and established clear backward-compatibility guarantees for its LLM API. The journey to this point, however, was a multi-year effort defined by a fundamental re-architecture—a pivot from a custom C++-centric design to a more flexible, PyTorch-native approach. This is the inside story of that transformation, the lessons we learned, and our vision for the future of high-performance LLM inference.
The project’s first public release, version 0.5.0, arrived in October 2023. At the time, we were operating in a "GitHub-second" model, with development occurring primarily in a private NVIDIA repository. Then, on March 18, 2025, we switched to a "GitHub-first" model. This shift has dramatically improved the speed and quality of our collaboration with the open-source community.
With version 1.0 now released, it feels like the perfect time to reflect on the project's journey and share our thoughts on what lies ahead for LLM inference optimization.
The Origins of TensorRT-LLM: Merging FasterTransformer and TensorRT
TensorRT-LLM began in September 2021 with an ambitious goal: merge two powerful inference optimization efforts, FasterTransformer and TensorRT, to tackle the unique challenges of LLM inference. As a new engineer at NVIDIA, I was handed this monumental task.
As I began sketching out the technical approach, we explored several paths for this new LLM inference solution. The most obvious was to modify the TensorRT core to natively support multi-node, multi-GPU configurations. We also considered using graph compilation to handle critical components like the attention kernel, operator fusion, and the complex logic of the KV Cache.
Early Challenges with ONNX and C++ Runtimes
Traditionally, TensorRT’s strength lies in ingesting an ONNX model to unlock its suite of optimizations. However, the flexibility of modern LLM architectures and the dynamic nature of PyTorch made a stable ONNX export path a moving target. This made the dominant ONNX-first workflow a significant hurdle for LLM inference optimization.
On the backend, core LLM concepts like the stateful KV Cache clash with the Static Single Assignment (SSA) semantics that classical compilers rely on. Combined with dynamic batching and scheduling logic, this made building a single, unified graph compiler for the whole serving problem dramatically harder.
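To make that tension concrete before describing where we landed, here is a minimal sketch, in plain PyTorch with made-up shapes and names, of the in-place KV Cache update a serving runtime performs on every decode step. The cache is a mutable buffer that outlives any single forward pass, which is exactly what SSA-style graph capture struggles to express:

```python
import torch

# Illustrative sizes only.
NUM_LAYERS, NUM_HEADS, HEAD_DIM, MAX_SEQ_LEN = 2, 4, 64, 128

# A persistent, mutable KV Cache: one (K, V) buffer pair per layer.
# It lives across forward passes, unlike an SSA value that is defined exactly once.
kv_cache = [
    (torch.zeros(1, NUM_HEADS, MAX_SEQ_LEN, HEAD_DIM),
     torch.zeros(1, NUM_HEADS, MAX_SEQ_LEN, HEAD_DIM))
    for _ in range(NUM_LAYERS)
]

def append_kv(layer_idx: int, k_new: torch.Tensor, v_new: torch.Tensor, pos: int):
    """Write this step's key/value for one layer, then return the visible prefix.

    The in-place writes below mutate state that persists between calls,
    which a stateless, SSA-based graph has no direct way to model.
    """
    k_buf, v_buf = kv_cache[layer_idx]
    k_buf[:, :, pos, :] = k_new  # in-place mutation of persistent state
    v_buf[:, :, pos, :] = v_new
    return k_buf[:, :, : pos + 1, :], v_buf[:, :, : pos + 1, :]
```

In a real server this state also has to be grown, evicted, and shared across concurrent requests (for example with paged blocks), which pushes the logic even further outside what a static graph can capture.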
This led us to a different approach: use TensorRT as a powerful engine for the model's forward pass, extended with custom plugins for attention and multi-node communication. For everything else—like KV Cache management and batch scheduling—we would build a separate C++ runtime. This design sidestepped the ONNX requirement by introducing a new frontend modeling API that mirrored PyTorch's operator semantics. The trade-off was that users had to redefine their models using this new API, but it gave them precise control over the generated TensorRT graph.
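To illustrate that division of labor, the sketch below shows a deliberately simplified serving loop in Python. It is hypothetical pseudocode rather than TensorRT-LLM's actual C++ runtime: the `engine` and `kv_manager` objects stand in for the compiled forward pass and the KV Cache manager, and their methods are invented for this example.

```python
from dataclasses import dataclass, field

@dataclass
class Request:
    prompt_ids: list[int]
    output_ids: list[int] = field(default_factory=list)
    max_new_tokens: int = 64

class ToyRuntime:
    """Hypothetical serving loop: the engine only runs the forward pass;
    batching, KV Cache bookkeeping, and stopping logic all live out here."""

    def __init__(self, engine, kv_manager, eos_id: int):
        self.engine = engine          # compiled forward pass (e.g., a TensorRT engine)
        self.kv_manager = kv_manager  # allocates and frees per-request cache blocks
        self.eos_id = eos_id
        self.active = []              # requests currently being served

    def add_request(self, req: Request):
        self.kv_manager.allocate(req)
        self.active.append(req)

    def step(self):
        if not self.active:
            return
        # 1. The scheduler picks which requests run this iteration (in-flight batching).
        batch = self.active[: self.kv_manager.capacity()]
        # 2. The engine executes one batched forward pass and returns next tokens.
        next_tokens = self.engine.forward(batch, self.kv_manager)
        # 3. The runtime applies the results and retires finished requests.
        for req, token in zip(batch, next_tokens):
            req.output_ids.append(token)
            if token == self.eos_id or len(req.output_ids) >= req.max_new_tokens:
                self.kv_manager.free(req)
                self.active.remove(req)
```

The point of the split is that scheduling and memory-management policies can evolve independently of the compiled forward pass.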
After successfully validating the approach on GPT-2 and GPT-3, the core design received the green light in July 2022.
Rapid Growth and the Limits of a C++ Architecture
I call this period the "TensorRT-based software architecture" phase. After we validated the core design, the launch of ChatGPT in late 2022 made LLM inference acceleration more critical than ever.
In 2023, the TensorRT-LLM project kicked into high gear, with engineers from teams across the globe joining the effort in a "swarm" model. The upside was rapid scaling, but it was a double-edged sword. Without meticulous coordination, this approach led to software inconsistencies, creating a "stitched-together" impression.
At the time, my hands-off management style encouraged rapid iteration but didn't enforce strict software principles, for which I take responsibility. The lessons learned were invaluable. Despite the growing pains, the project's velocity skyrocketed. We shipped several releases to Early Access customers, powered our first MLPerf submission, and launched our first public version, 0.5.0, in October 2023.
The Pivot to a PyTorch Backend for LLM Inference
After the 0.5.0 release, user feedback was clear: TensorRT-LLM's performance on NVIDIA GPUs was unmatched, but its usability was a major pain point. The barrier to entry was simply too high.
From late 2023 to mid-2024, we worked to address this by adding new features and revamping the user workflow. A key moment came during our first implementation of the Medusa algorithm. The initial version, built with PyTorch operators, was remarkably fast to develop compared to our C++ runtime implementations. This was a lightbulb moment, hinting that the high development cost of our C++-centric design was unsustainable.
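To give a sense of why the PyTorch path was so much faster to iterate on, the snippet below sketches the core of a Medusa-style draft: a handful of extra heads, each predicting a token a few positions ahead of the base model. This is an illustration of the published Medusa idea with placeholder names and sizes, not the TensorRT-LLM implementation, but it shows how little code a first prototype needs when you can stay in ordinary PyTorch modules.

```python
import torch
import torch.nn as nn

class ResBlock(nn.Module):
    """One residual block of a Medusa head: x + SiLU(W x)."""
    def __init__(self, hidden_size: int):
        super().__init__()
        self.linear = nn.Linear(hidden_size, hidden_size)
        self.act = nn.SiLU()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x + self.act(self.linear(x))

class MedusaHeads(nn.Module):
    """K extra heads; head k drafts the token at position t + k + 1
    from the base model's last hidden state at position t."""
    def __init__(self, hidden_size: int, vocab_size: int, num_heads: int = 4):
        super().__init__()
        self.heads = nn.ModuleList([
            nn.Sequential(ResBlock(hidden_size),
                          nn.Linear(hidden_size, vocab_size, bias=False))
            for _ in range(num_heads)
        ])

    def forward(self, hidden_state: torch.Tensor) -> list[torch.Tensor]:
        # One set of draft logits per head; a separate verification pass then
        # accepts or rejects the drafted tokens against the base model.
        return [head(hidden_state) for head in self.heads]
```

Porting the same logic to the TensorRT-based flow meant re-expressing it in the graph-building API and extending the C++ runtime, which is where the development-cost gap came from.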
Addressing Usability and Development Velocity
Despite improvements like the trtllm-build command and a new LLM API, complaints about usability grew. The root causes of these challenges were:
- Decentralized Development: Early design inconsistencies became problematic as the project's complexity grew.
- Ecosystem Friction: The separation between the TensorRT and PyTorch ecosystems created a "conversion tax" that frustrated users and slowed our own development.
- Slow Iteration: Prototyping a new algorithm like speculative decoding in PyTorch and then porting it to our TensorRT-based flow was a massive effort that killed iteration speed.
These issues were my responsibility, and they became the primary catalyst for our architectural pivot.
Re-architecting TensorRT-LLM: A PyTorch-First Approach
In late 2024, we initiated a major re-architecture effort centered on PyTorch. The goal was to resolve usability and development velocity issues while improving performance. This was the classic challenge of "changing the engine while the plane is in flight."
I focused heavily on establishing clear design principles. The user-facing entry point was unified through the LLM API to abstract away the backend implementation. For the backend, the runtime would be based on our modularized C++ implementation exposed via Python bindings. Our high-performance custom kernels would be wrapped to serve as both TensorRT plugins and PyTorch custom ops.
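As a rough picture of what that unified entry point looks like to a user, here is a minimal sketch that follows the shape of the documented LLM API quickstart; the model name, sampling parameters, and output fields are illustrative and may differ from the exact current API.

```python
from tensorrt_llm import LLM, SamplingParams

# One entry point regardless of backend: the LLM object hides engine
# construction, kernel selection, and runtime wiring behind a single call.
llm = LLM(model="TinyLlama/TinyLlama-1.1B-Chat-v1.0")

prompts = ["The fastest way to serve an LLM is"]
sampling = SamplingParams(temperature=0.8, top_p=0.95, max_tokens=64)

for output in llm.generate(prompts, sampling):
    print(output.outputs[0].text)
```

The same few lines are meant to work whichever backend executes the forward pass, which is what lets the implementation evolve without breaking user code.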
By late January 2025, we quietly released the PyTorch backend as an experimental feature, already including support for the upcoming Blackwell hardware architecture to streamline future migrations.
Validating the New Architecture with DeepSeek R1
Just before the 2025 Lunar New Year, the DeepSeek team released their R1 model. We faced a choice: support it on our mature TensorRT architecture or bet on the new PyTorch backend. I decided we would go all-in on the PyTorch architecture. My reasoning was that the conversion, debugging, and extension complexities of the old flow would be magnified on a complex model like DeepSeek R1.
This bet paid off. The team delivered results faster, and the work was featured at GTC. From that point on, nearly all new model support and optimization work for high-performance LLM inference was done on the PyTorch architecture.
TensorRT-LLM 1.0: The PyTorch Backend Goes Production
After GTC 2025, we continued to refine the architecture, addressing tech debt accrued during our rapid development. For foundational software like TensorRT-LLM, continuous refactoring is essential.
By September 2025, the PyTorch-based architecture had been battle-tested, powering key model optimizations, the MLPerf Inference v5.1 submission, and performance showcases on GB200. With its success proven, we officially launched the 1.0 release on September 24th—a historic milestone for the project.
Advanced LLM Optimization Techniques
Much of our recent work on advanced features is detailed in our technical blogs, covering topics from expert parallelism to new speculative decoding methods:
- Expert Parallelism
- Disaggregated Serving
- Speculative Decoding
- Advanced Features
Future of High-Performance LLM Inference with TensorRT-LLM
The fields of AI systems and inference technology are still brimming with exciting challenges. Making AI a true General-Purpose Technology requires a relentless drive to lower costs and improve efficiency. It is a privilege to contribute to that progress.
The journey of TensorRT-LLM underscores a critical principle: the most performant architecture is useless if it impedes innovation. By embracing the PyTorch ecosystem our users live in, we not only solved our own development bottlenecks but also built a stronger, more sustainable foundation for the future of AI inference.
References
- TensorRT-LLM 1.0 Release
- TensorRT-LLM API Documentation
- TensorRT-LLM Version 0.5.0 Release
- GitHub-First Development Model
- NVIDIA FasterTransformer Repository
- NVIDIA TensorRT for LLM Inference
- torch-TensorRT Repository
- LLM Inference Benchmarking with trtllm-bench
- Experimental PyTorch Backend Release
- Technical Blog: Optimizing DeepSeek-R1 Latency on B200 GPUs
- Technical Blog: DeepSeek R1 MTP Implementation and Optimization
- Technical Blog: Optimizing DeepSeek R1 Throughput on Blackwell GPUs
- TensorRT-LLM Software Architecture Roadmap
Key Takeaways
• TensorRT-LLM has officially transitioned to a PyTorch-first architecture to improve usability and development velocity while preserving performance.
• The 1.0 release ensures backward compatibility for existing LLM API users.
• The shift reduces the friction between prototyping in PyTorch and deploying optimized LLM inference on NVIDIA GPUs.