The Engine Swap: Inside TensorRT-LLM’s Pivot to a PyTorch-First Architecture
In late September 2025, we marked a significant milestone: the official 1.0 release of TensorRT-LLM. This version formalized our PyTorch-based backend as the default, production-ready workflow and established clear backward-compatibility guarantees for its LLM API. The journey to this point, however, was a multi-year effort defined by a fundamental re-architecture—a pivot from a custom C++-centric design to a more flexible, PyTorch-native approach. This is the inside story of that transformation, the lessons we learned, and our vision for the future of high-performance LLM inference.
The project’s first public release, version 0.5.0, arrived in October 2023. At the time, we were operating in a "GitHub-second" model, with development occurring primarily in a private NVIDIA repository. Then, on March 18, 2025, we switched to a "GitHub-first" model. This shift has dramatically improved the speed and quality of our collaboration with the open-source community.
With version 1.0 now released, it feels like the perfect time to reflect on the project's journey and share our thoughts on what lies ahead for LLM inference optimization.
The Origins of TensorRT-LLM: Merging FasterTransformer and TensorRT
TensorRT-LLM began in September 2021 with an ambitious goal: merge two powerful inference optimization efforts, FasterTransformer and TensorRT, to tackle the unique challenges of LLM inference. As a new engineer at NVIDIA, I was handed this monumental task.
As I began sketching out the technical approach, we explored several paths for this new LLM inference solution. The most obvious was to modify the TensorRT core to natively support multi-node, multi-GPU configurations. We also considered using graph compilation to handle critical components like the attention kernel, operator fusion, and the complex logic of the KV Cache.
Early Challenges with ONNX and C++ Runtimes
Traditionally, TensorRT’s strength lies in ingesting an ONNX model to unlock its suite of optimizations. However, the flexibility of modern LLM architectures and the dynamic nature of PyTorch made a stable ONNX export path a moving target. This made the dominant ONNX-first workflow a significant hurdle for LLM inference optimization.
On the backend, core LLM concepts like the stateful KV Cache clash with the Static Single Assignment (SSA) semantics that classical compilers rely on. Combined with dynamic batching and scheduling logic, this made building a single, unified graph compiler for the whole serving problem dramatically harder.
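To make that tension concrete before describing where we landed, here is a minimal sketch, in plain PyTorch with made-up shapes and names, of the in-place KV Cache update a serving runtime performs on every decode step. The cache is a mutable buffer that outlives any single forward pass, which is exactly what SSA-style graph capture struggles to express:

```python
import torch

# Illustrative sizes only.
NUM_LAYERS, NUM_HEADS, HEAD_DIM, MAX_SEQ_LEN = 2, 4, 64, 128

# A persistent, mutable KV Cache: one (K, V) buffer pair per layer.
# It lives across forward passes, unlike an SSA value that is defined exactly once.
kv_cache = [
    (torch.zeros(1, NUM_HEADS, MAX_SEQ_LEN, HEAD_DIM),
     torch.zeros(1, NUM_HEADS, MAX_SEQ_LEN, HEAD_DIM))
    for _ in range(NUM_LAYERS)
]

def append_kv(layer_idx: int, k_new: torch.Tensor, v_new: torch.Tensor, pos: int):
    """Write this step's key/value for one layer, then return the visible prefix.

    The in-place writes below mutate state that persists between calls,
    which a stateless, SSA-based graph has no direct way to model.
    """
    k_buf, v_buf = kv_cache[layer_idx]
    k_buf[:, :, pos, :] = k_new  # in-place mutation of persistent state
    v_buf[:, :, pos, :] = v_new
    return k_buf[:, :, : pos + 1, :], v_buf[:, :, : pos + 1, :]
```

In a real server this state also has to be grown, evicted, and shared across concurrent requests (for example with paged blocks), which pushes the logic even further outside what a static graph can capture.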
This led us to a different approach: use TensorRT as a powerful engine for the model's forward pass, extended with custom plugins for attention and multi-node communication. For everything else—like KV Cache management and batch scheduling—we would build a separate C++ runtime. This design sidestepped the ONNX requirement by introducing a new frontend modeling API that mirrored PyTorch's operator semantics. The trade-off was that users had to redefine their models using this new API, but it gave them precise control over the generated TensorRT graph.
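To illustrate that division of labor, the sketch below shows a deliberately simplified serving loop in Python. It is hypothetical pseudocode rather than TensorRT-LLM's actual C++ runtime: the `engine` and `kv_manager` objects stand in for the compiled forward pass and the KV Cache manager, and their methods are invented for this example.

```python
from dataclasses import dataclass, field

@dataclass
class Request:
    prompt_ids: list[int]
    output_ids: list[int] = field(default_factory=list)
    max_new_tokens: int = 64

class ToyRuntime:
    """Hypothetical serving loop: the engine only runs the forward pass;
    batching, KV Cache bookkeeping, and stopping logic all live out here."""

    def __init__(self, engine, kv_manager, eos_id: int):
        self.engine = engine          # compiled forward pass (e.g., a TensorRT engine)
        self.kv_manager = kv_manager  # allocates and frees per-request cache blocks
        self.eos_id = eos_id
        self.active = []              # requests currently being served

    def add_request(self, req: Request):
        self.kv_manager.allocate(req)
        self.active.append(req)

    def step(self):
        if not self.active:
            return
        # 1. The scheduler picks which requests run this iteration (in-flight batching).
        batch = self.active[: self.kv_manager.capacity()]
        # 2. The engine executes one batched forward pass and returns next tokens.
        next_tokens = self.engine.forward(batch, self.kv_manager)
        # 3. The runtime applies the results and retires finished requests.
        for req, token in zip(batch, next_tokens):
            req.output_ids.append(token)
            if token == self.eos_id or len(req.output_ids) >= req.max_new_tokens:
                self.kv_manager.free(req)
                self.active.remove(req)
```

The point of the split is that scheduling and memory-management policies can evolve independently of the compiled forward pass.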
After successfully validating the approach on GPT-2 and GPT-3, the core design received the green light in July 2022.
Rapid Growth and the Limits of a C++ Architecture
I call this period the "TensorRT-based software architecture" phase. After we validated the core design, the launch of ChatGPT in late 2022 made LLM inference acceleration more critical than ever.
In 2023, the TensorRT-LLM project kicked into high gear, with engineers from teams across the globe joining the effort in a "swarm" model. The upside was rapid scaling, but it was a double-edged sword. Without meticulous coordination, this approach led to software inconsistencies, creating a "stitched-together" impression.
At the time, my hands-off management style encouraged rapid iteration but didn't enforce strict software principles, for which I take responsibility. The lessons learned were invaluable. Despite the growing pains, the project's velocity skyrocketed. We shipped several releases to Early Access customers, powered our first MLPerf submission, and launched our first public version, 0.5.0, in October 2023.
The Pivot to a PyTorch Backend for LLM Inference
After the 0.5.0 release, user feedback was clear: TensorRT-LLM's performance on NVIDIA GPUs was unmatched, but its usability was a major pain point. The barrier to entry was simply too high.
From late 2023 to mid-2024, we worked to address this by adding new features and revamping the user workflow. A key moment came during our first implementation of the Medusa algorithm. The initial version, built with PyTorch operators, was remarkably fast to develop compared to our C++ runtime implementations. This was a lightbulb moment, hinting that the high development cost of our C++-centric design was unsustainable.
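To give a sense of why the PyTorch path was so much faster to iterate on, the snippet below sketches the core of a Medusa-style draft: a handful of extra heads, each predicting a token a few positions ahead of the base model. This is an illustration of the published Medusa idea with placeholder names and sizes, not the TensorRT-LLM implementation, but it shows how little code a first prototype needs when you can stay in ordinary PyTorch modules.

```python
import torch
import torch.nn as nn

class ResBlock(nn.Module):
    """One residual block of a Medusa head: x + SiLU(W x)."""
    def __init__(self, hidden_size: int):
        super().__init__()
        self.linear = nn.Linear(hidden_size, hidden_size)
        self.act = nn.SiLU()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x + self.act(self.linear(x))

class MedusaHeads(nn.Module):
    """K extra heads; head k drafts the token at position t + k + 1
    from the base model's last hidden state at position t."""
    def __init__(self, hidden_size: int, vocab_size: int, num_heads: int = 4):
        super().__init__()
        self.heads = nn.ModuleList([
            nn.Sequential(ResBlock(hidden_size),
                          nn.Linear(hidden_size, vocab_size, bias=False))
            for _ in range(num_heads)
        ])

    def forward(self, hidden_state: torch.Tensor) -> list[torch.Tensor]:
        # One set of draft logits per head; a separate verification pass then
        # accepts or rejects the drafted tokens against the base model.
        return [head(hidden_state) for head in self.heads]
```

Porting the same logic to the TensorRT-based flow meant re-expressing it in the graph-building API and extending the C++ runtime, which is where the development-cost gap came from.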
Addressing Usability and Development Velocity
Despite improvements like the trtllm-build command and a new LLM API, complaints about usability grew. The root causes of these challenges were:
- Decentralized Development: Early design inconsistencies became problematic as the project's complexity grew.
- Ecosystem Friction: The separation between the TensorRT and PyTorch ecosystems created a "conversion tax" that frustrated users and slowed our own development.
- Slow Iteration: Prototyping a new algorithm like speculative decoding in PyTorch and then porting it to our TensorRT-based flow was a massive effort that killed iteration speed.
These issues were my responsibility, and they became the primary catalyst for our architectural pivot.
Re-architecting TensorRT-LLM: A PyTorch-First Approach
In late 2024, we initiated a major re-architecture effort centered on PyTorch. The goal was to resolve usability and development velocity issues while improving performance. This was the classic challenge of "changing the engine while the plane is in flight."
I focused heavily on establishing clear design principles. The user-facing entry point was unified through the LLM API to abstract away the backend implementation. For the backend, the runtime would be based on our modularized C++ implementation exposed via Python bindings. Our high-performance custom kernels would be wrapped to serve as both TensorRT plugins and PyTorch custom ops.
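As a rough picture of what that unified entry point looks like to a user, here is a minimal sketch that follows the shape of the documented LLM API quickstart; the model name, sampling parameters, and output fields are illustrative and may differ from the exact current API.

```python
from tensorrt_llm import LLM, SamplingParams

# One entry point regardless of backend: the LLM object hides engine
# construction, kernel selection, and runtime wiring behind a single call.
llm = LLM(model="TinyLlama/TinyLlama-1.1B-Chat-v1.0")

prompts = ["The fastest way to serve an LLM is"]
sampling = SamplingParams(temperature=0.8, top_p=0.95, max_tokens=64)

for output in llm.generate(prompts, sampling):
    print(output.outputs[0].text)
```

The same few lines are meant to work whichever backend executes the forward pass, which is what lets the implementation evolve without breaking user code.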
By late January 2025, we quietly released the PyTorch backend as an experimental feature, already including support for the upcoming Blackwell hardware architecture to streamline future migrations.
Validating the New Architecture with DeepSeek R1
Just before the 2025 Lunar New Year, the DeepSeek team released their R1 model. We faced a choice: support it on our mature TensorRT architecture or bet on the new PyTorch backend. I decided we would go all-in on the PyTorch architecture. My reasoning was that the conversion, debugging, and extension complexities of the old flow would be magnified on a complex model like DeepSeek R1.
This bet paid off. The team delivered results faster, and the work was featured at GTC. From that point on, nearly all new model support and optimization work for high-performance LLM inference was done on the PyTorch architecture.
TensorRT-LLM 1.0: The PyTorch Backend Goes Production
After GTC 2025, we continued to refine the architecture, addressing tech debt accrued during our rapid development. For foundational software like TensorRT-LLM, continuous refactoring is essential.
By September 2025, the PyTorch-based architecture had been battle-tested, powering key model optimizations, the MLPerf Inference v5.1 submission, and performance showcases on GB200. With its success proven, we officially launched the 1.0 release on September 24th—a historic milestone for the project.
Advanced LLM Optimization Techniques
Much of our recent work on advanced features is detailed in our technical blogs, covering topics from expert parallelism to new speculative decoding methods:
- Expert Parallelism
- Disaggregated Serving
- Speculative Decoding
- Advanced Features
Future of High-Performance LLM Inference with TensorRT-LLM
The fields of AI systems and inference technology are still brimming with exciting challenges. Making AI a true General-Purpose Technology requires a relentless drive to lower costs and improve efficiency. It is a privilege to contribute to that progress.
The journey of TensorRT-LLM underscores a critical principle: the most performant architecture is useless if it impedes innovation. By embracing the PyTorch ecosystem our users live in, we not only solved our own development bottlenecks but also built a stronger, more sustainable foundation for the future of AI inference.
References
- TensorRT-LLM 1.0 Release
- TensorRT-LLM API Documentation
- TensorRT-LLM Version 0.5.0 Release
- GitHub-First Development Model
- NVIDIA FasterTransformer Repository
- NVIDIA TensorRT for LLM Inference
- torch-TensorRT Repository
- LLM Inference Benchmarking with trtllm-bench
- Experimental PyTorch Backend Release
- Technical Blog: Optimizing DeepSeek-R1 Latency on B200 GPUs
- Technical Blog: DeepSeek R1 MTP Implementation and Optimization
- Technical Blog: Optimizing DeepSeek R1 Throughput on Blackwell GPUs
- TensorRT-LLM Software Architecture Roadmap
Key Takeaways
• TensorRT-LLM has officially transitioned to a PyTorch-first architecture to improve usability and development velocity while preserving performance.
• The 1.0 release ensures backward compatibility for existing LLM API users.
• The shift reduces the friction between prototyping in PyTorch and deploying optimized LLM inference on NVIDIA GPUs.