TensorRT
TensorRT is a high-performance software development kit (SDK) developed by NVIDIA specifically for optimizing and executing deep learning models. It functions as a specialized inference engine that takes trained neural networks and restructures them to run with maximum efficiency on NVIDIA Graphics Processing Units (GPUs). By streamlining the computational graph and managing memory usage, TensorRT significantly reduces inference latency and increases throughput, making it an essential tool for developers building applications that require immediate, real-time responses.
How TensorRT Optimizes Performance
The primary goal of TensorRT is to bridge the gap between a model trained in a flexible framework and a model deployed for speed. It achieves this through several sophisticated optimization techniques:
- Layer Fusion and Graph Optimization: TensorRT analyzes the network architecture and fuses multiple layers into a single operation. For instance, it might combine a convolution layer, its bias addition, and the following activation into one kernel (the first sketch after this list shows the pattern). This reduction in the number of operations minimizes the overhead of launching kernels on the GPU.
- Precision Calibration: To further accelerate performance, TensorRT supports model quantization. This process converts model weights from standard 32-bit floating point (FP32) to lower-precision formats such as half-precision floating point (FP16) or 8-bit integers (INT8). This drastically reduces memory bandwidth usage while maintaining accuracy through calibration; the export sketch below shows both paths.
- Kernel Auto-Tuning: Different GPU architectures handle mathematical operations differently. TensorRT automatically selects the best data layouts and algorithms from a vast library of optimized kernels, ensuring the model runs optimally on the specific target hardware, whether an NVIDIA Jetson module or a data center A100 GPU.
- Dynamic Tensor Memory: The SDK optimizes memory allocation by reusing memory for tensors (data containers) that are not needed simultaneously, effectively reducing the overall memory footprint during model deployment.
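To make the fusion idea concrete, here is a minimal PyTorch sketch (not TensorRT code) of the exact pattern described above: a convolution with a bias followed by an activation. In eager execution each module launches its own GPU kernel; TensorRT's graph optimizer collapses the pattern into a single fused kernel at engine build time. The layer sizes here are arbitrary placeholders.

import torch
from torch import nn

# Convolution (with bias) followed by a ReLU, as separate modules.
# Eager frameworks run these as separate GPU kernels; TensorRT fuses
# this pattern into one kernel when it builds the engine.
block = nn.Sequential(
    nn.Conv2d(3, 16, kernel_size=3, padding=1, bias=True),
    nn.ReLU(),
)

x = torch.randn(1, 3, 640, 640)  # dummy image batch
y = block(x)
print(y.shape)  # torch.Size([1, 16, 640, 640])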
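Quantized engine builds are also exposed through the Ultralytics export API, so you rarely need to drive TensorRT's calibrator by hand. A minimal sketch, assuming the half, int8, and data export arguments from the Ultralytics documentation (coco8.yaml stands in for your own calibration dataset):

from ultralytics import YOLO

model = YOLO("yolo11n.pt")

# FP16 engine: roughly halves memory traffic, usually with negligible accuracy loss
model.export(format="engine", half=True)

# INT8 engine: TensorRT calibrates on a small dataset to measure activation
# ranges before mapping weights and activations to 8-bit integers
model.export(format="engine", int8=True, data="coco8.yaml")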
Real-World Applications of TensorRT
Because of its ability to process massive amounts of data with minimal delay, TensorRT is widely adopted in industries relying on computer vision and complex AI tasks.
- Autonomous Vehicles: In the field of AI in automotive, self-driving cars must process video feeds from multiple cameras to detect pedestrians, signs, and obstacles instantly. Using TensorRT, perception models like object detection networks can analyze frames in milliseconds, allowing the vehicle's control system to make safety-critical decisions without lag.
- Smart Manufacturing: Modern factories utilize AI in manufacturing for automated optical inspection. High-speed cameras capture images of products on assembly lines, and TensorRT-optimized models identify defects or anomalies in real time. This ensures that quality control keeps pace with high-speed production environments, often deploying on edge AI devices right on the factory floor.
Using TensorRT with Ultralytics YOLO11
Integrating TensorRT into your workflow is straightforward with modern AI tools. The ultralytics package provides a seamless method to convert standard PyTorch models into TensorRT engines, allowing users to combine the state-of-the-art architecture of Ultralytics YOLO11 with the hardware acceleration of NVIDIA GPUs.
The following example demonstrates how to export a YOLO11 model to a TensorRT engine file (.engine) and use it for inference. Note that the export step must run on a machine with an NVIDIA GPU, since the resulting engine is built for, and tied to, the specific GPU and TensorRT version present:
from ultralytics import YOLO
# Load a pretrained YOLO11 model
model = YOLO("yolo11n.pt")
# Export the model to TensorRT format (creates 'yolo11n.engine')
# This step optimizes the model for the specific GPU currently in use
model.export(format="engine")
# Load the optimized TensorRT model for high-speed inference
tensorrt_model = YOLO("yolo11n.engine")
results = tensorrt_model("https://ultralytics.com/images/bus.jpg")
TensorRT vs. Other Inference Technologies
It is important to distinguish TensorRT from other tools in the machine learning ecosystem.
- TensorRT vs. Training Frameworks: Libraries like PyTorch and TensorFlow are designed primarily for training models, prioritizing flexibility and ease of debugging. TensorRT is strictly for inference, prioritizing raw speed and efficiency on specific hardware.
- TensorRT vs. ONNX Runtime: The ONNX (Open Neural Network Exchange) format is designed for interoperability across different platforms. While ONNX Runtime is a versatile engine that runs on a variety of hardware, TensorRT provides deeper, hardware-specific optimizations exclusive to NVIDIA GPUs, often yielding higher performance than generic runtimes.
- TensorRT vs. OpenVINO: Just as TensorRT is optimized for NVIDIA hardware, the OpenVINO toolkit is designed to accelerate inference on Intel processors (CPUs and integrated GPUs). Choosing between them depends entirely on your deployment hardware; the sketch after this list shows how each backend maps to an export target.
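Because the deciding factor is the target hardware, it helps that all three backends are reachable through the same export call. A minimal sketch, assuming the standard Ultralytics format strings:

from ultralytics import YOLO

model = YOLO("yolo11n.pt")

model.export(format="onnx")      # portable graph for ONNX Runtime on any vendor's hardware
model.export(format="openvino")  # OpenVINO IR for Intel CPUs and integrated GPUs
model.export(format="engine")    # TensorRT engine, NVIDIA GPUs only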
For scalable cloud deployments, TensorRT engines are frequently served using the NVIDIA Triton Inference Server, which manages model versions and handles concurrent requests efficiently.