High-Performance Large Language Model Inference Framework for NVIDIA Edge Platforms
Overview | Quick Start | Documentation | Roadmap
TensorRT Edge-LLM is NVIDIA's high-performance C++ inference runtime for Large Language Models (LLMs) and Vision-Language Models (VLMs) on embedded platforms. It enables efficient deployment of state-of-the-art language models on resource-constrained devices such as NVIDIA Jetson and NVIDIA DRIVE platforms. TensorRT Edge-LLM provides convenient Python scripts to convert HuggingFace checkpoints to ONNX; engine building and end-to-end inference run entirely on edge platforms.
For supported platforms, models, and precisions, see the Overview. You can get started with TensorRT Edge-LLM in under 15 minutes; for complete installation and usage instructions, see the Quick Start Guide.
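The workflow described above (checkpoint-to-ONNX conversion on a host, then engine build and inference on the device) can be sketched roughly as follows. The script names, binaries, flags, and model name below are illustrative assumptions, not TensorRT Edge-LLM's actual CLI; see the Quick Start Guide for the real commands.

```
# Hypothetical workflow sketch -- all command names and flags are
# illustrative placeholders, not the framework's actual interface.

# 1. On the host: export a HuggingFace checkpoint to ONNX with the
#    provided Python scripts.
python export_onnx.py --model meta-llama/Llama-3.2-1B --output ./onnx_model

# 2. On the edge device (Jetson / DRIVE): build the TensorRT engine
#    from the exported ONNX model.
./engine_builder --onnx ./onnx_model --engine ./model.engine

# 3. On the edge device: run end-to-end inference with the C++ runtime.
./llm_runtime --engine ./model.engine --prompt "Hello, world"
```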
- Overview - What is TensorRT Edge-LLM and key features
- Supported Models - Complete model compatibility matrix
- Checkpoint-Based Model Loader - Recommended ONNX export pipeline
- Installation - Set up Python export pipeline and C++ runtime
- Quick Start Guide - Run your first inference in ~15 minutes
- Examples - End-to-end workflows
- Quantization - Create quantized checkpoints for llm_loader
- Experimental High-Level Python API and Server - vLLM-style API and OpenAI-compatible server
- Input Format Guide - Request format and specifications
- Chat Template Format - Chat template configuration
- Experimental Quantization Package Design - Quantization package architecture
- Legacy Python Export Pipeline - Compatibility export path; tensorrt_edgellm/ will be removed in 0.8.0 after experimental/quantization -> experimental/llm_loader reaches full feature parity for all models and features
- Engine Builder - Building TensorRT engines
- C++ Runtime Overview - Runtime system architecture
- Customization Guide - Customizing TensorRT Edge-LLM for your needs
- TensorRT Plugins - Custom plugin development
- Tests - Comprehensive test suite for contributors
🚗 Automotive
- In-vehicle AI assistants
- Voice-controlled interfaces
- Scene understanding
- Driver assistance systems
🤖 Robotics
- Natural language interaction
- Task planning and reasoning
- Visual question answering
- Human-robot collaboration
🏭 Industrial IoT
- Equipment monitoring with NLP
- Automated inspection
- Predictive maintenance
- Voice-controlled machinery
📱 Edge Devices
- On-device chatbots
- Offline language processing
- Privacy-preserving AI
- Low-latency inference
- TensorRT Edge-LLM Jetson AI Lab tutorial
- Maximizing Memory Efficiency to Run Bigger Models on NVIDIA Jetson
- Build Next-Gen Physical AI with Edge-First LLMs for Autonomous Vehicles and Robotics
- Accelerate AI Inference for Edge and Robotics with NVIDIA Jetson T4000 and NVIDIA JetPack 7.1
- Accelerating LLM and VLM Inference for Automotive and Robotics with NVIDIA TensorRT Edge-LLM
Follow our GitHub repository for the latest updates, releases, and announcements.
- Documentation: Full Documentation
- Quick Start: Quick Start Guide
- Roadmap: Developer Roadmap
- Issues: GitHub Issues
- Discussions: GitHub Discussions
- Forums: NVIDIA Developer Forums
We welcome contributions! Please see our Contributing Guidelines for details.