triton-lang/triton
Development repository for the Triton language and compiler
The Triton Inference Server provides an optimized cloud and edge inferencing solution.
Fast inference engine for Transformer models
A Datacenter Scale Distributed Inference Serving Framework
A unified library of SOTA model optimization techniques like quantization, pruning, distillation, speculative decoding, etc. It compresses deep learning models for downstream deployment frameworks like TensorRT-LLM, TensorRT, vLLM, etc. to optimize inference speed.
All-in-one platform for search, recommendations, RAG, and analytics offered via API
TensorRT LLM provides users with an easy-to-use Python API to define Large Language Models (LLMs) and supports state-of-the-art optimizations to perform inference efficiently on NVIDIA GPUs. TensorRT LLM also contains components to create Python and C++ runtimes that orchestrate the inference execution in a performant way.