llm-d/llm-d
Achieve state-of-the-art inference performance with modern accelerators on Kubernetes
Optimized vLLM deployment for NVIDIA Blackwell (RTX 5090) on Linux Kernel 6.14. Resolves SM_120 kernel incompatibilities, P2P deadlocks, and memory fragmentation for high-performance LLM inference.
Fast Multimodal LLM on Mobile Devices
🦖 X—LLM: Cutting Edge & Easy LLM Finetuning
A unified library of SOTA model optimization techniques, including quantization, pruning, distillation, and speculative decoding. It compresses deep learning models for downstream deployment frameworks such as TensorRT-LLM, TensorRT, and vLLM to optimize inference speed.
A high-throughput and memory-efficient inference and serving engine for LLMs
TensorRT LLM provides users with an easy-to-use Python API to define Large Language Models (LLMs) and supports state-of-the-art optimizations to perform inference efficiently on NVIDIA GPUs. TensorRT LLM also contains components to create Python and C++ runtimes that orchestrate the inference execution in a performant way.