Dao-AILab/flash-attention
Fast and memory-efficient exact attention
[ICLR 2025] DuoAttention: Efficient Long-Context LLM Inference with Retrieval and Streaming Heads
FlashInfer: Kernel Library for LLM Serving
An Extensible Toolkit for Finetuning and Inference of Large Foundation Models. Large Models for All.
High-speed Large Language Model Serving for Local Deployment
Easy and efficient fine-tuning of LLMs (supports LLaMA, LLaMA 2, LLaMA 3, Qwen, Baichuan, GLM, Falcon). Efficient quantized training and deployment of large models.
PyTorch native quantization and sparsity for training and inference
5 captures since 2026-06-04
pyproject.toml · python · 2 dependencies
setup.py · python · 0 dependencies