evalplus/evalplus
Rigorous evaluation of LLM-synthesized code - NeurIPS 2023 & COLM 2024
One-for-All Multimodal Evaluation Toolkit Across Text, Image, Video, and Audio Tasks
Evals is a framework for evaluating LLMs and LLM systems, and an open-source registry of benchmarks.
[ICLR'25] BigCodeBench: Benchmarking Code Generation Towards AGI
Lighteval is your all-in-one toolkit for evaluating LLMs across multiple backends
A streamlined and customizable framework for efficient large model (LLM, VLM, AIGC) evaluation and performance benchmarking.