evalplus/evalplus
Rigorous evaluation of LLM-synthesized code - NeurIPS 2023 & COLM 2024
[ICLR'25] BigCodeBench: Benchmarking Code Generation Towards AGI
DeepResearch Bench: A Comprehensive Benchmark for Deep Research Agents
LiveBench: A Challenging, Contamination-Free LLM Benchmark
Beyond the Imitation Game collaborative benchmark for measuring and extrapolating the capabilities of language models
MLE-bench is a benchmark for measuring how well AI agents perform at machine learning engineering
One-for-All Multimodal Evaluation Toolkit Across Text, Image, Video, and Audio Tasks