LiveBench/LiveBench
LiveBench: A Challenging, Contamination-Free LLM Benchmark
Humanity's Last Exam
MLE-bench is a benchmark for measuring how well AI agents perform at machine learning engineering.
Holistic Evaluation of Language Models (HELM) is an open source Python framework created by the Center for Research on Foundation Models (CRFM) at Stanford for holistic, reproducible and transparent evaluation of foundation models, including large language models (LLMs) and multimodal models.
[ICLR'25] BigCodeBench: Benchmarking Code Generation Towards AGI
Lighteval is your all-in-one toolkit for evaluating LLMs across multiple backends.