bigcode-project/bigcodebench
[ICLR'25] BigCodeBench: Benchmarking Code Generation Towards AGI
LiveBench: A Challenging, Contamination-Free LLM Benchmark
MLE-bench is a benchmark for measuring how well AI agents perform at machine learning engineering
DeepResearch Bench: A Comprehensive Benchmark for Deep Research Agents
One-for-All Multimodal Evaluation Toolkit Across Text, Image, Video, and Audio Tasks
A Comprehensive Benchmark to Evaluate LLMs as Agents (ICLR'24)
A benchmark for LLMs on complicated tasks in the terminal