Ayanami0730/deep_research_bench
DeepResearch Bench: A Comprehensive Benchmark for Deep Research Agents
MLE-bench is a benchmark for measuring how well AI agents perform at machine learning engineering
[ICLR'25] BigCodeBench: Benchmarking Code Generation Towards AGI
LiveBench: A Challenging, Contamination-Free LLM Benchmark
A Comprehensive Benchmark to Evaluate LLMs as Agents (ICLR'24)
An agent benchmark with tasks in a simulated software company.
A benchmark for LLMs on complicated tasks in the terminal