openai/mle-bench
MLE-bench is a benchmark for measuring how well AI agents perform at machine learning engineering
DeepResearch Bench: A Comprehensive Benchmark for Deep Research Agents
[ICLR'25] BigCodeBench: Benchmarking Code Generation Towards AGI
A Comprehensive Benchmark to Evaluate LLMs as Agents (ICLR'24)
An agent benchmark with tasks in a simulated software company.
Tongyi Deep Research, the Leading Open-source Deep Research Agent
Claw-Eval is a harness for evaluating LLMs as agents. All tasks are verified by humans.