SWE-bench/SWE-bench
SWE-bench: Can Language Models Resolve Real-world GitHub Issues?
SWE-Bench Pro: Can AI Agents Solve Long-Horizon Software Engineering Tasks?
Evaluation harness for OpenHands V1.
Open-source benchmark for browser AI agents on 153 everyday online tasks across 144 live websites. 5-layer recording + DOM-match + LLM judge. Top score 33.3%.
MLE-bench is a benchmark for measuring how well AI agents perform at machine learning engineering.
An in-the-wild benchmark for AI agents in the OpenClaw Environment.
An agent benchmark with tasks in a simulated software company.