harbor-framework/terminal-bench
A benchmark for LLMs on complicated tasks in the terminal
An agent benchmark with tasks in a simulated software company.
An in-the-wild benchmark for AI agents in the OpenClaw Environment.
[ICLR'25] BigCodeBench: Benchmarking Code Generation Towards AGI
DeepResearch Bench: A Comprehensive Benchmark for Deep Research Agents
A Comprehensive Benchmark to Evaluate LLMs as Agents (ICLR'24)