harbor-framework/terminal-bench
A benchmark for LLMs on complicated tasks in the terminal
Microbenchmarking app for Swift with nice log-log plots
Wide NoSQL benchmark for RocksDB, LevelDB, Redis, WiredTiger and MongoDB extending the Yahoo Cloud Serving Benchmark
benchmark tooling that loves you ❤️
Open-source benchmark for browser AI agents on 153 everyday online tasks across 144 live websites. 5-layer recording + DOM-match + LLM judge. Top score 33.3%.
Bring your own agent and build a self-improving agentic system. Automatically mine failures, optimize the agent harness, and gate against regressions.
An in-the-wild benchmark for AI agents in the OpenClaw Environment.