bigcode-project/bigcodebench
[ICLR'25] BigCodeBench: Benchmarking Code Generation Towards AGI
Beyond the Imitation Game collaborative benchmark for measuring and extrapolating the capabilities of language models
A benchmark for LLMs on complicated tasks in the terminal
LiveBench: A Challenging, Contamination-Free LLM Benchmark
PinchBench is a benchmarking system for evaluating LLMs as OpenClaw coding agents. Made with 🦀 by the humans at https://kilo.ai
DeepResearch Bench: A Comprehensive Benchmark for Deep Research Agents
An in-the-wild benchmark for AI agents in the OpenClaw Environment.