TIGER-AI-Lab/ClawBench
Open-source benchmark for browser AI agents on 153 everyday online tasks across 144 live websites. 5-layer recording + DOM-match + LLM judge. Top score 33.3%.
An in-the-wild benchmark for AI agents in the OpenClaw Environment.
Claw-Eval is a harness for evaluating LLMs as agents. All tasks are verified by humans.
PinchBench is a benchmarking system for evaluating LLMs as OpenClaw coding agents. Made with 🦀 by the humans at https://kilo.ai
Build, Evaluate, and Deploy GUI Agents — online RL training, standardized benchmarks, and real-device deployment in one framework.
See your agent think. Real-time observability dashboard for OpenClaw AI agents.