InternLM/WildClawBench
An in-the-wild benchmark for AI agents in the OpenClaw Environment.
Open-source benchmark for browser AI agents on 153 everyday online tasks across 144 live websites. 5-layer recording + DOM-match + LLM judge. Top score 33.3%.
PinchBench is a benchmarking system for evaluating LLMs as OpenClaw coding agents. Made with 🦀 by the humans at https://kilo.ai
Claw-Eval is a harness for evaluating LLMs as agents. All tasks are verified by humans.
Regression testing for AI agents. Snapshot behavior, diff tool calls, catch regressions in CI. Works with LangGraph, CrewAI, OpenAI, Anthropic.
SWE-Bench Pro: Can AI Agents Solve Long-Horizon Software Engineering Tasks?
AI agent config detected
Key config paths
AGENTS.md