THUDM/AgentBench
A Comprehensive Benchmark to Evaluate LLMs as Agents (ICLR'24)
[NeurIPS'25 D&B] Mind2Web-2 Benchmark: Evaluating Agentic Search with Agent-as-a-Judge
DeepResearch Bench: A Comprehensive Benchmark for Deep Research Agents
An agent benchmark with tasks in a simulated software company.
Code repo for "WebArena: A Realistic Web Environment for Building Autonomous Agents"
LiveBench: A Challenging, Contamination-Free LLM Benchmark
The official repo of "WebExplorer: Explore and Evolve for Training Long-Horizon Web Agents"
2 captures since 2026-05-23
AI agent config detected
Key config paths
cache_manager_web/CLAUDE.md
CLAUDE.md
mind2web2/api_tools/CLAUDE.md
mind2web2/CLAUDE.md
mind2web2/llm_client/CLAUDE.md
mind2web2/prompts/CLAUDE.md
mind2web2/utils/CLAUDE.md