sierra-research/tau-bench
Code and Data for Tau-Bench
τ-Bench: A Benchmark for Tool-Agent-User Interaction in Real-World Domains
A Comprehensive Benchmark to Evaluate LLMs as Agents (ICLR'24)
A benchmark for LLMs on complicated tasks in the terminal
DeepResearch Bench: A Comprehensive Benchmark for Deep Research Agents
An agent benchmark with tasks in a simulated software company.
Collection of evals for Inspect AI
1 capture since 2026-05-25
AI agent config detected
Key config paths
.cursor
.cursor/rules
.cursor/rules/audio-native-provider.md
.cursor/rules/background-audio-files.md
.cursor/rules/nova-sonic.md
AGENTS.md
src/tau2/agent/AGENTS.md
src/tau2/domains/AGENTS.md
src/tau2/evaluator/AGENTS.md
src/tau2/voice/AGENTS.md
src/tau2/voice/audio_native/AGENTS.md
tests/AGENTS.md