sierra-research/tau2-bench
τ-Bench: A Benchmark for Tool-Agent-User Interaction in Real-World Domains
Code and Data for Tau-Bench
τ-Bench: A Benchmark for Tool-Agent-User Interaction in Real-World Domains
A benchmark for LLMs on complicated tasks in the terminal
An agent benchmark with tasks in a simulated software company.
A Comprehensive Benchmark to Evaluate LLMs as Agents (ICLR'24)
DeepResearch Bench: A Comprehensive Benchmark for Deep Research Agents