TheAgentCompany/TheAgentCompany
An agent benchmark with tasks in a simulated software company.
A Comprehensive Benchmark to Evaluate LLMs as Agents (ICLR'24)
A generalized information-seeking agent system with Large Language Models (LLMs).
[ICLR 2026] LLM/VLM gaming agents and model evaluation through games.
DeepResearch Bench: A Comprehensive Benchmark for Deep Research Agents
A benchmark for LLMs on complicated tasks in the terminal
[NeurIPS'25 D&B] Mind2Web-2 Benchmark: Evaluating Agentic Search with Agent-as-a-Judge