THUDM/AgentBench
A Comprehensive Benchmark to Evaluate LLMs as Agents (ICLR'24)
[ICLR 2026] LLM/VLM gaming agents and model evaluation through games.
A generalized information-seeking agent system with Large Language Models (LLMs).
An agent benchmark with tasks in a simulated software company.
One-for-All Multimodal Evaluation Toolkit Across Text, Image, Video, and Audio Tasks
A lightweight framework for building LLM-based agents
MLE-bench is a benchmark for measuring how well AI agents perform at machine learning engineering
pyproject.toml · python · 26 dependencies
requirements.txt · python · 26 dependencies