THUDM/AgentBench
A Comprehensive Benchmark to Evaluate LLMs as Agents (ICLR'24)
ICML 2024: Improving Factuality and Reasoning in Language Models through Multiagent Debate
Benchmarking large language models' complex reasoning ability with chain-of-thought prompting
Code for arXiv 2023: Improving Language Model Negotiation with Self-Play and In-Context Learning from AI Feedback
[ICLR 2026] LLM/VLM gaming agents and model evaluation through games.
One-for-All Multimodal Evaluation Toolkit Across Text, Image, Video, and Audio Tasks
data-to-paper: Backward-traceable AI-driven scientific research
requirements.txt
· python · 4 dependencies