ucsb-mlsec/terminal-bench-env
No description.
A benchmark for LLMs on complicated tasks in the terminal
An agent benchmark with tasks in a simulated software company.
A Comprehensive Benchmark to Evaluate LLMs as Agents (ICLR'24)
[ICLR'25] BigCodeBench: Benchmarking Code Generation Towards AGI
Beyond the Imitation Game collaborative benchmark for measuring and extrapolating the capabilities of language models
Code and Data for Tau-Bench
pyproject.toml · python · 27 dependencies
uv.lock · python · 0 dependencies
discord-bot/requirements.txt · python · 7 dependencies
adapters/appworld/pyproject.toml · python · 2 dependencies
adapters/deveval/pyproject.toml · python · 2 dependencies
adapters/mlebench/pyproject.toml · python · 2 dependencies
adapters/swebench/pyproject.toml · python · 3 dependencies
adapters/sweperf/pyproject.toml · python · 2 dependencies
adapters/swesmith/pyproject.toml · python · 3 dependencies
adapters/USACO/pyproject.toml · python · 2 dependencies
AI agent config detected
Key config paths
.gemini
.gemini/config.yaml
.gemini/styleguide.md
.github/copilot-instructions.md
CLAUDE.md