awslabs/agent-evaluation
A generative AI-powered framework for testing virtual agents.
Repository profile
The Continuous-Improvement Stack for Agents. Our environment data and evals power agent improvement and monitoring.
Tracked growth, recent movement, and commit velocity from stored repository snapshots.
Latest capture 2026-06-12 10:50
1 capture since 2026-06-12
Stars from baseline 0
Frameworks, package managers, ecosystems, and dependency manifests found during catalog scans.
Scanned 2026-06-12 10:50
pyproject.toml: python ecosystem, 37 dependencies
uv.lock: python ecosystem, 0 dependencies
examples/basic-distributed-tracing/pyproject.toml: python ecosystem, 4 dependencies
examples/basic-evaluation/pyproject.toml: python ecosystem, 1 dependency
examples/basic-linked-trace/pyproject.toml: python ecosystem, 1 dependency
examples/basic-tracing/pyproject.toml: python ecosystem, 2 dependencies
examples/claude-agent-sdk/pyproject.toml: python ecosystem, 2 dependencies
examples/google-adk/pyproject.toml: python ecosystem, 2 dependencies
Nearest indexed repositories by embedding similarity.
A generative AI-powered framework for testing virtual agents.
The self-improving Agentic QA harness with Memory. Write tests in natural language. Catch regressions before releases ship.
An agent benchmark with tasks in a simulated software company.
Evaluation and Tracking for LLM Experiments and AI Agents
BenchClaw — Multi-dimensional AI agent evaluation with 17-judge AI Tribunal, 10 scoring dimensions, radar charts, and deception detection. Benchmark any LLM agent.
🧠 Make your agents learn from experience. Now available as a hosted solution at kayba.ai