openai/simple-evals
No description.
Python interface for evaluation on ChatGPT models
No description.
One-for-All Multimodal Evaluation Toolkit Across Text, Image, Video, and Audio Tasks
Code and Data for Tau-Bench
Claw-Eval is a harness for evaluating LLMs as agents. All tasks are verified by humans.
τ-Bench: A Benchmark for Tool-Agent-User Interaction in Real-World Domains
Beyond the Imitation Game collaborative benchmark for measuring and extrapolating the capabilities of language models