openai/simple-evals
bloom - evaluate any behavior immediately 🌸🌱
LiveBench: A Challenging, Contamination-Free LLM Benchmark
[ICLR'25] BigCodeBench: Benchmarking Code Generation Towards AGI
Claw-Eval is a harness for evaluating LLMs as agents. All tasks are verified by humans.
Toolkit for linearizing PDFs for LLM datasets/training
Lighteval is your all-in-one toolkit for evaluating LLMs across multiple backends