EvolvingLMMs-Lab/lmms-eval
One-for-All Multimodal Evaluation Toolkit Across Text, Image, Video, and Audio Tasks
Holistic Evaluation of Language Models (HELM) is an open-source Python framework created by the Center for Research on Foundation Models (CRFM) at Stanford for holistic, reproducible, and transparent evaluation of foundation models, including large language models (LLMs) and multimodal models.
Rigorous evaluation of LLM-synthesized code - NeurIPS 2023 & COLM 2024
Lighteval is your all-in-one toolkit for evaluating LLMs across multiple backends
Open-source evaluation toolkit for large multi-modality models (LMMs), supporting 220+ LMMs and 80+ benchmarks
[ICLR'25] BigCodeBench: Benchmarking Code Generation Towards AGI