openai/mle-bench
MLE-bench is a benchmark for measuring how well AI agents perform at machine learning engineering
MTEB: Massive Text Embedding Benchmark
A benchmark for LLMs on complicated tasks in the terminal
[ICLR'25] BigCodeBench: Benchmarking Code Generation Towards AGI
Beyond the Imitation Game collaborative benchmark for measuring and extrapolating the capabilities of language models
One-for-All Multimodal Evaluation Toolkit Across Text, Image, Video, and Audio Tasks
DeepResearch Bench: A Comprehensive Benchmark for Deep Research Agents