Awesome

GitHub projects from awesome lists

Repos indexed: 17,373
Awesome lists tracked: 125
Current results: 31

31 repos shown

Topic: evaluation

langfuse/langfuse

🪢 Open source AI engineering platform: LLM evals, observability, metrics, prompt management, playground, datasets. Integrates with OpenTelemetry, LangChain, OpenAI SDK, LiteLLM, and more. 🍊YC W23

AI dev

Stack

TypeScript npm pnpm

GitHub topics

#analytics #autogen #evaluation #langchain #large-language-models #llama-index

Updated: 2026-07-17
Lists: 11 list mentions
First commit: 2023-05-18
History: 32 history points
License: NOASSERTION
Issues: 709 open

Stars: 31,301

Forks: 3,298
Commits: 7,985 commits
Star growth, last 7 days: +397 +1.3%
Commit velocity, last 7 days: +122 +1.6%

Website GitHub

mlflow/mlflow

The open source AI engineering platform for agents, LLMs, and ML models. MLflow enables teams of all sizes to debug, evaluate, monitor, and optimize production-quality AI applications while controlling costs and managing access to models and data.

AI dev

Stack

Python Maven npm PEP 517

GitHub topics

#agentops #agents #ai #ai-governance #apache-spark #evaluation

Updated: 2026-07-17
Lists: 4 list mentions
First commit: 2018-06-05
History: 48 history points
License: Apache-2.0
Issues: 2,068 open

Stars: 27,065

Forks: 6,014
Commits: 12,688 commits
Star growth, last 7 days: +93 +0.3%
Commit velocity, last 7 days: +31 +0.2%

Website GitHub

promptfoo/promptfoo

Test your prompts, agents, and RAGs. Red teaming/pentesting/vulnerability scanning for AI. Compare performance of GPT, Claude, Gemini, DeepSeek, and more. Simple declarative configs with command line and CI/CD integration. Used by OpenAI and Anthropic.

AI dev

Stack

TypeScript Express FastAPI Gradio PydanticAI Go modules npm PEP 517

GitHub topics

#ci #ci-cd #cicd #evaluation #evaluation-framework #llm

Updated: 2026-07-07
Lists: 7 list mentions
First commit: 2023-03-31
History: 8 history points
License: MIT
Issues: 397 open

Stars: 22,984

Forks: 2,049
Commits: 9,142 commits
Star growth, last 7 days: No 7-day history
Commit velocity, last 7 days: No 7-day history

Website GitHub

comet-ml/opik

Debug, evaluate, and monitor your LLM applications, RAG systems, and agentic workflows with comprehensive tracing, automated evaluations, and production-ready dashboards.

AI dev

Stack

Python FastAPI Flask Jupyter LangChain Maven npm PEP 517

GitHub topics

#evaluation #hacktoberfest #hacktoberfest2025 #langchain #llama-index #llm

Updated: 2026-07-17
Lists: 9 list mentions
First commit: 2024-09-02
History: 52 history points
License: Apache-2.0
Issues: 151 open

Stars: 20,642

Forks: 1,612
Commits: 6,432 commits
Star growth, last 7 days: +111 +0.5%
Commit velocity, last 7 days: +56 +0.9%

Website GitHub

vibrantlabsai/ragas

Supercharge Your LLM Application Evaluations 🚀

AI dev

Stack

Python Jupyter LangChain LlamaIndex pytest uv

GitHub topics

#evaluation #llm #llmops

Updated: 2026-02-24
Lists: 2 list mentions
First commit: 2023-05-08
History: 6 history points
License: Apache-2.0
Issues: 501 open

Stars: 14,845

Forks: 1,564
Commits: 1,147 commits
Star growth, last 7 days: 0 0.0%
Commit velocity, last 7 days: 0 0.0%

Website GitHub

oumi-ai/oumi

Easily fine-tune, evaluate and deploy Gemma 4, Qwen3.5, Qwen3.6, gpt-oss, DeepSeek-R1, or any open source LLM / VLM!

Stack

Python PEP 517

GitHub topics

#dpo #evaluation #fine-tuning #gpt-oss #gpt-oss-120b #gpt-oss-20b

Updated: 2026-07-15
Lists: 1 list mention
First commit: 2024-05-07
History: 5 history points
License: Apache-2.0
Issues: 26 open

Stars: 9,346

Forks: 783
Commits: 2,198 commits
Star growth, last 7 days: 0 0.0%
Commit velocity, last 7 days: 0 0.0%

Website GitHub

open-compass/opencompass

OpenCompass is an LLM evaluation platform, supporting a wide range of models (Llama3, Mistral, InternLM2, GPT-4, LLaMa2, Qwen, GLM, Claude, etc.) over 100+ datasets.

Stack

Python pip

GitHub topics

#benchmark #chatgpt #evaluation #large-language-model #llama2 #llama3

Updated: 2026-07-14
Lists: 3 list mentions
First commit: 2023-07-04
History: 5 history points
License: Apache-2.0
Issues: 418 open

Stars: 7,191

Forks: 807
Commits: 1,172 commits
Star growth, last 7 days: 0 0.0%
Commit velocity, last 7 days: 0 0.0%

Website GitHub

Helicone/helicone

🧊 Open source LLM observability platform. One line of code to monitor, evaluate, and experiment. YC W23 🍓

AI dev

Stack

TypeScript Express FastAPI Fastify Next.js npm pip Poetry

GitHub topics

#agent-monitoring #analytics #evaluation #gpt #langchain #large-language-models

Updated: 2026-07-05
Lists: 5 list mentions
First commit: 2022-11-29
History: 49 history points
License: Apache-2.0
Issues: 128 open

Stars: 5,954

Forks: 629
Commits: 5,478 commits
Star growth, last 7 days: +22 +0.4%
Commit velocity, last 7 days: 0 0.0%

Website GitHub

coze-dev/coze-loop

Next-generation AI Agent Optimization Platform: Cozeloop addresses challenges in AI agent development by providing full-lifecycle management capabilities from development, debugging, and evaluation to monitoring.

AI dev

Stack

Go Cobra gRPC Go React Tailwind CSS Go modules npm pnpm

GitHub topics

#agent #agent-evaluation #agent-observability #agentops #ai #coze

Updated: 2026-07-03
Lists: 1 list mention
First commit: 2025-06-26
History: 4 history points
License: Apache-2.0
Issues: 67 open

Stars: 5,573

Forks: 771
Commits: 327 commits
Star growth, last 7 days: No 7-day history
Commit velocity, last 7 days: No 7-day history

GitHub

Kiln-AI/Kiln

Build, Evaluate, and Optimize AI Systems. Includes evals, RAG, agents, fine-tuning, synthetic data generation, dataset management, MCP, and more.

AI dev

Stack

Python FastAPI LangChain LlamaIndex pytest npm pip uv

GitHub topics

#ai #chain-of-thought #collaboration #dataset-generation #evals #evaluation

Updated: 2026-07-14
Lists: 1 list mention
First commit: 2024-08-02
History: 5 history points
License: NOASSERTION
Issues: 62 open

Stars: 4,965

Forks: 376
Commits: 6,936 commits
Star growth, last 7 days: 0 0.0%
Commit velocity, last 7 days: 0 0.0%

Website GitHub

Marker-Inc-Korea/AutoRAG

AutoRAG: An Open-Source Framework for Retrieval-Augmented Generation (RAG) Evaluation & Optimization with AutoML-Style Automation

AI dev

Stack

Python PEP 517 pip uv

GitHub topics

#analysis #automl #benchmarking #document-parser #embeddings #evaluation

Updated: 2026-07-02
Lists: 2 list mentions
First commit: 2024-01-10
History: 5 history points
License: Apache-2.0
Issues: 171 open

Stars: 4,871

Forks: 407
Commits: 881 commits
Star growth, last 7 days: 0 0.0%
Commit velocity, last 7 days: 0 0.0%

Website GitHub

EvolvingLMMs-Lab/lmms-eval

One-for-All Multimodal Evaluation Toolkit Across Text, Image, Video, and Audio Tasks

Stack

Python FastAPI LangChain React Tailwind CSS npm pip uv

GitHub topics

#agi #audio-evaluation #benchmark #evaluation #large-language-models #llm-evaluation

Updated: 2026-07-14
Lists: 1 list mention
First commit: 2023-12-24
History: 5 history points
License: NOASSERTION
Issues: 38 open

Stars: 4,310

Forks: 619
Commits: 1,797 commits
Star growth, last 7 days: 0 0.0%
Commit velocity, last 7 days: 0 0.0%

Website GitHub

Agenta-AI/agenta

The open-source LLMOps platform: prompt playground, prompt management, LLM evaluation, and LLM observability all in one place.

AI dev

Stack

TypeScript Express FastAPI LangChain LlamaIndex npm PEP 517 pip

GitHub topics

#agents #evaluation #llm-as-a-judge #llm-evaluation #llm-framework #llm-monitoring

Updated: 2026-07-17
Lists: 5 list mentions
First commit: 2023-04-27
History: 58 history points
License: NOASSERTION
Issues: 219 open

Stars: 4,298

Forks: 572
Commits: 23,094 commits
Star growth, last 7 days: +15 +0.4%
Commit velocity, last 7 days: +1,629 +7.6%

Website GitHub

open-compass/VLMEvalKit

Open-source evaluation toolkit for large multi-modality models (LMMs), supporting 220+ LMMs and 80+ benchmarks

Stack

Python pip

GitHub topics

#chatgpt #claude #clip #computer-vision #evaluation #gemini

Updated: 2026-07-14
Lists: 2 list mentions
First commit: 2023-11-25
History: 5 history points
License: Apache-2.0
Issues: 262 open

Stars: 4,282

Forks: 730
Commits: 1,772 commits
Star growth, last 7 days: 0 0.0%
Commit velocity, last 7 days: 0 0.0%

Website GitHub

langwatch/langwatch

The platform for LLM evaluations and AI agent testing

AI dev

Stack

TypeScript Express FastAPI gRPC Go Jupyter Go modules npm pnpm

GitHub topics

#ai #analytics #datasets #dspy #evaluation #gpt

Updated: 2026-07-15
Lists: 4 list mentions
First commit: 2023-09-09
History: 7 history points
License: Apache-2.0
Issues: 615 open

Stars: 3,365

Forks: 331
Commits: 5,664 commits
Star growth, last 7 days: 0 0.0%
Commit velocity, last 7 days: 0 0.0%

Website GitHub

embeddings-benchmark/mteb

MTEB: State-of-the-art evaluation of embeddings across languages and modalities

Stack

Python FastAPI Gradio pytest uv

GitHub topics

#benchmark #bitext-mining #clustering #embeddings #evaluation #information-retrieval

Updated: 2026-07-14
Lists: 1 list mention
First commit: 2022-04-05
History: 5 history points
License: Apache-2.0
Issues: 288 open

Stars: 3,357

Forks: 640
Commits: 4,302 commits
Star growth, last 7 days: 0 0.0%
Commit velocity, last 7 days: 0 0.0%

Website GitHub

CLUEbenchmark/SuperCLUE

SuperCLUE: A Comprehensive Benchmark for Chinese General-Purpose Foundation Models

GitHub topics

#chatgpt #chinese #evaluation #foundation-models #gpt-4

Updated: 2026-02-06
Lists: 2 list mentions
First commit: 2023-05-02
History: 5 history points
License: Unknown
Issues: 38 open

Stars: 3,295

Forks: 109
Commits: 255 commits
Star growth, last 7 days: No 7-day history
Commit velocity, last 7 days: No 7-day history

Website GitHub

lmnr-ai/lmnr

Laminar - open-source observability platform purpose-built for AI agents. YC S24.

AI dev

Stack

TypeScript Actix Web Axum Next.js React Cargo npm pnpm

GitHub topics

#agent-observability #agents #ai #ai-observability #aiops #analytics

Updated: 2026-07-14
Lists: 1 list mention
First commit: 2024-08-29
History: 5 history points
License: Apache-2.0
Issues: 89 open

Stars: 3,093

Forks: 217
Commits: 1,760 commits
Star growth, last 7 days: 0 0.0%
Commit velocity, last 7 days: 0 0.0%

Website GitHub

modelscope/evalscope

A streamlined and customizable framework for efficient large model (LLM, VLM, AIGC) evaluation and performance benchmarking.

AI dev

Stack

Python npm PEP 517 pip

GitHub topics

#evaluation #llm #performance #rag #vlm

Updated: 2026-07-14
Lists: 2 list mentions
First commit: 2023-12-07
History: 6 history points
License: Apache-2.0
Issues: 30 open

Stars: 3,078

Forks: 422
Commits: 862 commits
Star growth, last 7 days: 0 0.0%
Commit velocity, last 7 days: 0 0.0%

Website GitHub

ianarawjo/ChainForge

An open-source visual programming environment for battle-testing prompts to LLMs.

Stack

TypeScript Flask React npm pip

GitHub topics

#ai #evaluation #large-language-models #llmops #llms #prompt-engineering

Updated: 2026-06-10
Lists: 1 list mention
First commit: 2023-03-26
History: 5 history points
License: MIT
Issues: 69 open

Stars: 3,012

Forks: 257
Commits: 417 commits
Star growth, last 7 days: 0 0.0%
Commit velocity, last 7 days: 0 0.0%

Website GitHub

huggingface/lighteval

Lighteval is your all-in-one toolkit for evaluating LLMs across multiple backends

Stack

Python pytest PEP 517 pip

GitHub topics

#evaluation #evaluation-framework #evaluation-metrics #huggingface

Updated: 2026-06-29
Lists: 2 list mentions
First commit: 2024-01-26
History: 5 history points
License: MIT
Issues: 353 open

Stars: 2,476

Forks: 510
Commits: 586 commits
Star growth, last 7 days: 0 0.0%
Commit velocity, last 7 days: 0 0.0%

Website GitHub

huggingface/evaluate

🤗 Evaluate: A library for easily evaluating machine learning models and datasets.

Stack

Python pip

GitHub topics

#evaluation #machine-learning

Updated: 2026-07-06
Lists: 1 list mention
First commit: 2020-04-14
History: 5 history points
License: Apache-2.0
Issues: 292 open

Stars: 2,464

Forks: 329
Commits: 990 commits
Star growth, last 7 days: 0 0.0%
Commit velocity, last 7 days: 0 0.0%

Website GitHub

onestardao/WFGY

WFGY is heading toward WFGY 5.0 Polaris Protocol, a major open-source release for AI reasoning, RAG, agents, and real-world workflows. Includes Problem Map, Global Debug Card, WFGY 4.0, and the CFV Easter Egg.

AI dev

Stack

Jupyter Notebook Gradio PEP 517 pip

GitHub topics

#ai-agents #alignment #debugging #evaluation #graphrag #hallucination

Updated: 2026-07-14
Lists: 1 list mention
First commit: 2025-06-04
History: 5 history points
License: NOASSERTION
Issues: 11 open

Stars: 1,767

Forks: 162
Commits: 7,081 commits
Star growth, last 7 days: 0 0.0%
Commit velocity, last 7 days: 0 0.0%

GitHub

trpc-group/trpc-agent-go

A Go framework for building production agent systems with graph workflows, tools, memory, A2A, AG-UI, MCP, evaluation, and observability.

AI dev

Stack

Go gRPC Go Go modules npm

GitHub topics

#a2a #a2a-protocol #ag-ui #agent #agent-framework #ai

Updated: 2026-07-08
Lists: 3 list mentions
First commit: 2025-05-14
History: 3 history points
License: Apache-2.0
Issues: 103 open

Stars: 1,520

Forks: 218
Commits: 1,765 commits
Star growth, last 7 days: No 7-day history
Commit velocity, last 7 days: No 7-day history

Website GitHub

EthicalML/xai

XAI - An eXplainability toolbox for machine learning

Stack

Python Jupyter pytest PEP 517 uv

GitHub topics

#ai #artificial-intelligence #bias #bias-evaluation #downsampling #evaluation

Updated: 2025-11-29
Lists: 1 list mention
First commit: 2018-12-09
History: 5 history points
License: MIT
Issues: 0 open

Stars: 1,253

Forks: 186
Commits: 101 commits
Star growth, last 7 days: 0 0.0%
Commit velocity, last 7 days: 0 0.0%

Website GitHub

NVIDIA-NeMo/Gym

Evaluate and improve models and agents using environments

AI dev

Stack

Python FastAPI pytest npm pip uv

GitHub topics

#agents #benchmarks #environments #evaluation #gym #llm

Updated: 2026-07-15
Lists: 1 list mention
First commit: 2025-08-25
History: 5 history points
License: Apache-2.0
Issues: 495 open

Stars: 1,048

Forks: 225
Commits: 787 commits
Star growth, last 7 days: 0 0.0%
Commit velocity, last 7 days: 0 0.0%

Website GitHub

TIGER-AI-Lab/ClawBench

Open-source benchmark for browser AI agents on daily tasks.

AI dev

Stack

Python FastAPI pytest Starlette uv

GitHub topics

#agent-evaluation #agentic-ai #ai-agent-benchmark #ai-agents #benchmark #browser-agent

Updated: 2026-07-11
Lists: 1 list mention
First commit: 2026-04-10
History: 4 history points
License: Apache-2.0
Issues: 41 open

Stars: 471

Forks: 27
Commits: 358 commits
Star growth, last 7 days: 0 0.0%
Commit velocity, last 7 days: 0 0.0%

Website GitHub

INGInious/INGInious

INGInious is a secure and automated exercises assessment platform using your own tests, also providing a pluggable interface with your existing LMS.

Stack

Python Flask pytest npm PEP 517

GitHub topics

#assessment #autograding #coding-interviews #e-assessment #education #evaluation

Updated: 2026-06-19
Lists: 1 list mention
First commit: 2014-07-01
History: 5 history points
License: NOASSERTION
Issues: 157 open

Stars: 237

Forks: 147
Commits: 3,205 commits
Star growth, last 7 days: 0 0.0%
Commit velocity, last 7 days: 0 0.0%

Website GitHub

hidai25/eval-view

Regression testing for AI agents. Snapshot behavior, diff tool calls, catch regressions in CI. Works with LangGraph, CrewAI, OpenAI, Anthropic.

AI dev

Stack

Python FastAPI Next.js pytest npm PEP 517 pip

GitHub topics

#agent-benchmark #agent-evaluation #agentic-ai #ai-agents #anthropic #autogen

Updated: 2026-07-03
Lists: 1 list mention
First commit: 2025-11-17
History: 5 history points
License: Apache-2.0
Issues: 6 open

Stars: 121

Forks: 21
Commits: 804 commits
Star growth, last 7 days: No 7-day history
Commit velocity, last 7 days: No 7-day history

Website GitHub

fastxyz/skill-optimizer

Benchmark, evaluate, and optimize skills to ensure reliable performance across all LLMs

AI dev

Stack

TypeScript Express npm

GitHub topics

#ai #ai-agent #ai-skill #benchmark #cli #eval

Updated: 2026-05-28
Lists: 1 list mention
First commit: 2026-03-10
History: 5 history points
License: MIT
Issues: 13 open

stars

Forks: 11
Commits: 70 commits
Star growth, last 7 days: 0 0.0%
Commit velocity, last 7 days: 0 0.0%

GitHub
