Awesome-list intelligence for GitHub
Discover projects curated by awesome-list maintainers, then narrow them by stars, age, freshness, archive status, language, topics, generated tags, detected stacks, package managers, and source list.
Open source LLM engineering platform: LLM Observability, metrics, evals, prompt management, playground, datasets. Integrates with OpenTelemetry, LangChain, OpenAI SDK, LiteLLM, and more. YC W23
The open source AI engineering platform for agents, LLMs, and ML models. MLflow enables teams of all sizes to debug, evaluate, monitor, and optimize production-quality AI applications while controlling costs and managing access to models and data.
Test your prompts, agents, and RAGs. Red teaming/pentesting/vulnerability scanning for AI. Compare performance of GPT, Claude, Gemini, DeepSeek, and more. Simple declarative configs with command line and CI/CD integration. Used by OpenAI and Anthropic.
Debug, evaluate, and monitor your LLM applications, RAG systems, and agentic workflows with comprehensive tracing, automated evaluations, and production-ready dashboards.
Supercharge Your LLM Application Evaluations
Easily fine-tune, evaluate and deploy Gemma 4, Qwen3.5, Qwen3.6, gpt-oss, DeepSeek-R1, or any open source LLM / VLM!
OpenCompass is an LLM evaluation platform, supporting a wide range of models (Llama3, Mistral, InternLM2, GPT-4, LLaMa2, Qwen, GLM, Claude, etc.) over 100+ datasets.
Open source LLM observability platform. One line of code to monitor, evaluate, and experiment. YC W23
Next-generation AI Agent Optimization Platform: Cozeloop addresses challenges in AI agent development by providing full-lifecycle management capabilities from development, debugging, and evaluation to monitoring.
Build, Evaluate, and Optimize AI Systems. Includes evals, RAG, agents, fine-tuning, synthetic data generation, dataset management, MCP, and more.
AutoRAG: An Open-Source Framework for Retrieval-Augmented Generation (RAG) Evaluation & Optimization with AutoML-Style Automation
The open-source LLMOps platform: prompt playground, prompt management, LLM evaluation, and LLM observability all in one place.
One-for-All Multimodal Evaluation Toolkit Across Text, Image, Video, and Audio Tasks
Open-source evaluation toolkit for large multi-modality models (LMMs), supporting 220+ LMMs and 80+ benchmarks
SuperCLUE: A Benchmark for Foundation Models in Chinese
The platform for LLM evaluations and AI agent testing
An open-source visual programming environment for battle-testing prompts to LLMs.
Laminar - open-source observability platform purpose-built for AI agents. YC S24.
A streamlined and customizable framework for efficient large model (LLM, VLM, AIGC) evaluation and performance benchmarking.
Evaluate: A library for easily evaluating machine learning models and datasets.
Lighteval is your all-in-one toolkit for evaluating LLMs across multiple backends
WFGY is heading toward WFGY 5.0 Polaris Protocol, a major open-source release for AI reasoning, RAG, agents, and real-world workflows. Includes Problem Map, Global Debug Card, WFGY 4.0, and the CFV Easter Egg.
XAI - An eXplainability toolbox for machine learning
Evaluate and improve models and agents using environments
Open-source benchmark for browser AI agents on 153 everyday online tasks across 144 live websites. 5-layer recording + DOM-match + LLM judge. Top score 33.3%.
INGInious is a secure and automated exercises assessment platform using your own tests, also providing a pluggable interface with your existing LMS.
Regression testing for AI agents. Snapshot behavior, diff tool calls, catch regressions in CI. Works with LangGraph, CrewAI, OpenAI, Anthropic.
Benchmark, evaluate, and optimize skills to ensure reliable performance across all LLMs