Open highlighted repo slot
Put your repository first
Promote a GitHub repo at the top of Awesome repository list views for 7 days.
Awesome List
Awesome list for AI agent harness engineering: tools, patterns, evals, memory, MCP, permissions, observability, and orchestration.
GitHub stars and default-branch commits for ai-boost/awesome-harness-engineering.
125 repos currently saved from this list.
Open highlighted repo slot
Promote a GitHub repo at the top of Awesome repository list views for 7 days.
Visual testing tool for MCP servers
AI Observability & Evaluation
The best agent harness.
Build effective agents using Model Context Protocol and simple workflow patterns
Open-source observability for your GenAI or LLM application, based on OpenTelemetry
From a goal to a task DAG, automatically. TypeScript-native multi-agent orchestration.
NeMo Guardrails is an open-source toolkit for easily adding programmable guardrails to LLM-based conversational systems.
OpenShell is the safe, private runtime for autonomous AI agents.
[EMNLP'23, ACL'24] To speed up LLMs' inference and enhance LLM's perceive of key information, compress the prompt and KV-Cache, which achieves up to 20x compression with minimal performance loss.
The first "code-first" agent framework for seamlessly planning and executing data analytics tasks.
Instant, Concurrent, Secure & Lightweight Sandbox for AI Agents.
🧊 Open source LLM observability platform. One line of code to monitor, evaluate, and experiment. YC W23 🍓
Python SDK for AI agent monitoring, LLM cost tracking, benchmarking, and more. Integrates with most LLMs and agent frameworks including CrewAI, Agno, OpenAI Agents SDK, Langchain, Autogen, AG2, and CamelAI
Never stop coding. Free AI gateway: one endpoint, 160+ providers (50+ free), connect Claude Code, Codex, Cursor, Cline & Copilot to FREE Claude/GPT/Gemini. RTK+Caveman stacked compression saves 15-95% tokens, smart auto-fallback, MCP/A2A, multimodal APIs, Desktop/PWA.
Self-hosted AI agent orchestration platform: dispatch tasks, run multi-agent workflows, monitor spend, and govern operations from one mission control dashboard.
Orchestrate sandboxed coding agents in TypeScript with sandcastle.run()
All-in-One Sandbox for AI Agents that combines Browser, Shell, File, MCP and VSCode Server in a single Docker container.
Fast and Accurate Code Search for Agents. Uses ~98% fewer tokens than grep+read
Zep | Examples, Integrations, & More
TencentDB Agent Memory delivers fully local long-term memory for AI Agents via a 4-tier progressive pipeline, with zero external API dependencies.
autonomous harness engineering
AI observability platform for production LLM and agent systems.
Persistent memory system for AI coding agents. Agent-agnostic Go binary with SQLite + FTS5, MCP server, HTTP API, CLI, and TUI.
A meta-skill that designs domain-specific agent teams, defines specialized agents, and generates the skills they use.
The missing DevTools for Claude Code — inspect session logs, tool calls, token usage, subagents, and context window in a visual UI. Free, open source.
A Comprehensive Benchmark to Evaluate LLMs as Agents (ICLR'24)
Deepsec is a security harness for finding vulnerabilities in your codebase powered by coding agents
Agent harness to make your slop code well-engineered and beautiful.
Amazon Bedrock Agentcore accelerates AI agents into production with the scale, reliability, and security, critical to real-world deployment.
High-performance code intelligence MCP server. Indexes codebases into a persistent knowledge graph — average repo in milliseconds. 155 languages, sub-ms queries, 99% fewer tokens. Single static binary, zero dependencies.
Squad: AI agent teams for any project
A Unified Virtual Filesystem For AI Agents
Skills, MCP servers, Custom Agents, Agents.md for SDKs to ground Coding Agents
agent-sandbox enables easy management of isolated, stateful, singleton workloads, ideal for use cases like AI agent runtimes.
Sub-millisecond VM sandboxes for AI agents via copy-on-write forking
AI Agent Governance Toolkit — Policy enforcement, zero-trust identity, execution sandboxing, and reliability engineering for autonomous AI agents. Covers 10/10 OWASP Agentic Top 10.
Inspect: A framework for large language model evaluations
One brain, many harnesses. Portable .agent/ folder (memory + skills + protocols) that plugs into Claude Code, Cursor, Windsurf, OpenCode, OpenClaw, Hermes, or DIY Python — and keeps its knowledge when you switch.
Portable AI agent orchestration with mechanical protocol enforcement. 186 agents, zero runtime dependencies.
Use Claude Code's autonomous agent loop with DeepSeek V4 Pro, OpenRouter, or any Anthropic-compatible backend. Same UX, 17x cheaper.
Build, Evaluate, and Deploy GUI Agents — online RL training, standardized benchmarks, and real-device deployment in one framework.
Coding Agent singularly focused efficiency and context curation. Reduces API costs by 50-80% vs other agent AND improves the code quality at the same time. Uses Hash Anchored edits, massively parallel operations, AST manipulation and many many other optimizations. https://dirac.run/
Code and Data for Tau-Bench
a recursive self-improving harness designed to help your agents (and future iterations of those agents) succeed on any task
Give Claude Code a memory that evolves with your codebase. Hooks automatically capture sessions, the Claude Agent SDK extracts key decisions and lessons, and an LLM compiler organizes everything into structured, cross-referenced knowledge articles - inspired by Karpathy's LLM Knowledge Base architecture.
Weave is a toolkit for developing AI-powered applications, built by Weights & Biases.
Open-source AI agent harness in native Rust — GUI, CLI, headless, and webapp from one binary. Multi-provider, MCP, skills, plugins, agent teams.
Open-source, end-to-end platform for evaluating, observing, and improving LLM and AI agent applications. Tracing · Evals · Simulations · Datasets · Gateway · Guardrails. Self-hostable. Apache 2.0.
Reference code for the Meta-Harness paper.
The MCP server that turns Claude into the only coding agent hitting 100% on a real benchmark. -77% active tokens, -76% wall time, 0 losses across 96 tasks on Claude Opus 4.7. Structural code navigation + persistent memory. Works with every MCP client.