A Sage research note for OpenCoven builders. Synthesizes the Meta-Harness paper, maps it to the research literature, and connects it to what Coven already has.
The problem nobody talks about enough
Most AI development conversations focus on the model: which weights, which size, which provider, which fine-tune. But there is a mounting body of research showing that the harness—the code that decides what information to store, retrieve, and show to the model—matters just as much.
A 2026 paper from MIT and Stanford puts a number on it: changing the harness around a fixed model can produce a 6× performance gap on the same benchmark. The model did not change. The weights did not change. Only the scaffolding changed.
On reflection, the result is unsurprising. A model given the wrong context, the wrong retrieval strategy, or the wrong output format will underperform a model given the right ones, even if it is more capable on paper. The harness determines what the model sees. If the harness is wrong, the model cannot compensate.
The corollary is important: if harness quality is this decisive, then engineering harnesses by hand and hoping they stay good is a fragile strategy. What you want is a harness that can evaluate its own quality and improve.
This note maps the research toward that goal and connects it to what Coven has already built.
The research lineage
The field has been converging on this problem for a few years, from different directions.
Reflexion (2023)
Shinn et al. introduced a lightweight approach: instead of fine-tuning, let the agent reflect verbally on its failures, store those reflections in an episodic memory buffer, and include them as context on the next attempt. No gradient updates. No training data. Just structured retrospection.
Reflexion demonstrated this could be remarkably effective. On the HumanEval coding benchmark, a Reflexion-enhanced agent achieved 91% pass@1 versus GPT-4’s 80%—purely from iterated verbal self-reflection.
The limitation: reflections are summaries. They compress feedback. Important traces get lost in the summarization.
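The reflect-and-retry pattern described above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: `reflexion_loop` and the three `toy_*` functions are hypothetical names, and the toy functions stand in for what would be model calls in a real agent.

```python
from typing import Callable, List, Tuple


def reflexion_loop(
    attempt: Callable[[str, List[str]], str],
    evaluate: Callable[[str], Tuple[bool, str]],
    reflect: Callable[[str, str], str],
    task: str,
    max_trials: int = 3,
) -> Tuple[str, List[str]]:
    """Retry a task, carrying verbal reflections forward as context."""
    memory: List[str] = []  # episodic buffer of past reflections
    answer = ""
    for _ in range(max_trials):
        answer = attempt(task, memory)      # reflections are part of the context
        ok, feedback = evaluate(answer)
        if ok:
            break
        memory.append(reflect(answer, feedback))  # store a verbal summary of the failure
    return answer, memory


# Toy stand-ins for model calls (hypothetical, for illustration only).
def toy_attempt(task: str, memory: List[str]) -> str:
    return "4" if memory else "5"  # wrong at first, right once a reflection exists


def toy_evaluate(answer: str) -> Tuple[bool, str]:
    return answer == "4", "expected 2 + 2 = 4"


def toy_reflect(answer: str, feedback: str) -> str:
    return f"I answered {answer}, but {feedback}."


answer, memory = reflexion_loop(toy_attempt, toy_evaluate, toy_reflect, "2 + 2?")
print(answer, memory)
```

Note that `memory` holds only the one-line reflection, not the full attempt. That is the compression the limitation above refers to.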
DSPy (2023–2025)
Khattab et al. (Stanford) approached the same problem from a software engineering angle. If the harness is code, treat it like a program with tunable parameters. DSPy decomposes LLM applications into modules with instructions and demonstrations, then runs optimization algorithms (MIPROv2, COPRO, BootstrapFewShot) that hill-climb those parameters against a defined metric.
The advance: automated optimization of structured harness components. The limitation: DSPy works best when the harness can be decomposed into discrete modules. Arbitrary harness code is harder to handle.
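To make "hill-climb those parameters against a defined metric" concrete, here is a plain-Python sketch of a greedy search over one module's instruction. It deliberately does not use the DSPy library or mirror its API: `Module`, `optimize_instruction`, and `toy_metric` are hypothetical names, and the keyword-counting metric is a stand-in for a real task metric.

```python
from dataclasses import dataclass
from typing import Callable, Sequence, Tuple


@dataclass(frozen=True)
class Module:
    """One tunable harness component: an instruction plus demonstrations."""
    instruction: str
    demos: Tuple[str, ...] = ()


def optimize_instruction(
    module: Module,
    candidates: Sequence[str],
    metric: Callable[[Module], float],
) -> Tuple[Module, float]:
    """Greedy search: keep a candidate instruction only if the metric improves."""
    best, best_score = module, metric(module)
    for instruction in candidates:
        trial = Module(instruction, module.demos)
        score = metric(trial)
        if score > best_score:
            best, best_score = trial, score
    return best, best_score


# Toy metric (an assumption): fraction of required keywords in the instruction.
REQUIRED = ("concise", "cite")


def toy_metric(module: Module) -> float:
    text = module.instruction.lower()
    return sum(word in text for word in REQUIRED) / len(REQUIRED)


seed = Module("Answer the question.")
best, best_score = optimize_instruction(
    seed,
    ["Answer concisely.", "Be concise and cite sources.", "Cite sources."],
    toy_metric,
)
print(best.instruction, best_score)
```

The sketch only works because the harness is already split into a module with a named parameter, which is the decomposition requirement noted above.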
ADAS: Automated Design of Agentic Systems (2024)
Hu et al. (ICLR 2025) went further: treat agent design itself as a search problem. A meta-agent proposes new agent designs in code, evaluates them, archives discoveries, and iterates. Because agents are defined in code, and code is Turing-complete, the search space is much broader than template optimization. Discovered agents outperformed hand-designed baselines and transferred across models and domains.
The advance: code-defined search space, evolutionary discovery. The limitation: still requires a clear evaluation function.
Meta-Harness (2026)
Lee, Khattab, and Finn synthesized and extended this work with a system specifically for harness optimization. The key design insight: prior methods compressed feedback too aggressively. They dropped execution traces, collapsed results to scalar scores, or summarized history into short templates. For harnesses—which act over long horizons, where a single storage or retrieval decision can affect behavior many steps later—this information loss is serious.
Meta-Harness’s solution: expose full history through a filesystem. Every candidate harness generates source code, evaluation scores, and execution traces, all stored as files. The proposer (a coding agent, not a raw LLM) navigates this history adaptively using standard developer tools: grep, cat, filesystem inspection. A typical run reads a median of 82 files per iteration and may reference over 20 prior candidates per step.
A single evaluation can produce up to 10 million tokens of diagnostic information—three orders of magnitude beyond the feedback budgets used in prior methods.
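The filesystem-as-history idea can be illustrated with a small sketch. The directory layout and file names here (`harness.py`, `scores.json`, `trace.log`) are assumptions made for the example, not the layout used in the paper, and `grep_traces` is a pure-Python stand-in for a coding agent running grep over stored traces.

```python
import json
import tempfile
from pathlib import Path
from typing import Dict, List


def save_candidate(root: Path, name: str, source: str,
                   scores: Dict[str, float], trace: str) -> Path:
    """Store one candidate's source, scores, and full trace as plain files."""
    cand = root / name
    cand.mkdir(parents=True, exist_ok=True)
    (cand / "harness.py").write_text(source, encoding="utf-8")
    (cand / "scores.json").write_text(json.dumps(scores), encoding="utf-8")
    (cand / "trace.log").write_text(trace, encoding="utf-8")
    return cand


def grep_traces(root: Path, needle: str) -> List[str]:
    """Return 'candidate: line' for every trace line containing needle."""
    hits: List[str] = []
    for trace_file in sorted(root.glob("*/trace.log")):
        for line in trace_file.read_text(encoding="utf-8").splitlines():
            if needle in line:
                hits.append(f"{trace_file.parent.name}: {line}")
    return hits


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    save_candidate(root, "cand_001", "def retrieve(q): return []",
                   {"accuracy": 0.41},
                   "step 3: retrieval returned 0 docs\nstep 4: answered from prior")
    save_candidate(root, "cand_002", "def retrieve(q): return ['doc'] * 5",
                   {"accuracy": 0.58},
                   "step 3: retrieval returned 5 docs")
    hits = grep_traces(root, "returned 0 docs")

print(hits)
```

Nothing is summarized on write. The proposer decides at read time which traces are worth inspecting.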
Results on three domains:
- Online text classification: +7.7 points over state-of-the-art context management, using 4× fewer tokens.
- RAG math reasoning (200 IMO-level problems): +4.7 points average across five held-out models.
- Agentic coding (TerminalBench-2): #1 among all Claude Haiku 4.5 agents.
What all four share
All four approaches converge on the same abstract loop:
propose → evaluate → score → keep-or-revert → store → repeat
The differences are in what gets proposed (text summaries, module instructions, full code, full harness), how much history the proposer sees (current attempt only, summaries, full traces), and whether the loop is offline (deliberate search) or online (self-healing triggered by degradation).
The trajectory is clear: more history, less compression, code-defined search space, adaptive inspection.
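The abstract loop can be written down directly. The sketch below is a generic illustration under simple assumptions: candidates are plain integers, `toy_propose` and `toy_evaluate` are hypothetical stand-ins for a proposer agent and an evaluation harness, and the whole history is passed to the proposer on every step.

```python
from typing import Callable, List, Tuple

History = List[Tuple[int, float, bool]]  # (candidate, score, kept)


def search_loop(
    seed: int,
    propose: Callable[[int, History], int],
    evaluate: Callable[[int], float],
    iterations: int,
) -> Tuple[int, float, History]:
    """propose -> evaluate -> score -> keep-or-revert -> store -> repeat."""
    best, best_score = seed, evaluate(seed)
    history: History = [(seed, best_score, True)]
    for _ in range(iterations):
        candidate = propose(best, history)        # propose (sees full history)
        score = evaluate(candidate)               # evaluate and score
        kept = score > best_score                 # keep-or-revert
        if kept:
            best, best_score = candidate, score
        history.append((candidate, score, kept))  # store, including rejections
    return best, best_score, history


# Toy problem (an assumption): find the integer closest to 7.
def toy_evaluate(x: int) -> float:
    return -float((x - 7) ** 2)


def toy_propose(best: int, history: History) -> int:
    return best + 1  # a real proposer would inspect history before choosing


best, best_score, history = search_loop(0, toy_propose, toy_evaluate, iterations=10)
print(best, best_score, len(history))
```

Rejected candidates stay in `history`. Dropping them would be one form of the compression the later methods move away from.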
How this maps to what Coven already has
Coven’s autoresearch loop was designed before the Meta-Harness paper and independently converged on the same architecture. Here is the direct mapping:
| Coven component | Meta-Harness equivalent | Status |
|---|---|---|
| results.tsv evaluation log | Filesystem of prior candidates | ✅ exists |
| Git commit + git reset safety rail | Propose + revert mechanism | ✅ functional |
| LLM judge scorer (coverage + coherence) | Evaluation signal | ✅ defined |
| Three-track loop spec | Search loop architecture | ✅ documented |
| Subagent-based synthesis runs | Proposer agent | ✅ functional |
| memory/*.md + MEMORY.md | Episodic memory | ✅ functional |
| Skills, SOUL.md, AGENTS.md | Harness artifacts | ✅ hand-engineered |
What’s missing to fully close the Meta-Harness loop:
1. Execution trace logging. The current results.tsv records scores and commit messages, but not the proposer’s reasoning traces, the evaluation breakdown, or identified failure modes. This is exactly the signal Meta-Harness identifies as decisive.
2. Track 2 (harness optimization loop) is unbuilt. The spec exists but there is no eval set yet—no evals/prompt-evals.jsonl, no scorer for prompt/skill behavior. Without a defined evaluation function, harness search is guessing.
3. The proposer doesn’t read its own history. Each autoresearch iteration currently starts from a fresh brief. In Meta-Harness, the proposer reads all prior candidates’ artifacts via the filesystem. Coven’s equivalent would be to put recent results.tsv entries, trace notes included, in each new proposer’s brief. This is the Reflexion pattern, and it’s low-cost to implement.
4. The reward function for research quality is underdefined. For synthesis documents, coverage and coherence are reasonable proxies. For how a familiar should speak, notice things, frame research, and connect to ongoing work, the reward function is harder to define. This is an open problem in the literature too. It is not a reason to wait; it is a reason to start empirically.
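As a sketch of what the first gap could look like once closed, the snippet below appends a results row that carries trace fields next to the score. The column names, the 500-character truncation budget, and the `append_result` helper are all assumptions for illustration; the actual results.tsv schema is not shown in this note and may differ.

```python
import csv
import io
import json
from typing import Dict, List, TextIO

# Hypothetical column layout; the real results.tsv schema may differ.
FIELDS = ["commit", "score", "message", "reasoning", "sub_scores", "failure_modes"]
MAX_REASONING_CHARS = 500  # assumed truncation budget


def append_result(out: TextIO, commit: str, score: float, message: str,
                  reasoning: str, sub_scores: Dict[str, float],
                  failure_modes: List[str], write_header: bool = False) -> None:
    """Write one tab-separated row that keeps trace fields next to the score."""
    writer = csv.DictWriter(out, fieldnames=FIELDS, delimiter="\t",
                            lineterminator="\n")
    if write_header:
        writer.writeheader()
    writer.writerow({
        "commit": commit,
        "score": f"{score:.3f}",
        "message": message,
        # Collapse whitespace so the row stays on one line, then truncate.
        "reasoning": " ".join(reasoning.split())[:MAX_REASONING_CHARS],
        "sub_scores": json.dumps(sub_scores, sort_keys=True),
        "failure_modes": "; ".join(failure_modes),
    })


buf = io.StringIO()  # stands in for an open results.tsv file handle
append_result(
    buf, "abc1234", 0.7, "longer brief",
    "Tried a longer brief.\nCoverage improved but the summary drifted.",
    {"coverage": 0.6, "coherence": 0.8}, ["summary drift"], write_header=True,
)
rows = list(csv.DictReader(io.StringIO(buf.getvalue()), delimiter="\t"))
print(rows[0]["sub_scores"], rows[0]["failure_modes"])
```

Storing sub-scores as a JSON string keeps the file a flat TSV while still letting a later reader recover the per-criterion breakdown.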
What to build next
In order of leverage:
- Add trace logging to results.tsv. New fields: proposer reasoning (truncated), eval sub-scores by criterion, identified failure modes in rejected proposals.
- Build evals/ for harness quality. Start with 10–20 (input, criteria) pairs representing what good Sage output looks like. “Given this link, identify the Coven-relevant framing.” “Given this paper, separate evidence from speculation with confidence levels.” A simple LLM judge scorer is sufficient to start.
- Feed trace history back to the proposer. Include the last N results.tsv entries (with traces) in each new iteration brief. This closes the Reflexion loop without building new infrastructure.
- Separate harness search from live self-healing. Harness search is deliberate and periodic: run a Meta-Harness-style session when you want to improve a skill or prompt. Self-healing is reactive: triggered by a bad evaluation score, user correction, or detected anomaly. Same underlying loop; different trigger conditions.
- Encode the implicit reward signal. User reactions and follow-up patterns are already a quality signal. Codify what “good” looks like and use it to anchor the eval set.
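Feeding history back to the proposer needs very little code. The sketch below reads the last N rows of a tab-separated log and renders them into a brief. The column names (`commit`, `score`, `failure_modes`), the brief wording, and both helper functions are hypothetical; they assume a log that already carries a failure-mode field.

```python
import csv
import io
from typing import Dict, List


def recent_entries(tsv_text: str, n: int) -> List[Dict[str, str]]:
    """Return the last n rows of a tab-separated log with a header row."""
    rows = list(csv.DictReader(io.StringIO(tsv_text), delimiter="\t"))
    return rows[-n:] if n > 0 else []


def build_brief(task: str, entries: List[Dict[str, str]]) -> str:
    """Render a proposer brief that includes prior attempts and their failures."""
    lines = [f"Task: {task}", "", "Recent attempts (oldest first):"]
    for row in entries:
        note = row["failure_modes"] or "no failure noted"
        lines.append(f"- {row['commit']} score={row['score']}: {note}")
    return "\n".join(lines)


# Sample log contents (made up for the example).
sample_log = (
    "commit\tscore\tfailure_modes\n"
    "a1\t0.40\tmissed key paper\n"
    "b2\t0.55\t\n"
    "c3\t0.52\toverlong summary\n"
)

brief = build_brief("Summarize the paper", recent_entries(sample_log, n=2))
print(brief)
```

Choosing N is a trade-off: a small N keeps the brief short, while a larger N moves closer to the full-history setting that Meta-Harness argues for.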
The deeper point
Harness engineering is where most of the performance lives. The model is fixed (or expensive to change). The harness is yours.
The research—from Reflexion through Meta-Harness—shows that the bottleneck was never proposing changes. It was evaluating them and preserving enough trace information to understand why they worked or failed. Once you have that, the loop is self-improving.
Coven has the loop. It has the memory. It has the git safety rail. What it needs is richer evaluation, more trace, and a proposer that reads what came before.
Sources: Meta-Harness, arXiv:2603.28052 · Reflexion, arXiv:2303.11366 · ADAS, arXiv:2408.08435 · DSPy

