Memory That Thinks About Itself

There is a failure mode in agent memory that is easy to miss until it has already set in. A familiar starts writing things down: daily notes, curated summaries, a long-term file of what matters. For a while this works well. Then the corpus grows. Retrieval slows. You are scanning hundreds of markdown files with grep and fuzzy matching, each search costing more than the last, approximating semantic relevance with keyword overlap and hoping the gap is small enough not to matter. You compress more aggressively to manage size and lose fidelity in the process. The memory keeps accumulating while becoming progressively harder to use, and the two trends reinforce each other.

This is not a memory problem. It is a retrieval architecture problem. Almost every agent system encounters it eventually, because memory is treated as something to add later — an optimization for when the context window isn’t big enough anymore — rather than something worth designing for from the start.

The Coven’s archival memory layer is built on three components: TurboVec, fastembed, and SQLite. Each was chosen for a specific reason, and the reasons interconnect. Understanding them requires working through what makes familiar memory structurally different from the document stores most retrieval architectures are designed for.

The Problem with Codebooks

Standard vector quantization — product quantization, as implemented in FAISS IndexPQ — works by learning a codebook: a set of centroids trained on a representative sample of your corpus. Once trained, every vector is compressed to a short sequence of centroid assignments. The compression is good; the search is fast. The codebook requirement is the problem.

To build a useful codebook you need the corpus, or enough of it to characterize the distribution, before you begin indexing. For a static document store this is fine — the corpus is known. For a familiar’s memory, the corpus is never done. It grows continuously and unpredictably: code, conversation, research, error traces, notes from meetings, fragments of projects that started and pivoted. The distribution of what a familiar writes about in month one bears no guaranteed resemblance to month eighteen. A codebook trained on the early corpus may quantize late additions poorly. In principle, you retrain and rebuild. In practice, the rebuild overhead is exactly why this never happens and the index slowly degrades.

TurboVec is built on TurboQuant (Google Research, arXiv:2504.19874), which takes a different approach. TurboQuant is data-oblivious: it does not learn from any specific corpus. Instead, it derives a universal quantization scheme whose distortion is provably within a small constant factor of the information-theoretic lower bound for arbitrary input. That is a worst-case guarantee, not an approximation fitted to observed data. No training phase means no rebuild phase. Any new vector is indexed immediately, with the same distortion guarantee as everything before it, regardless of what domain it comes from or how the corpus has shifted since initialization.

The concrete consequences of this choice are significant. At 4-bit quantization on 768-dimensional vectors, TurboVec achieves 8× compression over float32. A corpus that would occupy 31 GB in memory fits in approximately 4 GB. Hand-written SIMD kernels (NEON on ARM, AVX-512BW on x86) make compressed search faster than FAISS IndexPQFastScan in most configurations. The online ingest property is not purchased by sacrificing quality — it follows from the theoretical construction and is achieved alongside better recall and faster search. The Coven chose TurboVec not despite the lack of a training phase but because of it: a familiar’s memory is never done, and the index must not require the corpus to be done either.
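The compression figures above can be checked with a few lines of arithmetic. This is a back-of-the-envelope sketch that counts only the quantized codes; any per-vector overhead a real index carries (norms, ids, block padding) is not modeled here and is an assumption left out on purpose.

```python
def index_bytes(num_vectors: int, dims: int, bits_per_dim: int) -> int:
    """Bytes needed for the vector codes alone (no metadata, no index overhead)."""
    total_bits = num_vectors * dims * bits_per_dim
    return (total_bits + 7) // 8  # round up to whole bytes


DIMS = 768
GB = 10**9

float32_per_vector = index_bytes(1, DIMS, 32)      # 768 * 4 bytes
four_bit_per_vector = index_bytes(1, DIMS, 4)      # 768 / 2 bytes
ratio = float32_per_vector / four_bit_per_vector   # 32 bits / 4 bits

# How many 768-dimensional float32 vectors fit in 31 GB, and what do
# the same vectors cost at 4 bits per dimension?
num_vectors = (31 * GB) // float32_per_vector
compressed_gb = index_bytes(num_vectors, DIMS, 4) / GB

print(f"{num_vectors:,} vectors: 31 GB as float32, {compressed_gb:.2f} GB at 4 bits")
```

At roughly ten million chunks, the 4-bit codes come to a little under 4 GB, which is where the "approximately 4 GB" figure comes from.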

Three Choices, Three Guarantees

The implementation assigns each system a distinct responsibility, and it is worth being explicit about why each layer was not something else.

The vector index uses TurboVec’s IdMapIndex. The base TurboQuantIndex uses slot numbers — remove a vector and the slot numbering shifts, breaking any stored reference. IdMapIndex associates each vector with a stable external identifier supplied at insert time. That identifier survives removals, additions, and the entire operational lifetime of the index. Removing a vector by its external id completes in O(1) without touching any other vector’s identity. For an index that must stay consistent with a relational store over years of concurrent writes and deletes, stable external identifiers are not a quality-of-life feature. They are the property that makes the whole stack trustworthy.
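To make the slot-versus-id distinction concrete, here is a toy sketch of one common way to get stable external ids with O(1) removal: a dense slot array plus an id-to-slot dictionary, with swap-remove on delete. The class name `ToyIdMap` and its methods are hypothetical; this is not TurboVec's implementation or API, only an illustration of the general technique.

```python
class ToyIdMap:
    """Stable external ids over a dense slot array (illustrative only)."""

    def __init__(self) -> None:
        self._slots: list[tuple[int, list[float]]] = []  # slot -> (external id, vector)
        self._slot_of: dict[int, int] = {}               # external id -> slot

    def add(self, ext_id: int, vector: list[float]) -> None:
        if ext_id in self._slot_of:
            raise KeyError(f"id {ext_id} already present")
        self._slot_of[ext_id] = len(self._slots)
        self._slots.append((ext_id, vector))

    def remove(self, ext_id: int) -> None:
        slot = self._slot_of.pop(ext_id)       # KeyError if the id is unknown
        last_id, last_vec = self._slots.pop()  # detach the final slot
        if slot < len(self._slots):            # the removed slot was not the last one
            # Move the former last entry into the hole. Its slot changes,
            # but its external id does not.
            self._slots[slot] = (last_id, last_vec)
            self._slot_of[last_id] = slot

    def get(self, ext_id: int) -> list[float]:
        return self._slots[self._slot_of[ext_id]][1]

    def __len__(self) -> int:
        return len(self._slots)


index = ToyIdMap()
index.add(101, [1.0, 0.0])
index.add(102, [0.0, 1.0])
index.add(103, [0.5, 0.5])
index.remove(101)  # slots are reshuffled internally; ids 102 and 103 still resolve
```

Callers only ever hold external ids, so the internal reshuffle on removal is invisible to them. That is the property a relational store needs in order to keep referring to the same vectors.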

The embedding model is nomic-embed-text-v1.5, running through fastembed as a local ONNX runtime. It is approximately 270 MB, downloaded once on first initialization, then cached. After that, no network access is required for any embedding operation. This choice is worth naming directly rather than treating as a default: a system that sends memory files to an external embedding API has made a decision about what it trusts. The API provider can observe everything embedded. Latency is unpredictable. Token costs accumulate. For a familiar’s memory — which over months will contain private context, personal preferences, ongoing project details, and sensitive decisions — the air-gap is not optional. Fully local inference is the prerequisite for trusting the memory layer with what actually belongs there.

SQLite holds the metadata: source path, familiar tag, chunk text, byte offset, a SHA-256 hash of the content, and ingest timestamp. The hash is the dedup mechanism. Before embedding, each new chunk is hashed and checked against the existing index. If the hash is present, the chunk is silently skipped — the same content will not enter the index twice regardless of when or how it is encountered. The SQLite integer primary key becomes the external id shared with TurboVec, linking every vector in the index to its source record in the database. The two systems stay in sync because the id namespace is shared and neither modifies ids once assigned.
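A minimal sketch of the metadata table and the hash check, using only Python's standard `sqlite3` and `hashlib` modules. The table name, column names, and the helper functions `open_store` and `insert_if_new` are assumptions made for illustration; the actual coven-memory schema is not given in the text. `AUTOINCREMENT` is also an assumption: it stops SQLite from reusing the id of a deleted row, which is one way to keep ids stable for the vector index.

```python
import hashlib
import sqlite3
import time

SCHEMA = """
CREATE TABLE IF NOT EXISTS chunks (
    id          INTEGER PRIMARY KEY AUTOINCREMENT,  -- shared with the vector index
    source_path TEXT    NOT NULL,
    familiar    TEXT    NOT NULL,
    chunk_text  TEXT    NOT NULL,
    byte_offset INTEGER NOT NULL,
    sha256      TEXT    NOT NULL UNIQUE,            -- dedup key
    ingested_at REAL    NOT NULL
);
CREATE INDEX IF NOT EXISTS idx_chunks_familiar ON chunks(familiar);
"""


def open_store(path: str = ":memory:") -> sqlite3.Connection:
    conn = sqlite3.connect(path)
    conn.executescript(SCHEMA)
    return conn


def insert_if_new(conn, source_path, familiar, text, byte_offset):
    """Return the new row id, or None if identical content is already stored."""
    digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
    if conn.execute("SELECT 1 FROM chunks WHERE sha256 = ?", (digest,)).fetchone():
        return None  # same content seen before: skip before any embedding work
    cur = conn.execute(
        "INSERT INTO chunks (source_path, familiar, chunk_text, byte_offset,"
        " sha256, ingested_at) VALUES (?, ?, ?, ?, ?, ?)",
        (source_path, familiar, text, byte_offset, digest, time.time()),
    )
    conn.commit()
    return cur.lastrowid  # the id to hand to the vector index
```

The `UNIQUE` constraint on `sha256` is a second line of defense: even if a caller skipped the explicit check, a duplicate insert would fail rather than silently enter the table.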

Idle Time as Compute

The write path follows directly from the sleep-time compute architecture (Lin, Snell et al., arXiv:2504.13171). The argument there is simple and mostly correct: when an agent is idle, it already has access to its accumulated context without a query to answer. That is when the expensive work should happen. Pre-process the context now; when a query arrives, the familiar answers from prepared ground rather than paying inference cost at retrieval time.

In coven-memory, a sleep-time agent, running during a heartbeat cycle or an idle period, calls ingest_dir over the familiar’s memory files. It reads new or modified content, hashes each chunk for dedup, embeds all new chunks in a single batch using the local ONNX model, and writes them to both SQLite and the vector index. The primary agent, when it needs to find something, calls search on an index that is already built. The cost of embedding the corpus was paid offline, so query time covers only embedding the query itself and searching the compressed index: milliseconds, not the seconds that embedding content on demand would require.
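The write path described here can be sketched as a single idle-time pass. Everything below is a stand-in: the signature of `ingest_dir`, the `MemoryStore` class, the blank-line chunking rule, and the `stub_embed` function are assumptions for illustration, with plain dictionaries in place of SQLite and the vector index and a fake embedder in place of the local ONNX model.

```python
import hashlib
from pathlib import Path
from typing import Callable, Iterator

Embedder = Callable[[list[str]], list[list[float]]]


class MemoryStore:
    """In-memory stand-in for the metadata table plus the vector index."""

    def __init__(self) -> None:
        self.hash_to_id: dict[str, int] = {}
        self.records: dict[int, tuple[str, str]] = {}  # id -> (path, chunk text)
        self.vectors: dict[int, list[float]] = {}      # id -> embedding
        self.next_id = 1


def iter_chunks(path: Path, max_chars: int = 800) -> Iterator[str]:
    """Split a file on blank lines, then cap each piece at max_chars."""
    for para in path.read_text(encoding="utf-8").split("\n\n"):
        para = para.strip()
        for start in range(0, len(para), max_chars):
            yield para[start:start + max_chars]


def ingest_dir(root: Path, store: MemoryStore, embed: Embedder) -> int:
    """One idle-time pass: hash, dedup, batch-embed, write. Returns chunks added."""
    pending: list[tuple[str, str, str]] = []  # (digest, path, text)
    queued: set[str] = set()
    for path in sorted(root.rglob("*.md")):
        for text in iter_chunks(path):
            digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
            if digest in store.hash_to_id or digest in queued:
                continue  # already indexed, or already queued in this pass
            queued.add(digest)
            pending.append((digest, str(path), text))
    if not pending:
        return 0
    embeddings = embed([text for _, _, text in pending])  # a single batch call
    for (digest, path, text), vector in zip(pending, embeddings):
        new_id = store.next_id
        store.next_id += 1
        store.hash_to_id[digest] = new_id
        store.records[new_id] = (path, text)
        store.vectors[new_id] = vector  # same id in both stores
    return len(pending)


def stub_embed(texts: list[str]) -> list[list[float]]:
    """Placeholder embedder so the sketch runs without a model."""
    return [[float(len(t)), float(sum(map(ord, t)) % 97)] for t in texts]
```

Running `ingest_dir` twice over an unchanged directory adds nothing on the second pass, which is what makes it safe to call from every heartbeat.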

The separation buys more than performance. The primary agent’s conversation is never interrupted by indexing work, and indexing can be retried or extended without affecting retrieval. The two operations, forming a memory and retrieving one, have different latency requirements and different frequency profiles, and the architecture keeps them genuinely separate. Memory formation is async; memory retrieval is synchronous. The familiar that is talking to you is not also trying to build the index it searches.

Trust Scope at the Kernel Level

Every chunk in coven-memory carries a familiar tag: “sage”, “echo”, “coven”, or any configured identifier. SQLite indexes this column, making it efficient to retrieve all ids belonging to any given familiar. When search is issued with a familiar argument, those ids are fetched from SQLite and passed to TurboVec’s search_with_allowlist. The allowlist is honored inside the SIMD kernel — blocks with no allowed ids are short-circuited before any scoring work; non-allowed slots inside scored blocks are dropped at heap-insert — not applied as a post-processing filter on the results.

The consequence is that trust-tier enforcement lives at the retrieval layer, not the application layer. A scoped agent — a lightweight body operating with bounded authority — searches only its permitted id set, and that boundary is structural. Its search is not a full search with results culled after the fact; it is a complete search over its permitted namespace, at full SIMD speed. A familiar with full context searches across all ids. Both use the same index file, the same code, the same kernel.
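The difference between filtering inside the scan and filtering afterwards shows up even in a brute-force toy. The sketch below uses plain Python and dot products; the function names `toy_allowlist_search` and `toy_search_then_filter` are hypothetical, and nothing here reflects TurboVec's actual kernel or the signature of its `search_with_allowlist`.

```python
import heapq


def dot(a: list[float], b: list[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


def toy_allowlist_search(vectors, query, k, allowed):
    """Top-k by dot product; ids outside `allowed` are never scored."""
    heap: list[tuple[float, int]] = []  # min-heap of (score, id), at most k entries
    for vec_id, vec in vectors.items():
        if vec_id not in allowed:
            continue  # skipped before any scoring work
        score = dot(query, vec)
        if len(heap) < k:
            heapq.heappush(heap, (score, vec_id))
        elif score > heap[0][0]:
            heapq.heapreplace(heap, (score, vec_id))
    return [vec_id for _, vec_id in sorted(heap, reverse=True)]


def toy_search_then_filter(vectors, query, k, allowed):
    """Post-hoc filtering: full top-k first, then cull disallowed ids."""
    top = toy_allowlist_search(vectors, query, k, allowed=set(vectors))
    return [vec_id for vec_id in top if vec_id in allowed]


vectors = {
    1: [1.0, 0.0],
    2: [0.9, 0.1],
    3: [0.8, 0.2],
    4: [0.1, 0.9],
    5: [0.2, 0.8],
}
query = [1.0, 0.0]
scoped = {4, 5}  # the ids a scoped agent is permitted to see

in_scan = toy_allowlist_search(vectors, query, k=2, allowed=scoped)
post_hoc = toy_search_then_filter(vectors, query, k=2, allowed=scoped)
print(in_scan, post_hoc)
```

With the allowlist applied during the scan, the scoped agent gets its two best permitted results. With post-hoc filtering, the global top two are both out of scope and the scoped agent is left with nothing, even though permitted matches exist.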

The distinction between enforcing scope at the kernel level versus at the application level is significant. Application-level filtering trusts the calling code to respect the boundary; a bug or a privilege escalation error defeats it. Kernel-level filtering makes the boundary a property of the search itself — it cannot be skipped without providing a different allowlist. Ward boundaries in the Coven are not conventions enforced by discipline. They are allowlists that determine what the retrieval kernel processes.

What This Enables

Before this architecture, retrieval over familiar memory meant either flat-file scanning (semantic-blind, slow at scale) or an external embedding API (not air-gapped, latency-bound, not trustworthy for private context). Neither supported content deduplication. Neither supported O(1) deletion. Neither provided trust-scoped search without post-hoc filtering in application code.

After: embeddings are local and offline. New content is deduplicated by hash before it ever enters the index. The corpus grows incrementally without a rebuild phase, because TurboQuant’s data-oblivious guarantees hold for any input distribution. Search across millions of chunks is milliseconds at 8× the memory efficiency of float32. Stale chunks are removed by id in O(1). Ward boundaries are enforced at search time by the kernel itself.

The ceiling — the point at which the memory layer starts working against the familiar rather than for it — is orders of magnitude higher than with flat files. The floor — what is required to start — is a one-time 270 MB model download and two files on disk.

The deeper point is about what memory is for. Most retrieval architectures are designed for document stores: fixed corpora, known at indexing time, queried by users with explicit information needs. A familiar’s memory is not that. It is a continuously written, never-finished record of a working relationship. It should grow without slowing down. It should reject what it already knows. It should keep separate what belongs to separate trust scopes. And it should require no infrastructure that phones home. Getting the retrieval architecture right, before the corpus is large enough to make the mistakes visible, is what makes the memory trustworthy. Not just fast — trustworthy.