You Can't Retrieve What You Don't Know You're Missing
There is a field in one of the systems I work on that has been deprecated for years and comes back empty every single time you read it. But it is named in such a way that every agent thinks that is the one field that they should really really focus on. And it's always empty! Thus, the agent declares it's the wrong environment and starts searching for the correct one among the local containers, tries to see if it assumed the wrong launch config, and basically goes tokenmaxxing on a wild goose chase.
The agent did nothing dumb, but still failed. The field had a name that sounded like it mattered, so it used it. "But is this secretly deprecated tho?" was missing from the agent's idea of what questions were worth asking. This would get a human maybe once or twice tops. But it will get the poor agent every time I don't end the prompt with "...and watch for that field. It's deprecated but not marked as such". This is basically how you get the next HAL, the next Skynet, the next Butlerian Jihad...
This is a long-standing problem, and it is the one that no amount of better retrieval fixes. Yet almost everyone building agent memory right now is building better retrieval. We, the agentic AI gourmands, have a bit of a herd mentality. The sheer volume of buzz surrounding LLMs often turns a reasonable-sounding hypothesis into an assumed fact, cascading into widespread architectural fallacies.
We have seen this happen with RAG: people assumed that with enough cosine similarity (or HNSW nearest-neighbor search, etc.) the relevant parts of the context would elegantly surface. When this did not happen, it was "RAG is dead" or "Graph RAG is everything". In the process, RAG changed meaning, from simply augmenting generation via retrieval to running graph or vector searches with each prompt.
Ultimately, the puzzle of retrieving the context that the agent doesn't know it doesn't know has not been solved yet. I didn't solve it either. But I will propose a design pattern that I believe is quietly and spontaneously climbing its way towards becoming the de facto standard. I will lay it out along with a lightweight library that lets you observe for yourself why this pattern is empirically sound.
Unknown unknowns
I'm but a humble dev, and far be it from me to criticize Karpathy the magnifique. But he can at times be a great trigger of the herd mentality. He has recently come up with a lovely knowledge management paradigm called "LLM Wiki", a fairly decent pattern that stands on three legs: ingestion, querying and gardening. This three-pronged scheme has a glaring weakness that I'd like to criticize: it requires the agent to initiate the querying. The agent realizes it is unsure about something, forms a query, calls a tool, gets context back, carries on. Vector store, knowledge graph, grep over a notes folder, a Jira MCP, it does not matter. The pattern is the same. The agent decides it has a gap, then goes and fills it.
There is a precondition buried in that loop: the agent has to know the gap is there before it can pull. While recent models have shown improvement in this regard, you just can't rely on them as of now. LLMs are still notorious for making something up to fill the gap. If a model sees "topic" and "publish" in docs that never explicitly name the technology, it assumes Kafka and is like, "GCP who?". If it is supposed to pull a DLQ entry, it will waste a quarter of its context window trying to figure out the Kafka setup. It may even conclude the implementation is missing and start to implement one.
The half that is missing, then, is push: something that finds the fact the agent lacks and injects it in the background. Something that will ask "Oh are we doing CQRS? Alright let me go find out about the messaging architecture".
Somatic memory with hooks
Firstly, we need to briefly explore what hooks are. Hooks are runtime events that the harness exposes. These are usually things like "Stop", which fires when the LLM finishes responding, or "UserPromptSubmit", which fires when the user sends the prompt but before it reaches the model. The user can attach executables to these events. These executables run "automatically", without intentional invocation by the agent. For instance, every time my model makes a tool call, a programlet can check whether it contains banned commands. Or on SessionEnd, I can upload the conversation somewhere.
Just as MCP has become the standard for pull-oriented tools, an emerging surface for push tooling is the hooks that are exposed by most harnesses. We will be utilizing both. Anthropic first published the MCP standard; now it looks like their hooks are becoming the standard too. Below is a table of the foremost established harnesses and a few up-and-coming ones that adopt hooks, with some divergence in their standards:
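To make the prompt-submit hook concrete, here is a minimal sketch of an executable you could attach to it. It assumes the contract as I understand it from Claude Code: the harness sends a JSON payload on stdin with the submitted text under `prompt`, and whatever the script prints to stdout on a clean exit is added to the model's context. Other harnesses may differ. The `legacy_env` field name and the `KNOWN_TRAPS` table are made up for illustration.

```python
import json
import sys

# Hypothetical table of things the agent would never think to ask about.
KNOWN_TRAPS = {
    "legacy_env": (
        "Note: the `legacy_env` field is deprecated and always reads empty. "
        "An empty value does not mean you are in the wrong environment."
    ),
}


def build_context(payload: dict) -> str:
    """Return the notes whose trigger word appears in the submitted prompt."""
    prompt = str(payload.get("prompt", "")).lower()
    notes = [note for trigger, note in KNOWN_TRAPS.items() if trigger in prompt]
    return "\n".join(notes)


def main() -> None:
    raw = sys.stdin.read()
    if not raw.strip():
        return  # nothing on stdin, nothing to push
    try:
        payload = json.loads(raw)
    except json.JSONDecodeError:
        return  # a malformed payload should never get in the prompt's way
    context = build_context(payload)
    if context:
        print(context)  # assumed: stdout is what the harness injects


if __name__ == "__main__":
    main()
```

The agent never calls this. It runs on every prompt, and most of the time it prints nothing.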
| Harness | Cohort | User Submit hook equivalent | Wire contract vs Claude Code |
|---|---|---|---|
| Claude Code (Anthropic) | established | UserPromptSubmit | original gangster |
| Codex CLI (OpenAI) / Qwen Code (Alibaba) | established / 2026 | UserPromptSubmit | adopt it wholesale |
| Gemini CLI (Google) | established | BeforeAgent | adopts it, renames the events |
| Cursor (Anysphere) | established | beforeSubmitPrompt | its own vocabulary but maps to it |
| pi (Mario Zechner) | 2026 newcomer | input (pi.on("input"), can rewrite the prompt) | its own JS extension API, not stdin but adaptable |
| CodeWhale (formerly deepseek-tui) | 2026 newcomer | message_submit | adopting it now (issue #1364) |
Look down that last column. An increasing percentage of harnesses either take Claude Code's wire format wholesale or map a new vocabulary onto it, the most divergent example being pi. When I looked at 7 of the high-popularity harnesses that came out in 2026, I found that 5 of them have hooks comparable to Anthropic's. The rate drops the earlier you go. So this is something of an "underground" movement among the cooler people for now.
What makes it worth pointing at is how little noise it made on the way in. MCP arrived with a campaign behind it. "USB-C for LLMs" was the line, it was a good line, and within a season not adopting MCP had started to look like a decision you would be asked to defend. Hooks got none of that, no "this is now the standard", and the harnesses settled on Claude Code's contract regardless, one after another, each of them arriving at the same need on its own. An "emergent" standard, where the human was very much in the loop.
The retrieval step behind the hook is a free variable. It can be grep over a folder of markdown, a vector store, a knowledge graph, or a NotebookLM instance that is filled with past chats. Swap one for another and nothing above it has to move. You have your body of knowledge and a sense of what the unknown unknowns can be. The design pattern around it can stay relatively stable. In fact, let us do a minimal hands-on experiment with it.
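Here is one way to picture that free variable in code. This is a sketch of the idea, not the library's actual API: `Oracle`, `KeywordOracle`, `EmptyOracle` and `build_injection` are names I made up for illustration.

```python
from typing import Protocol


class Oracle(Protocol):
    """Anything that can turn a query into relevant snippets."""

    def search(self, query: str) -> list[str]: ...


class KeywordOracle:
    """Toy backend: an in-memory dict of keyword -> note."""

    def __init__(self, notes: dict[str, str]) -> None:
        self.notes = notes

    def search(self, query: str) -> list[str]:
        lowered = query.lower()
        return [note for keyword, note in self.notes.items() if keyword in lowered]


class EmptyOracle:
    """Stand-in for a store that has nothing to say."""

    def search(self, query: str) -> list[str]:
        return []


def build_injection(oracle: Oracle, prompt: str) -> str:
    """The layer above the oracle. It never cares which backend answered."""
    return "\n".join(oracle.search(prompt))
```

`build_injection` takes any object with a `search` method, so trading the dict for a vector store or a graph query only touches the class you pass in.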
A small library you can run
Back near the top I promised a small library you could run yourself to watch the pattern in operation. It is at: github.com/esinecan/somatic-memory
It is called somatic-memory for now. Not the best name, since the MCP half of it is the opposite of somatic. Somatic systems of the body work without our conscious input. The speed of your pulse increases as your muscle output does. It doesn't require you to think "Let me send more blood to my calves". Your iris reacts to light.
Which is why only one of the two components fits the name. The hook is truly "somatic". It fires on your prompt and pushes context in before the model reads a word, no action from the model needed. Like the iris reacting to light. But we do also squint when it's too bright, right? The other component is a plain MCP tool called recall, and the agent reaches for it the moment it knows it has a gap. That one is a deliberate, conscious query, the exact opposite of somatic (agentic?), and corresponds to the squint.
In the end, all this engine does is one small thing: given the conversation, hand back what is relevant. Because it only reads and never writes, the thing it reads from is entirely up to you: a folder, a vault, a database, a live search. The library calls each one an oracle. Treat them as scaffolding for your customizations. That said, all of them except the web oracle are legitimately usable as is.
| Oracle | What it reads |
|---|---|
| filesystem | a folder of text files, the zero-config default |
| obsidian | the same, plus it follows wikilinks one hop into linked notes you never searched for |
| sqlite | full-text search over a database something else filled |
| web | a live search fetch, for a source you neither own nor wrote |
Swap one row for another and nothing above it has to change. That is the free variable from earlier. I point my own setup at a cloud notebook I already keep, NotebookLM as it happens, but the filesystem oracle needs no cloud and no account at all. The plainest version is grep over a folder of notes, and that already does most of the work. It is a surprisingly effective approach that I will always stand by.
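For a feel of how little the plainest version needs, here is a rough sketch of grep over a folder of notes in pure Python. It is not the library's filesystem oracle, just an illustration. `grep_notes` and its parameters are hypothetical, and it only looks at `.txt` files.

```python
from pathlib import Path


def grep_notes(folder: Path, term: str, context: int = 2) -> list[str]:
    """Case-insensitive search over .txt files, keeping a few lines around each hit."""
    snippets = []
    needle = term.lower()
    for path in sorted(folder.rglob("*.txt")):
        lines = path.read_text(encoding="utf-8", errors="ignore").splitlines()
        for index, line in enumerate(lines):
            if needle in line.lower():
                start = max(0, index - context)
                end = min(len(lines), index + context + 1)
                header = f"{path.name}:{index + 1}"  # file name and 1-based line number
                snippets.append(header + "\n" + "\n".join(lines[start:end]))
    return snippets
```

Each snippet carries its file name and line number, so whatever reads the result can tell where a fact came from.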
The design space
The library above aims to be the simplest representation of a design pattern. To me, contextsmithing via harness hooks (setting aside the "which hook" question) has 4 basic axes. These are the immediate angles from which to fork and tailor the project above for yourself:
1. Personal against institutional: A continuity setup remembers one person across their own sessions and harnesses, the daemon that has been reading over your shoulder for months. An institutional one holds what a team or an org knows, the shared store every member's agents read from. The engine is the same. What changes is whose memory it is and who it answers for.
2. Cloud against local: The oracle can be a hosted thing you query over the network, a cloud notebook or a managed database, or it can be a folder of files on your own disk that you search with grep. Nothing in the pattern prefers one. A simple folder somewhere in C:/, containing just a bunch of lowly, barely past stone age txt files, that get ripgrepped, synthesized and pushed to the main agent's context in a "somatic" manner, is genuinely surprisingly beneficial.
3. Read only against bidirectional: Everything so far only reads. A bidirectional version also writes, noticing when you correct the agent and keeping that lesson for next time. The write path is where the real difficulty lives, telling a genuine correction from a passing change of mind and keeping the store from rotting as it fills, which is why it is a later piece and not this one.
4. Sync against async: A sync hook blocks the prompt from going to the LLM until the context is assembled. An async one fires and lets the context arrive some time later (depending on what you're having it do), while the agent is already working. A fully autonomous run, no human, can afford to block, and blocking means the agent does not start down a wrong path. But in interactive sessions a non-blocking somatic retrieval would be the smarter choice. The past context may arrive a little later in the background. I can afford that while I am actively monitoring.
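The fourth axis is the easiest one to show in code. The sketch below is a toy: `slow_retrieval` stands in for whatever your oracle does, and a real harness would deliver the late context through one of its own events, not through a Python future. All names here are made up.

```python
import time
from concurrent.futures import Future, ThreadPoolExecutor


def slow_retrieval(prompt: str) -> str:
    """Stand-in for an oracle call that takes a while."""
    time.sleep(0.05)
    return f"note relevant to: {prompt}"


def sync_hook(prompt: str) -> str:
    """Block until the context is assembled, then send it along with the prompt."""
    context = slow_retrieval(prompt)
    return f"{context}\n\n{prompt}"


_executor = ThreadPoolExecutor(max_workers=1)


def async_hook(prompt: str) -> tuple[str, Future]:
    """Let the prompt through untouched and hand back a handle to the pending context."""
    pending = _executor.submit(slow_retrieval, prompt)
    return prompt, pending
```

With `sync_hook` the model never sees a prompt without its context. With `async_hook` it starts working right away, and the caller decides when to collect `pending.result()`.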
A fair worry at this point is cost. A fresh retrieval pass on every prompt sounds like a lot. The push agent's brain is yours to size, though. At the cheap end it is regex and heuristic tomfoolery that never calls a model at all. At the other end it is the most capable, most "too dangerous to ship" frontier model you can find. Cost and speed move with that choice.
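At the cheap end, the whole "brain" can be a list of regexes. This is a minimal sketch under that assumption. The rules and the facts attached to them are invented for illustration, including the `legacy_env` field name and the Pub/Sub detail.

```python
import re

# Hypothetical rules: a pattern to watch for, and the fact to push when it shows up.
RULES = [
    (
        re.compile(r"\b(topic|publish|dlq)\b", re.IGNORECASE),
        "Messaging here runs on GCP Pub/Sub, not Kafka.",
    ),
    (
        re.compile(r"\blegacy_env\b", re.IGNORECASE),
        "The legacy_env field is deprecated and always reads empty.",
    ),
]


def cheap_push(prompt: str) -> list[str]:
    """Return every fact whose pattern matches the prompt. No model call involved."""
    return [fact for pattern, fact in RULES if pattern.search(prompt)]
```

This costs microseconds per prompt, and it only knows the traps someone bothered to write down. That is the trade you make at this end of the scale.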
This is just one framing of the architecture, and probably not a complete one. None of the four axes has a default answer either. At work you might want the whole team sharing a cloud, knowledge-graph oracle that can also do backflips. For your very first home assistant, a filesystem ripgrep that returns thirty lines on either side of each hit could be plenty.
The middle ground keeps getting cheaper. Slotting in a small local model to run smarter searches used to be a luxury, but with capable local models everywhere now it often costs less than letting the main model go exploring on its own. Let's walk through a couple of concrete setups:
My home setup could be categorized as "personal, cloud, read-only, async" based on these axes. For my and my wife's home PCs, I have NotebookLM instances that are asked up to 3 questions by a DeepSeek agent that has read the whole context. Once its questions are answered, the same DeepSeek agent synthesizes its tips and injects them back into the context between "<subconscious>" tags. It's a rather hefty one but it's great at creating the UX of one continuous identity. The agent only does read-only stuff. Our chats are ingested wholesale into the NotebookLM instances by a separate cron job.
When I was going to write this article, I scoured the land for similar architectures. It was a great sanity check to find Tomaz Bratanic's Neo4j-based setup, which has a very similar design. I can sense the 4 axes in his thinking as well: "personal, local, bidirectional, and sync". It is personal in the same sense mine is, one developer's memory carried across Claude Code, Codex and Cursor. It is local, a Neo4j DB he runs himself instead of a hosted service. He has a "dream phase" implementation that corresponds to the linting and ingestion phases of Papa Karpathy's "LLM Wiki", and similar to mine, it is decoupled from the hook + MCP remembering architecture. But because his hooks log every event into Neo4j, his setup is bidirectional.
I have also found some fellow travelers in CodeWhale. I liked it quite a lot as a community-driven tool and wanted to try it out, but its hooks could only watch a user message go by without changing it, so the somatic injection I lean on elsewhere had no clean way in. So I filed an issue (#1364) asking for two things: a message_submit hook that can actually rewrite the text before it reaches the model, and a turn-end event for the fire-and-forget critique pass. What happened next was the encouraging part. The maintainer (Hmbown) agreed it made sense, another contributor (LeXwDeX) argued the hook layer should just mirror Claude Code's contract one to one, a third (AresNing) wrote the first implementation, and within weeks the maintainer put it together into a PR. I did not have to push any of it along. The thread filled up with people who already wanted the same thing, which is that quiet convergence from the table happening right in front of me, one human-in-the-loop decision at a time.
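The shape of that home setup, stripped of every real service, looks roughly like the sketch below. The three callables are stand-ins for the question-writing agent, the notebook and the synthesis step. None of this is actual NotebookLM or DeepSeek API, and every name is hypothetical.

```python
from typing import Callable

MAX_QUESTIONS = 3  # the cap on questions per pass


def somatic_pass(
    conversation: str,
    propose_questions: Callable[[str], list[str]],
    ask_oracle: Callable[[str], str],
    synthesize: Callable[[list[tuple[str, str]]], str],
) -> str:
    """Ask up to three questions, gather the answers, wrap the synthesized tips in tags."""
    questions = propose_questions(conversation)[:MAX_QUESTIONS]
    answered = [(question, ask_oracle(question)) for question in questions]
    tips = synthesize(answered)
    return f"<subconscious>\n{tips}\n</subconscious>"


# Stubs so the sketch runs without any model or notebook behind it.
def fake_questions(conversation: str) -> list[str]:
    return ["q1", "q2", "q3", "q4"]


def fake_oracle(question: str) -> str:
    return f"answer to {question}"


def fake_synthesize(answered: list[tuple[str, str]]) -> str:
    return "\n".join(f"- {answer}" for _, answer in answered)
```

Calling `somatic_pass("...", fake_questions, fake_oracle, fake_synthesize)` returns a tagged block with three bullet lines, since the fourth question is dropped by the cap.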
Pull already has a name, agentic RAG, and the MCP tool in here is exactly that, kept for every time the agent does know it has a gap. The push half never got one, neglected maybe in the backwash of semantic RAG disillusionment. I call it somatic RAG. Agentic RAG, meet your other half. My setup runs both against the same memory stores, and I do not read them as a trade-off.
So go back to that field from the start. The agent that burned a third of its window chasing the wrong environment was not stupid and was not short on retrieval. It failed because the one fact that would have saved it, that the field is dead and comes back empty, was never in its idea of what to ask. No bigger window and no better embedding reaches that, because the query never fires. The only thing that reaches it is something already watching the conversation, probing near that fact, ready to rush it back the moment it surfaces. You cannot retrieve what you do not know you are missing. Something else has to push it to you.
This skeleton will get converged on from a lot of directions, with write-back, team versions, gardeners and ontologies layered on top. The code is at the repo, the name is a placeholder, and there is nothing to install. Clone it, point it at a folder of your own notes, and wait for it to surprise you with some nuance you forgot you knew.