Blog/Engineering

The Search Session Odyssey: Two Weeks Inside Codex and Claude

What we learned chasing AI memory across every tool that promised to remember

2025-05-12  ·  8 min read

The Problem Sounds Simple

Every developer building with AI eventually hits the same wall. Your agent had a great conversation yesterday. Today it remembers nothing.

We hit that wall in February. We were building an internal research assistant — a Claude-powered agent that needed to track evolving knowledge about our market, our competitors, our product decisions. The kind of thing where context compounds. Where what you knew last Tuesday matters for what you decide next Thursday.

The pitch was obvious. The plumbing was not.


Week One: Codex

We started with OpenAI's Codex. At the time it was the default choice for anyone who wanted an AI that could write and execute code, and we figured if anyone had solved persistent context, it was them.

The interface is elegant. You describe a task, Codex executes it, and it feels like pair programming with someone who never gets tired. For self-contained tasks — "refactor this function," "write tests for this module" — it is genuinely excellent.

But the moment you ask it to remember something across sessions, you run into the fundamental design choice: Codex treats every session as a clean slate. There is no state. There is no store. There is a context window, and when the window closes, the world ends.

We tried every workaround in the playbook.

Attempt 1: Dump state to a file, load it next session. This works for structured data. It falls apart the moment your "state" is nuanced — a web of relationships, evolving beliefs, soft context. Serializing a knowledge graph into a flat file and deserializing it faithfully is, itself, a research problem.

Attempt 2: Prepend a summary to every prompt. We wrote a summarizer that distilled the previous session into a paragraph. Codex would read it at the start of each new session. It worked about 60% of the time. The other 40%: the summary was slightly wrong, the model anchored on the wrong detail, and the session went sideways. Garbage in, garbage out — but the garbage was plausible-sounding, which made it worse.

Attempt 3: Use a vector database as external memory. We embedded every message, stored it in Pinecone, retrieved relevant chunks at query time. This is closer to right. But retrieval is not memory. Retrieval gives you a best guess at what's relevant. Memory gives you the ground truth of what happened. When you're trying to track a decision — not just find documents about it — retrieval fails in subtle ways.

By the end of week one we had a Frankenstein system held together with cron jobs and hope. It sort of worked. It definitely did not scale.


Week Two: Claude and the Search Session

We switched to Claude. Anthropic had been making noise about long-context models, and we thought: if we can fit enough history into a 100k-token context window, maybe brute-force is the answer.

The model is remarkable. Claude's ability to reason over long documents, to hold nuance, to push back when you're wrong — it is the best conversational AI we have used. None of that is in question.

The memory problem, however, is completely separate from model quality. And this is where the "search session" horror started.

Our use case required Claude to search the web, synthesize results, and update its internal model of a domain over time. We used Claude with web search tool calls. A typical session looked like this:

  1. Load yesterday's context summary (our Frankenstein system from week one)
  2. Ask Claude to search for new developments
  3. Claude issues search calls, reads results, synthesizes
  4. Write the updated summary back to our store
  5. Repeat tomorrow

On day three, something broke. Claude's summary contradicted itself. An entity it had noted as "acquired" on Tuesday was marked "independent" on Thursday. We dug into the logs.

Here is what had happened: the search tool returned stale results. The web is not a consistent database. Different searches, different times, different results. Claude faithfully summarized what it saw — which was different each day — and our system had no way to adjudicate between versions. Which summary was right? We didn't know. Claude didn't know. The "memory" was not a single source of truth; it was a series of snapshots taken in different lighting.

We spent two days on this. We tried confidence scoring. We tried asking Claude to explicitly flag uncertainty. We tried maintaining a changelog of every update and asking Claude to resolve conflicts. All of it added complexity. None of it solved the root problem.

The root problem: there was no canonical store. Every tool in our stack had its own idea of reality. The vector database had one version. The summary file had another. The current context window had a third. When they disagreed, there was no tiebreaker.


What We Actually Needed

By the end of week two we had written more infrastructure than product. And we had a very clear picture of what was missing.

We needed a shared, versioned, conflict-resolved memory layer that all of our tools — Codex, Claude, our own services — could read from and write to. Something that worked like a database, but was designed for the specific shape of AI memory: heterogeneous, frequently updated, query-time-assembled, and accessible via a simple API from any context.

Not a vector database. Not a key-value store. Not a context window. Something new.

We looked at what existed:

  • Mem0 — closest in spirit, but consumer-oriented, and the API was too rigid for our use case.
  • Zep — excellent for conversation memory, less suited for structured knowledge.
  • LangGraph Persistence — tied to LangGraph's execution model.
  • Custom RAG pipelines — powerful but you're building and maintaining the whole thing.

Every option was either too narrow, too opinionated about the execution environment, or required us to own the infrastructure.

What we wanted was a primitive. The same way S3 is a primitive for storage, or Postgres is a primitive for relational data. A primitive for AI memory — shared, durable, queryable, and model-agnostic.


The Resolution We Built

We stopped fighting the tools and started designing the layer beneath them. The result is what became EverNow.

The core insight is simple in retrospect: memory is not a model problem, it is an infrastructure problem. Claude is not going to remember your last session because that is not what language models do. But a purpose-built memory layer can remember — and then surface the right context to whatever model is handling the current request.

The architecture has three pieces:

1. A write API — any agent, app, or human can write to the memory layer. Writes are typed (facts, decisions, observations, relationships) and versioned.

2. A conflict resolution layer — when two writes disagree, the system flags the conflict and applies configurable resolution rules. Newer beats older by default. But you can write rules: "if confidence > 0.9, override; otherwise, append."

3. A query-time assembly API — at the start of any session, you query the memory layer for relevant context. It returns a structured summary, not a bag of chunks. The model sees a coherent picture, not a retrieval soup.

The result is that Codex and Claude both become stateless compute — which is what they are — running against a stateful memory layer — which is what they need. The tools do what they are good at. The infrastructure does what they are bad at.


What We Learned

After two weeks of pain, here is what we would tell anyone starting this journey:

  1. Context windows are not memory. They are scratch space. Large scratch space is better than small scratch space, but it is not the same thing as durable state.
  2. Retrieval is not memory. Vector search is a powerful tool. It is not a substitute for a canonical store. Retrieval gives you the best guess; memory gives you the ground truth.
  3. Every workaround adds complexity that compounds. The summary-file approach, the vector-database approach, the manual conflict resolution — each one introduces a new failure mode. The right answer is one layer that handles all of it.
  4. The problem is not which model you use. We switched from Codex to Claude hoping the model would solve our infrastructure problem. It didn't, because it couldn't. Model capability and memory infrastructure are orthogonal problems.
  5. The primitive is missing. S3, Postgres, Redis — these exist because storage is a generic problem that every application faces. AI memory is the same. The primitive does not yet exist. We are building it.

What Comes Next

We are in private beta. If you are building with AI agents and running into the memory problem — inconsistent state, context loss between sessions, multi-agent coordination — we would like to talk to you.

The two weeks were painful. They were also clarifying. The clearest thing we know now: the AI era needs infrastructure, not just models. And that infrastructure starts with memory.