Like an Onion, Agentic Memory Has Layers

Most people say “memory” like it’s one thing, then get frustrated when it doesn’t behave that way. In practice, you’re dealing with at least four layers: context window, session state, compiled context, and durable memory. Pull them apart, and the behavior stops feeling random.

  • Published
  • Reading Time 9 min read
  • Tags
    • ai
    • memory
    • agents
    • architecture

Hacker News: “ChatGPT forgets everything every time you close the tab.” LangChain forum: “What’s the difference between checkpointing and long-term memory?” Product manager: “We need memory for our agent.” CTO across the hall: “We already have memory. We use a vector database.”

Same word. Not even close to the same thing.

This is the first expensive mistake in building agentic systems: treating memory as a single concept. It is not. It is at least four separable things that got collapsed under one word, and you cannot make good design decisions, or even use these systems well day to day, until you pull them apart.

The four things we keep conflating

1. The context window

The prompt you hand the model for a single turn. Not memory. More like RAM. It is working space, it evaporates the moment the call returns, and every LLM call rebuilds it from scratch.

When users complain the model “forgot” something from earlier in the chat, they are almost always noticing that the context window got too full and older turns fell out. That is not a memory failure. It is a capacity failure. Fixing it requires eviction policy, not persistence.

2. Session state

What lets a chat feel like a chat. The transcript. The record of what was said, what tools ran, what the model decided. It lives on disk or in a database somewhere, and the system reads enough of it back before each turn to reconstruct context.

ChatGPT’s history sidebar is this. LangGraph’s checkpoints are this. OpenAI’s conversation-state model is this. Without it, every message is a cold start.

Session state is replay infrastructure. It is not the same as remembering things about you across conversations.

3. Compiled context

This is the in-between layer almost nobody names, but it is where most of the leverage is. Summaries, cached packs, handoff bundles, pinned project notes, policy briefs, distilled artifact digests. Anything pre-processed into a form the model can ingest efficiently belongs here.

Anthropic now makes prompt caching a first-class API primitive, and Google does the same with context caching. Claude Code’s CLAUDE.md is compiled context. A project brief you hand a new agent is compiled context. A compaction summary is compiled context.

Compiled context is not raw history and is not durable typed memory. It is condensed reference material designed to slot into the prompt cheaply and repeatedly.

4. Durable memory

What people actually mean when they say “the agent should remember me.” Cross-session facts and preferences. “The user prefers concise answers.” “We decided last month to use Postgres, not Mongo.” “This project’s deadline is Friday.” This layer outlives the conversation. It is what ChatGPT’s memory feature is trying to do at the product level, even if OpenAI describes it in terms of saved memories and chat-history-derived personalization rather than a typed developer-facing store. It is also what Mem0, Letta, and Vertex AI Agent Engine Memory Bank are built around.

Durable memory is the only one of these four that actually has identity across time. It is also the one most systems get wrong, because they treat it as a bucket to dump summaries into, rather than a typed, scoped, revisioned store.

Why the conflation is expensive

If you believe memory is one thing, you build one system. You pick a vector database, call it memory, and move on. Then you discover:

  • Your “memory” makes the model slower because every turn re-embeds the whole transcript.
  • Your agent “remembers” things from conversations the user thought were private.
  • You cannot tell whether something the agent claims to recall came from the current thread, last week’s conversation, a compacted summary, or a stray scratchpad.
  • When the user says “that’s wrong, forget that,” you have no idea which layer to correct.
  • When the user starts a new project, every memory from every other project floods in.

These are not retrieval problems. They are category errors dressed up as retrieval problems.

The tier model

The cleanest way to hold all four in your head is as tiers, with increasing durability and decreasing volume.

Four stacked memory tiers, from top to bottom: Working, Session, Compiled, Durable. Between each pair, two arrows show bidirectional information flow — upward promotion and downward compilation. WORKING this turn’s prompt · rebuilt every call SESSION the chat transcript · replayed each turn COMPILED summaries and packs · pre-processed to reuse DURABLE facts that outlive the conversation

Information flows both ways. A preference observed in working context gets promoted to durable memory. Durable memory gets compiled into a project pack that gets pinned into working context at the start of the next session. A session gets compacted into a summary that may or may not deserve promotion to durable. Whether it does is a governance decision, not a storage decision.

This is MemGPT’s big insight, reframed. Your LLM is a CPU, your context window is RAM, and everything else is storage at different tiers with different access patterns.

Two orthogonal dimensions: class and scope

Once you have tiers, you need two more distinctions.

Class, the shape of a memory

A user preference is not the same thing as a project decision, an artifact digest, or a speculative hypothesis. Each has different fields, lifecycle rules, and retrieval semantics.

A working taxonomy:

  • fact — a semantic claim about the world
  • preference — a user’s standing choice
  • decision — a project ruling with rationale and alternatives considered
  • artifact — a reference to a canonical source with version
  • procedure — a how-to or runbook
  • policy — a rule that must be followed
  • hypothesis — a speculative claim, explicitly non-authoritative
  • episode — a time-bound event trace

Why this matters: retrieval should behave differently for each. A decision’s rationale should be preserved in full. A hypothesis should never surface as if it were fact. A preference should supersede its predecessor instead of appending. An artifact should know its source version so you can invalidate it when the source changes.

Scope, the boundaries on visibility

Scope is orthogonal to class, but it is not a single linear ladder. It is better modeled as multiple intersecting dimensions, each with its own hierarchy. Any given memory belongs to a set of coordinates, and effective retrieval is the intersection across them.

Take an image-generation agent named Pixel. Her memory lives along several axes at once.

  • Craft/domain axis. Camera technique, composition rules, color theory. Stable across every job she ever takes.
  • Agent axis. Pixel’s own operating style, tone, voice. Distinct from the other agents in the same workspace.
  • Project-type axis. Sports photography and portrait sessions have different conventions, lighting defaults, pacing. A memory good for one is wrong for the other.
  • Client/relationship axis. What this specific client likes, what they rejected last time, their brand guidelines.
  • Task/thread axis. What just happened in the current conversation.
  • Identity axis. User → team → workspace → org.
Six scope axes radiate from a center point: craft, agent, project, client, task, identity. Along each axis, tick marks indicate possible positions. One node on each axis is highlighted and connected to form a polygon representing the current memory retrieval envelope. CRAFT AGENT PROJECT CLIENT TASK IDENTITY

When Pixel retrieves memory for a turn, the relevant set is not “everything under a node in a tree.” It is the intersection of her current coordinates along every axis. This agent + this project-type + this client + this task, plus the standing craft memory that applies to all of them.

Claude Code’s precedence model is a simplified linear version of this, where more specific memory files override broader ones (for example, project-level guidance can override user-level guidance). Letta’s attachable memory blocks are closer to the real shape. Each block is a coordinate and an agent’s effective memory is the union of the blocks currently attached to it. Google Vertex AI Memory Bank formalizes the access side with IAM conditions over scope expressions.

The practical rule every serious system ends up at. Scope must fail closed. If you cannot resolve a retrieval’s coordinates, refuse, do not widen. Most privacy leaks in memory systems are silent scope-widening in disguise. “Just one more fallback” becomes “why did the agent just tell me about my coworker’s project?”

Classes and scopes interact: conditional memories

The class/scope split is useful, but in practice it is not fully orthogonal. Some of the most valuable memories are conditional. Their applicability depends on the current coordinates along other axes.

“My boss prefers green” is rarely universally true. It is “my boss prefers green when we’re pitching an environmental client.” The same human has different preferences for retail pitches, healthcare, legal. Surface that memory in the wrong context and it becomes noise or, worse, a confident wrong answer.

This shows up everywhere real agents work.

  • A decision that only applies when the project type is X.
  • A policy that only fires when handling a certain data category.
  • A procedure that varies by client.
  • An episodic pattern that only generalizes within a narrow condition.

Current systems handle this inconsistently. Mem0 attaches arbitrary metadata and lets you filter at retrieval time. LangGraph uses composite namespaces. Letta’s attachable blocks are a rougher analogue: attach the “environmental-client” block to the right agents in the right projects, and its contents become active. Google Vertex AI Memory Bank supports scope expressions that can encode conditions directly.

The underlying insight. A memory’s retrievability is often a predicate over the current context, not a single label. A cleanly typed preference can still need “applies-when: client.type = environmental” attached to it. Good systems make that predicate explicit and inspectable. Weak ones leave it implicit in how the retrieval query happens to be built. Which is exactly how you get a “preference” that is accidentally right most of the time and confidently wrong the rest.

Pipelines are not tiers

One more distinction is worth naming. Tiers (working, session, compiled, durable) are where memory lives. Pipelines are the processes that move information between tiers, and they are orthogonal to the tier model.

Two pipelines worth knowing by name because they show up under different labels in almost every serious system.

Dreaming. A process that runs over raw session traces and recall evidence and produces compiled artifacts: summaries, reflections, thematic groupings, consolidated briefs. The output of dreaming is compiled context, not durable memory. Its job is to produce better inputs for the next tier up. The Generative Agents paper calls this “reflection.” Vertex AI Agent Engine Memory Bank frames a similar idea as consolidation.

Habit formation. A separate process that decides what, if anything, from the compiled tier should become durable. This is where reinforcement curves, frequency thresholds, confidence gates, and conflict resolution live. When a preference is observed repeatedly, or a decision is confirmed, a promotion pipeline writes it into durable memory with provenance, class, and supersession links. Mem0’s information-extraction, conflict-resolution, and storage pipeline is this. Claude Code’s auto-memory is a lightweight version.

Other common pipelines: Compaction (session → compiled summary), extraction (raw transcript → typed durable memory), reflection (episodic traces → higher-level insight), invalidation (marking durable memories stale when source changes or supersession is detected), and revalidation (checking whether a memory’s source version still matches).

The important move is to keep tiers and pipelines cleanly separated in your head. A tier is a place. A pipeline is a policy. Most memory bugs are either a tier confusion (treating a compaction summary as durable truth) or a pipeline confusion (letting frequency alone justify durable promotion, which is how systems end up “remembering” their own dream output as fact).

Why this frame matters

Almost every serious problem in agent memory turns out to be one of four errors in disguise.

  • Tier error. Treating a compaction summary like durable truth. Letting the context window double as the memory model. Dumping raw transcripts into the durable store.
  • Class error. Storing a hypothesis next to a fact with equal retrieval weight. Treating all memories as flat text instead of typed records.
  • Scope error. Pulling the wrong project’s decisions into the current thread. Letting a private scratchpad promote into shared memory.
  • Lifecycle error. Never invalidating. Never superseding. Never noticing the source changed. Never distinguishing stale from current.

These are the categories. Storage and retrieval are tactics you choose after you have made these decisions, not before.

Where this goes from here

The field agrees more than you might think on the shape of memory: multi-tier, typed, scoped, conditional. Where it diverges is on governance, promotion policy, and how much of the memory logic belongs in the runtime versus the store.

That is the thread I want to pull on next. What Letta, LangGraph, Mem0, Google Vertex AI Memory Bank, Claude Code, AutoGen, and OpenAI’s Agents SDK all independently discovered about how to build these layers, and where they meaningfully diverge when it actually matters.

If you are building any of this, the single most useful thing you can do right now is stop saying “memory” and start saying which of the four you mean.