Mark Hayden — Thoughts

Congrats on Your Agent Memory System. Here's Where It Falls Apart.

Mark Hayden — Wed, 22 Apr 2026 12:00:00 GMT

I looked at seven teams that shipped memory systems for agents. A research lab, a hyperscaler, a YC startup, an open-source project, a CLI tool, a framework consultancy, and a managed memory platform. No shared playbook between them.

Across those seven systems, the same five decisions kept showing up. The same five cracks did too.

Convergence isn't proof anyone is right. Whole fields can agree and be wrong together. But when seven groups with little in common keep arriving at the same architecture, the overlap tells you something real about the shape of the problem. The disagreements tell you where the work isn't finished. The cracks tell you where it breaks in production.

This post explores all three.

What follows is my synthesis from these seven public systems, not a claim that the whole field has settled on one design.

The cast, with links for anyone who wants to go read the source:

LangGraph. Thread-scoped short-term memory (one conversation, one thread) paired with namespace-scoped long-term memory (one namespace per user or project).
OpenAI's Agents SDK. By default, sessions prepend stored conversation history before each run, with long-term memory left to whatever store you pair with it.
Claude Code. CLAUDE.md files and auto-memory, treating instructional memory (how to work, not just what you said) as first-class.
Microsoft AutoGen. Memory as a protocol (add, query, update_context, clear, close) with an explicit context-update step that runs right before each model call.
Mem0. Managed memory (hosted service you call into) with extraction, conflict resolution, and a dashboard.
Letta. Composable memory blocks (small, named chunks of memory) attachable and editable at runtime.
Google Vertex AI Memory Bank. Managed long-term memory with revisions enabled by default, TTL controls, and IAM conditions (Google's access-control system) layered over scopes.

The five decisions

1. Session state and durable memory are never the same system

Cleanest signal in the set. Across the seven systems I reviewed, "what was said in this conversation" and "what should outlive this conversation" are handled as separate concerns. They look similar on the surface. The access patterns, lifetimes, and failure modes aren't.

LangGraph has thread-scoped checkpoints for one and namespace-scoped stores for the other. OpenAI's Agents SDK auto-prepends and auto-appends turns but explicitly leaves durable memory to whatever you pair with it. Mem0 splits conversation, session, user, and org. Letta pins memory blocks in the prompt and persists messages separately, retrievable but not auto-replayed. Vertex AI Memory Bank treats session history and long-term memory as distinct resources with different lifecycles. Claude Code starts with a fresh context window and handles continuity through CLAUDE.md and auto-memory instead of replaying transcripts.

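The underlying point is the same. Compaction, shrinking a transcript so it fits the context window, is session hygiene. Promotion, deciding what moves from the session into durable memory, is a separate decision with separate policy. Conflate them and you break both.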
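A minimal sketch of that split, in Python. `SessionLog` and `DurableStore` are hypothetical names of mine, not an API from any of the seven systems, and the `approved` flag stands in for whatever promotion policy you actually run. The thing to notice is that compacting the session never writes to durable memory, and promoting never depends on compaction.

```python
from dataclasses import dataclass, field


@dataclass
class SessionLog:
    """Per-conversation transcript. Hypothetical; lifetime is one thread."""
    turns: list[str] = field(default_factory=list)

    def append(self, turn: str) -> None:
        self.turns.append(turn)

    def compact(self, keep_last: int) -> None:
        # Session hygiene only: drop older turns, never write to durable memory.
        self.turns = self.turns[-keep_last:]


@dataclass
class DurableStore:
    """Cross-conversation memory keyed by user. Hypothetical."""
    records: dict[str, list[str]] = field(default_factory=dict)

    def promote(self, user_id: str, text: str, approved: bool) -> bool:
        # Promotion is its own decision: nothing lands here without a policy verdict.
        if not approved:
            return False
        self.records.setdefault(user_id, []).append(text)
        return True


session = SessionLog()
durable = DurableStore()
for turn in ["hi", "I prefer pnpm", "what's the weather?", "thanks"]:
    session.append(turn)

durable.promote("alice", "Prefers pnpm", approved=True)
session.compact(keep_last=2)
```

After this runs, the session holds only the last two turns, and the one promoted preference is still in the durable store.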

2. Retrieval gets its own phase, right before the model call

Across these seven systems, retrieval shows up as an explicit assembly or context-update step immediately before execution. Not as a side effect of some storage layer. A governable step.

AutoGen names this most cleanly. Its memory protocol has a dedicated update_context operation whose only job is mutating the active context before the model runs. OpenAI Agents does it during input assembly. Letta injects attached blocks into the system prompt at run time. LangGraph reads state at every step boundary. Mem0's standard flow is query → assemble → prompt on each call.

This matters because governable retrieval is where all the policy actually hangs. Scope filters, staleness checks, confidence thresholds, explanation payloads. If retrieval is hidden inside the model call, none of that is possible. If it's a step you own, it's a step you can instrument.

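# Illustrative sketch based on AutoGen's Memory protocol (autogen_core.memory).
# `update_context` is the hook where memory mutates context before the model runs.
from autogen_core.memory import Memory, UpdateContextResult
from autogen_core.model_context import ChatCompletionContext
from autogen_core.models import SystemMessage

class MyMemory(Memory):  # add, query, clear, and close omitted for brevity
    async def update_context(self, model_context: ChatCompletionContext) -> UpdateContextResult:
        messages = await model_context.get_messages()
        hits = await self.query(str(messages[-1].content))  # assumes a non-empty context
        recalled = "\n".join(str(m.content) for m in hits.results)
        await model_context.add_message(SystemMessage(content=recalled))
        return UpdateContextResult(memories=hits)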
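
Owning the step means the policy is ordinary code. A rough sketch, assuming a hypothetical `Hit` record and made-up thresholds (0.6 confidence, 90 days). This is not AutoGen's API, just the shape of a governed retrieval pass that filters by scope, staleness, and confidence, and keeps an explanation for every candidate.

```python
from dataclasses import dataclass


@dataclass
class Hit:
    text: str
    scope: str
    confidence: float
    age_days: int


def govern(hits, scope, min_confidence=0.6, max_age_days=90):
    """Filter candidate memories and record why each one was kept or dropped."""
    kept, explanations = [], []
    for hit in hits:
        if hit.scope != scope:
            reason = "dropped: out of scope"
        elif hit.age_days > max_age_days:
            reason = "dropped: stale"
        elif hit.confidence < min_confidence:
            reason = "dropped: low confidence"
        else:
            reason = "kept"
            kept.append(hit)
        explanations.append((hit.text, reason))
    return kept, explanations


candidates = [
    Hit("Prefers concise replies", "alice", 0.9, 10),
    Hit("Uses Jest", "alice", 0.8, 400),
    Hit("Likes tabs", "bob", 0.95, 5),
    Hit("Maybe prefers Go", "alice", 0.3, 2),
]
kept, explanations = govern(candidates, scope="alice")
```

The `explanations` list is the part worth keeping around. It is the raw material for a "why recalled" view later.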

3. Promotion needs policy. No blind append.

The weakest systems dump raw transcripts into durable memory and let retrieval sort it out later. The strongest ones treat every promotion from session to durable as an explicit pipeline with gates at each step.

Mem0 runs an extract-resolve-store flow on add: infer structured memories, detect conflicts with existing entries, resolve with "latest truth wins," then store. The inferred memory has different authority than the raw transcript it came from. Claude Code's auto-memory watches for user corrections and repeated preferences, not every observation that goes by. Letta uses editable blocks rather than auto-promotion. Agents write blocks when something deserves it. Vertex AI Memory Bank runs extraction and consolidation as async background jobs, off the hot path.

The shared insight, compressed. Raw experience is not memory. Memory is what survives a promotion policy.
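
Here is a toy version of an extract, resolve, store pipeline with a "latest wins" rule. The `Memory` dataclass, the string-matching extractor, and the `key` field are all assumptions for illustration. A real system would use a model for extraction, and this is not Mem0's implementation.

```python
from dataclasses import dataclass


@dataclass
class Memory:
    key: str        # what the memory is about, e.g. "package management"
    value: str
    turn: int       # position in the conversation, used as recency
    active: bool = True


def extract(turn_index, text):
    """Toy extractor: only explicit 'I prefer X for Y' statements become candidates."""
    prefix = "I prefer "
    if text.startswith(prefix) and " for " in text:
        value, key = text[len(prefix):].split(" for ", 1)
        return Memory(key=key.strip(), value=value.strip(), turn=turn_index)
    return None  # everything else stays in the session transcript


def promote(store, candidate):
    """Resolve conflicts, then store: the newer statement supersedes the older one."""
    for existing in store:
        if existing.active and existing.key == candidate.key:
            if existing.value == candidate.value:
                return  # duplicate, nothing to do
            existing.active = False  # keep the old record, but mark it inactive
    store.append(candidate)


store = []
transcript = [
    "I prefer npm for package management",
    "can you list the files?",
    "I prefer pnpm for package management",
]
for i, text in enumerate(transcript):
    candidate = extract(i, text)
    if candidate is not None:
        promote(store, candidate)

active = [m.value for m in store if m.active]
```

Three turns go in, two records come out, and only one of them is active. The middle turn never reaches the store at all.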

4. Scope is a first-class retrieval and safety control

Across the seven systems, scope is treated as part of the query shape or attachment model, not a filter slapped on after the fact.

Mem0 is built around entity-scoped memory via identifiers like user_id, agent_id, app_id, and run_id. LangGraph's stores are namespace-scoped, typically composite (user plus project). Letta's attachable blocks mean an agent's effective memory is the union of whatever blocks are currently attached. Vertex AI Memory Bank enforces identity-scoped isolation and supports IAM conditions that can express "this principal can read memories where scope.project = X." Claude Code loads managed, project, user, and local instruction sources, with more specific guidance typically taking precedence.

The convergent rule in the systems above. Scope must fail closed. If the scope can't be resolved, refuse. Don't widen. The reason isn't just relevance. It's privacy. Silent scope-widening is how memory leaks across projects, across users, across agents. "Just one more fallback" becomes "why did the agent just tell me about my coworker's project."

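# Fail-closed scope: no resolvable scope = refuse, never silently widen
# (`store`, `user_id`, and `q` come from the surrounding runtime)
class ScopeError(Exception):
    """Raised when a query's scope cannot be resolved."""

if not user_id:
    raise ScopeError("user_id required; refusing rather than widening to global")

results = store.search(("memories", user_id), query=q)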

5. Pinned memory beats retrieval for top-priority facts

Every system in this set that handles long-running agents ends up with a pinned lane. Memory that lives in the prompt unconditionally, not filtered through ranking.

Claude Code loads CLAUDE.md and auto-memory at session start. Not retrieved. Present. Letta's "core" blocks are injected directly into the system prompt. Other memory is retrievable but not pinned. OpenAI Agents session items auto-prepend before each run, subject to history limits or compaction if you enable them. LangGraph reads state at step start unconditionally.

Why retrieval is insufficient for the high-priority layer: retrieval is ranking, and ranking has variance. For preferences that must always apply and policies that must always fire, variance is the enemy. Pinning removes the variance by making the memory part of the prompt's base instead of its results.

The tradeoff is obvious. Pinned memory costs tokens every call. So these systems tend toward a small pinned layer and a larger retrieved layer. Which is the tier model from MemGPT (the research project Letta grew out of) again, just dressed differently.
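
The small-pinned, larger-retrieved split is easy to sketch. Assumptions: `count_tokens` is a crude word count standing in for a real tokenizer, and the scores and budget are invented. Pinned entries go in unconditionally, and ranked entries compete for whatever budget is left.

```python
def count_tokens(text):
    # Crude stand-in for a real tokenizer: whitespace-separated words.
    return len(text.split())


def assemble_prompt(pinned, ranked, budget):
    """Pinned entries always go in; ranked entries fill whatever budget remains."""
    used = sum(count_tokens(p) for p in pinned)
    if used > budget:
        raise ValueError("pinned layer alone exceeds the token budget")
    chosen = []
    for score, text in sorted(ranked, reverse=True):
        cost = count_tokens(text)
        if used + cost <= budget:
            chosen.append(text)
            used += cost
    return pinned + chosen, used


pinned = ["Use pnpm, not npm.", "Never use em dashes in copy."]
ranked = [
    (0.91, "User prefers TypeScript over JavaScript"),
    (0.40, "User once asked about Rust lifetimes in a long side conversation"),
    (0.75, "Project uses Vitest"),
]
prompt_parts, used = assemble_prompt(pinned, ranked, budget=20)
```

With a budget of 20, both pinned rules and the two highest-scoring memories fit. The low-scoring one is the only thing that can ever be squeezed out, which is the whole point of pinning.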

Each team's distinctive decision

What makes each system its own thing:

LangGraph. Cleanest public split between short-term memory (per conversation) and long-term memory (per user or project). On top of that, an explicit taxonomy: semantic (facts), episodic (events), procedural (how-to).
OpenAI's Agents SDK. Continuity first. By default, sessions replay stored history into the next run, with optional history limits and compaction. Durable memory is left to whatever store you pair with it.
Claude Code. Instructional memory as first-class. CLAUDE.md treats repo conventions and workflow rules as a memory class in their own right, separate from facts about the user.
Microsoft AutoGen. Memory as a formal protocol. Five operations any memory implementation has to provide: add, query, update_context, clear, close. The update_context operation is the explicit phase where retrieval meets the prompt, called right before the model runs.
Mem0. Managed memory as a product surface. One hosted service handles extraction (pulling memories out of conversation), conflict resolution (when two memories disagree), scoped search, and a dashboard. One flow instead of seven you assemble yourself.
Letta. Memory as composable objects. Blocks can be pinned in the prompt, attached to an agent, shared across agents, and edited at runtime by the agent itself. An agent can rewrite its own memory mid-task.
Google Vertex AI Memory Bank. Memory with operational rigor. Revisions are enabled by default (and can be disabled), TTL controls how long memories or revisions persist, deleted memories have a limited recovery window, and IAM conditions support access rules like "this principal can only read memories where project = X".

Where they actually disagree

Convergence gets the headlines. The interesting part is where the seven systems still don't agree, and three disagreements have real weight. Each one is a choice that changes how the system behaves day to day, not just how it reads on a docs page.

Instruction memory vs. fact memory

Claude Code is built around instructional memory. How to behave, repo conventions, workflow rules. Mem0 is built around factual memory. The user likes X, the project decided Y. Letta spans both through blocks. Most other systems implicitly treat memory as facts and leave instructions to the system prompt.

<!-- CLAUDE.md: instructional memory, loaded into every session -->
Use pnpm, not npm.
Write tests with Vitest, not Jest.
Never use em dashes in copy.

# Illustrative pseudocode: factual memory, extracted from conversation and scoped to a user
m.add("User prefers TypeScript over JavaScript", user_id="alice")
m.add("Project 'Bakin' chose AntflyDB over LanceDB", user_id="alice")

This matters. Retrieval logic, promotion rules, and editability requirements are genuinely different between the two. A preference can be superseded. A workflow rule has to fire every turn. Systems that collapse them struggle in both directions. Instructions get forgotten in long tasks. Facts get rigidly reapplied when they shouldn't.

Runtime-owned vs. store-owned governance

Where does lifecycle policy live?

Store-owned. Mem0 and Vertex put promotion, supersession, TTL, and extraction inside the memory store. The agent calls add and search and the store handles the rest.

# Illustrative pseudocode: the store decides what to extract, dedup, and expire
m.add("User prefers concise responses", user_id="alice")
results = m.search("response style", user_id="alice")

Runtime-owned. LangGraph and AutoGen put policy in the runtime. You decide when to write, what to extract, how to rerank.

# Illustrative pseudocode: you own the extraction, the write, and the query shape
from uuid import uuid4

if should_remember(message):
    await store.aput(
        ("memories", user_id),
        str(uuid4()),
        {"text": extract(message), "kind": "preference"},
    )

memories = await store.asearch(("memories", user_id), query=current_context)

Hybrid. Letta and Claude Code do both. Some lifecycle is agent-driven (agents editing their own blocks), some is infrastructural.

# Illustrative pseudocode based on Letta's current memory editing tools.
memory_replace(
    block_label="persona",
    old_text="Generalist engineer.",
    new_text="Senior engineer. Prefers Go. Concise responses.",
)

Neither is wrong. Store-owned is easier to adopt, harder to customize. Runtime-owned is more flexible, more work per deployment. The choice has implications all the way up to who owns correction flows and provenance visibility.

User-facing visibility

Widest gap in the field.

ChatGPT has polished end-user memory controls. Inspect, delete, disable, temporary mode. Mem0's OpenMemory UI markets browse, tag, and manage flows. Letta exposes blocks as inspectable objects. Vertex exposes revisions and IAM but is operator-facing, not user-facing. AutoGen, LangGraph, and the Agents SDK leave user-facing memory UX almost entirely to the application developer.

This is one of the first things I wanted to get right when building Bakin for OpenClaw. An agent memory system is only useful if you can see what's in it and edit it.

The forcing function came from a rename. The project started as "beacon." It got renamed to "bakin." My main agent, Roscoe, could not let beacon go. Ran in loops. I tried AGENTS.md. Skills files. Escalations in the system prompt. Same behavior every time.

So I built a view over the memory store. Session, markdown, and durable tiers in one place, searchable and editable. First query: "beacon."

Dozens of beacon references sitting in the durable store across scout, pixel, patch, basil, nemo. Auto-managed by beacon doctor. Pinned, weighted, surfacing every turn. Prompt engineering was never going to outrank that. Once I could see it, the fix was thirty seconds.

Infrastructure has advanced much faster than the interfaces. This one's not close.

The five cracks everyone has

Across the seven systems, the same gaps show up repeatedly.

Scope and isolation. Most systems have scopes. Fewer make scope resolution, fail-closed behavior, and cross-scope leakage inspection truly first-class.
Lifecycle and supersession. Most systems can store a new fact. Fewer can cleanly model "this replaces that," mark records stale, surface conflicts, and preserve visible revision history.
Provenance and "why recalled." Users and operators want to know where a memory came from and why it surfaced. Almost nobody exposes this as a queryable or visible object.
Summary artifacts treated as canonical. Compactions, reflections, and synthesized briefs often get reused as if they were primary sources, even when they have already dropped nuance or preserved the wrong branch.
Promotion without authority checks. Hypotheses, drafts, tool paraphrases, and model inferences get promoted into durable memory without enough checks on truth status, source authority, or applicability.

These aren't nice-to-haves on someone's backlog. They're where many of the production bugs that matter actually live.

What breaks when the cracks open

Take each crack in turn and the recurring failure modes fall out immediately. The same shapes show up across systems, which is the strongest evidence the cracks are structural, not implementation details.

Scope crack, open. Silent scope-widening pulls another user's preference, another project's decision, or another agent's local note into the wrong session. The scope filter matched nothing, a fallback widened the search, something returned, and it was wrong. Nobody noticed until a cross-customer note surfaced. Or a "temporary" session still contributes to recall traces, flushes, or dream promotions, so an ephemeral conversation quietly changes future behavior. That one destroys user trust faster than any other failure in this category, because it breaks an explicit promise.

Lifecycle crack, open. The user changes a standing preference and the old one still ranks highly, so the agent keeps writing in the old style. The project architecture shifted, but last quarter's pinned decision brief is still in the retrieval set, and the agent implements reverted patterns. The source file changed and the memory didn't, so the agent cites a stale quote as current truth. The user says "that's wrong" and the agent adapts the current turn, but the underlying record stays untouched, so the same wrong fact resurfaces next week. That last one is what users mean when they say "I feel like I'm telling it the same thing over and over." Same shape as the beacon-to-bakin story earlier. The rename landed in the prompt. The corrections landed in the current turn. The durable store was never touched, so "beacon" kept surfacing every run until I opened the memory view and cleaned it out.

Provenance crack, open. The extraction pipeline writes an inferred claim as a fact, and months later the system "remembers" something the user never said. A brainstorm-phase hypothesis ranks beside a user-confirmed preference and gets cited as settled knowledge. A tool result gets paraphrased into something stronger than the source supported. The memory is retrievable, but nobody can tell where it came from or whether to trust it.

Summary crack, open. A long session gets compacted after an earlier plan was rejected. The summary preserves the original plan and silently drops the reversal. Future turns revive the wrong branch. Recursive summary-of-summary cycles smooth away the hedged qualifiers until the brief reads cleaner than the evidence ever did. The same synthesized summary gets restated across enough turns that agents stop checking primary sources, and the summary becomes de facto canonical. This whole category is one mistake in different clothes. A derivative artifact treated as canonical.

Promotion crack, open. A temporary working note appears often enough to get promoted, and a one-off debugging assumption becomes a durable rule. Semantic similarity ranks an older but textually similar note above the newer explicit decision record. A dreaming pipeline promotes a popular but misleading snippet because it scored high on frequency and nothing else checked its authority.
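
The promotion crack in particular has a cheap first defense: never let frequency stand in for authority. A minimal sketch, assuming a hypothetical `Candidate` record and an authority ordering I made up for illustration.

```python
from dataclasses import dataclass

# Assumed ordering of source authority; the labels and weights are illustrative.
AUTHORITY = {"user-stated": 3, "tool-output": 2, "model-inferred": 1}


@dataclass
class Candidate:
    text: str
    origin: str
    kind: str        # "fact", "hypothesis", ...
    seen_count: int  # how often it appeared; deliberately ignored by the gate


def may_promote(c, min_authority=2):
    """Frequency alone never qualifies a candidate; truth status and authority do."""
    if c.kind == "hypothesis":
        return False, "hypotheses stay in the session"
    if AUTHORITY.get(c.origin, 0) < min_authority:
        return False, "source authority too low"
    return True, "ok"


popular_guess = Candidate("Service is flaky because of DNS", "model-inferred", "fact", 12)
debug_note = Candidate("Try disabling the cache", "user-stated", "hypothesis", 7)
decision = Candidate("We deploy from main only", "user-stated", "fact", 1)

verdicts = [may_promote(c) for c in (popular_guess, debug_note, decision)]
```

The guess that showed up twelve times is refused. The decision that showed up once gets through.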

The whole corpus of memory bugs compresses to one line: the wrong thing, with the wrong status, in the wrong scope, for too long.

"Wrong thing" is provenance failures plus summary artifacts being treated as primary truth. "Wrong status" is lifecycle. "Wrong scope" is scope. "For too long" is what happens when correction, supersession, and expiry fail. Promotion without authority checks is how the wrong thing gets durable status in the first place. Those five cracks are the mechanisms behind the whole failure pattern.

What the cracks tell you to carry on every record

Read as design constraints, the cracks give you the minimum shape I'd want in any serious durable memory record.

class, because a hypothesis should retrieve differently from a fact
scope, because fail-closed is the only defense against silent cross-contamination
origin, because user-stated and model-inferred need different trust weights
status of active | stale | superseded | conflicted | archived, because "it's still in the store" is not the same as "it's still true"
confidence, because not all memories deserve equal retrieval weight
freshness or lastValidatedAt, because memories go stale even when correct
sourceRef with source id, path, version, and quote span, because without provenance you cannot debug anything
supersedes and supersededBy, because "this replaces that" is the most common operation you'll want later
createdBy and createdFrom, because knowing who wrote a memory is a prerequisite for deleting it correctly

Put them together and the record looks like this.

interface DurableMemory {
  id: string;
  class: "fact" | "hypothesis" | "instruction" | "preference";
  scope: { userId: string; projectId?: string; agentId?: string };
  origin: "user-stated" | "model-inferred";
  status: "active" | "stale" | "superseded" | "conflicted" | "archived";
  confidence: number;
  lastValidatedAt: Date;
  sourceRef?: {
    docId: string;
    path: string;
    version: string;
    quote: string;
  };
  supersedes?: string;
  supersededBy?: string;
  createdBy: string;
  createdFrom: "extraction" | "user-input" | "tool-output";
}

None of these fields is theoretical. Each one maps to a specific failure above, the one you get when the field is missing.
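
To make the status and supersession fields concrete, here is a small Python sketch of the "this replaces that" operation. The `Record` type is a cut-down, hypothetical version of the shape above, and the dict is a stand-in for a real store. The old record stays inspectable but stops being eligible for recall.

```python
from dataclasses import dataclass, replace
from typing import Optional


@dataclass(frozen=True)
class Record:
    id: str
    text: str
    status: str = "active"
    supersedes: Optional[str] = None
    superseded_by: Optional[str] = None


def supersede(store, old_id, new):
    """Write the new record and mark the old one superseded, in one operation."""
    old = store[old_id]
    store[old_id] = replace(old, status="superseded", superseded_by=new.id)
    store[new.id] = replace(new, supersedes=old_id)


def recall(store):
    # Only active records are eligible for retrieval; history stays inspectable.
    return [r.text for r in store.values() if r.status == "active"]


store = {"m1": Record(id="m1", text="Project is called beacon")}
supersede(store, "m1", Record(id="m2", text="Project is called bakin"))
```

After the call, `recall(store)` only surfaces the new name, and each record points at the other.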

Why this frame matters

The field agrees more than you'd think on the shape of memory. Multi-tier. Typed. Scoped. Retrieval as an explicit step. Policy on promotion. Where it still splits: governance, the specifics of that promotion policy, and how much of the memory logic belongs in the runtime versus the store.

If you're evaluating or building one of these systems, the useful question isn't "does it have memory." Across the seven systems above, the same five decisions recur. The useful question is which of the five cracks is still open, because that's where you'll pay the failure tax.

Work Ethic Is Still the Moat

Mark Hayden — Tue, 21 Apr 2026 12:00:00 GMT

We are in a strange moment.

The tools are absurdly good. One capable person (key word being capable) can legitimately prototype a product in a weekend, ship something real in a week, automate work that used to take a team, and learn faster than ever.

Last night I was playing with Seedance 2.0 (absolutely nuts if you have not tried it) and was able to generate five clips, stitch them together, add a few transitions, do a little slicing, dicing, and tweaking in DaVinci Resolve, and come out with a fifteen-second intro for Bakin I actually like. Not perfect. Not award-winning. But thirty minutes got me to an end result that used to take months of storyboarding, sketching, and modeling. That is not a stunt. That is the new baseline for a lot of what used to require a team.

Some of what is happening right now is genuinely incredible. Some of it is complete bullshit.

There is more smoke in the air than I can remember. Every single day there is a new "this changes everything" post, a new "we no longer need [insert profession]" take. The latest is Claude Design supposedly killing designers. Newsflash, it will not. Everything cannot change everything every day. The clickbait madness has to stop at some point, because it is drowning out the real signal. Fake builders, demo merchants, people confusing a clean UI with a real business, people confusing generated output with understanding, people acting like a prompt is the same thing as judgment, people trying to sell an "easy button" that somehow only they have access to.

They do not. That is the point.

The tools are here. They are broadly available. The advantage is real, but it is not exclusive, which means the old differentiator did not go away. If anything, it got more important.

Work ethic is still the moat.

Not work ethic in the performative sense. Not staying busy for optics. Not bragging about long nights for LinkedIn. Not burning yourself out so someone else can squeeze another quarter of output out of you. I mean the quieter thing. The harder thing to fake. The willingness to show up consistently, learn aggressively, do the boring reps, fix what is broken, stay with the problem when it stops being fun, and keep investing in yourself when nobody is forcing you to.

That is still the thing.

Access is not skill

AI absolutely raises the floor. It makes it easier to start, easier to draft, easier to ship a rough first pass, easier to explore an unfamiliar domain, easier to automate tedious work, easier to become dangerous quickly. That is a real advantage, and pretending otherwise is lazy.

But access to a tool is not the same as mastery of one, and easier access to leverage does not eliminate the value of effort. It just moves where the effort matters.

The full potential of any of this only opens up through time, reps, and genuine curiosity. If the only way you learn these tools is YouTube explainers and Reddit threads, you are perpetually one cycle behind people who are actually using them to ship. At the pace the frontier is moving, one cycle behind is a lot. You have to invest. You have to play. You have to be weird with it, break things, push past the happy path, and stay interested long after the novelty has worn off. That is where taste comes from. That is where the real leverage compounds.

When building gets cheaper, follow-through matters more. When answers get cheaper, judgment matters more. When everyone can generate, taste matters more. And when everyone can look productive, the people who actually are productive start to separate themselves at a speed that feels uncomfortable to the people standing still.

I have had a lot of conversations about that last part recently. The gap is widening in a way people can feel day to day. It changes how teams get staffed, how trust gets allocated, who gets handed the ambiguous problem, who gets handed the escalation. It is rewiring business dynamics in real time, and nobody is going to put out a press release about it. It is just happening.

AI is not the moat. Access to tools is not the moat. Prompt fluency by itself is not the moat. The moat is being the kind of person who does not stop at the draft.

Reality still has edge cases

One reason the hype gets so exhausting is that it keeps pretending the hard part was typing. It usually was not. The hard part is knowing what should exist, knowing what not to build, handling the ugly edge cases, the hidden dependencies, the customer who phrases the problem badly, the second-order effect, the rewrite after the first version taught you something inconvenient. That is why a lot of AI output looks impressive right up until it meets reality.

The research keeps echoing the same thing. In July 2025, METR published a randomized controlled trial of 16 experienced open-source developers working on their own repos. Developers using early-2025 AI tools took 19% longer to complete issues, even though they expected the tools to make them significantly faster. A follow-up in February 2026 showed the picture had flipped. The same cohort was now measurably faster with modern tooling. That is a tiny sample, so I would not treat either reading as gospel. But the directional point is hard to dismiss, and I do not think the two readings contradict each other. The tools genuinely got better, fast. The first study still pointed at something more interesting: a powerful tool does not automatically translate into better outcomes, even for smart and experienced people. The operator matters. The workflow matters. The standards matter. Those variables did not become less important when the tools got better. They became the thing that separates the 18% faster from the 19% slower.

Microsoft Research presented a related study at CHI 2025: a survey of 319 knowledge workers where higher confidence in GenAI was associated with less critical thinking, while higher self-confidence was associated with more. That feels like the whole game in one sentence. Strong people use the tool as leverage. Weak habits use the tool as an escape hatch.

That is the gap.

What actually separates people

If you dig through the conversations around AI, startups, and engineering right now, the same themes keep surfacing from people who are actually building. In a recent r/SaaS thread, the strongest comments were not saying AI is fake. They were saying the moat moved from "can you build it?" to "do you understand the problem deeply enough to solve it better than the six identical competitors?" One commenter put it more directly than I could:

That is exactly right, and it shows up everywhere once you start looking for it.

Mario Zechner gave a talk at AI Engineer Europe that landed on almost the same point from a different angle. His closing lines stuck with me: think about what you are building and why. Learn to say no. Fewer features, but the right ones, polished. Friction builds understanding and taste. Be in the code. Non-critical code, go nuts. Critical code, review every line. None of that is anti-AI. It is the opposite. That is a builder who has actually shipped things telling you what the tool does and does not remove. It still does not remove taste, discipline, or the part of the job that requires you to be in the work.

The future is probably not "manual purists beat everyone." The future is much more likely that disciplined people with strong fundamentals use these tools to pull away even harder from people who are using them to avoid doing the work.

Effort, not optics

This part matters, because otherwise the whole argument gets flattened into hustle nonsense.

Work ethic is not blindly over-identifying with your employer. It is not a moral obligation to make yourself useful to people who do not value you. It is not proof of virtue just because someone stayed online late. There is a reason so many people react badly to the phrase. A lot of them have watched "work ethic" used as a euphemism for being underpaid, overworked, and guilted into performing loyalty for institutions that would drop them in a week. That criticism is fair, and that is not the version worth defending.

It is also not the loud version. In my experience, the people who talk the most about how late they stayed, how many hours they put in, and how hard they ground last week are almost never the ones actually grinding. Real work ethic usually comes from the sleepers. The quiet ones who are just methodically crushing what is in front of them and do not particularly need anyone to notice. They are not doing it for a manager, a review cycle, or a LinkedIn post. They are doing it because there is something internal driving them. Their own well-being. Their own curiosity. Their own standard for what they will accept from themselves. That fire does not announce itself, and it does not need external validation to stay lit. It just keeps going.

That is the version worth defending. The standard you hold yourself to. The effort you invest in your own capacity. The habit of learning. The willingness to get better at things that matter. The refusal to become passive while waiting for permission, validation, funding, management, or rescue.

You do not build that for shareholders. You build that because it makes you more dangerous, more capable, and more free.

Find that easy button yet?

There has never been an easy button that was uniquely yours, and there never will be. There will be windows, timing, luck, and people with better networks, better health, better opportunities, and cleaner starts than you. Plenty of people will get ahead for reasons that have nothing to do with merit. That has always been true.

The point is not that hard work guarantees outcomes. It does not. The point is that work ethic is still the highest-leverage variable that stays under your control for a long time. It compounds. It sharpens taste. It builds judgment. It earns trust. It lets you capitalize when the window opens. It gives the tools something solid to attach to.

Without that, all this new leverage just turns into more noise. With it, you really can 10x, 20x, even 100x yourself.

That is the real opportunity right now. Not to posture. Not to cosplay as a founder. Not to collect prompts. Not to wait for somebody else to pick you. To build, to learn, to get unreasonably good, to use the tools without hiding behind them, and to invest in yourself before anyone else decides to.

This moment is wide open. The ability to make crazy things is more accessible than it has ever been. The barrier to trying is lower. The excuse set is smaller. The distance between idea and first version is collapsing in real time. That should not make you softer. It should make you more dangerous.

Borrowed competence

If everyone has leverage, the differentiator becomes who actually leverages it. If everyone can start, the differentiator becomes who keeps going. If everyone can make noise, the differentiator becomes who can still produce signal.

Work ethic is not becoming obsolete. It is becoming easier to see. And in a world full of borrowed competence, polished demos, and low-friction excuses, that may be the most valuable moat left.

Like an Onion, Agentic Memory Has Layers

Mark Hayden — Mon, 20 Apr 2026 12:00:00 GMT

Hacker News: “ChatGPT forgets everything every time you close the tab.” LangChain forum: “What’s the difference between checkpointing and long-term memory?” Product manager: “We need memory for our agent.” CTO across the hall: “We already have memory. We use a vector database.”

Same word. Not even close to the same thing.

This is the first expensive mistake in building agentic systems: treating memory as a single concept. It is not. It is at least four separable things that got collapsed under one word, and you cannot make good design decisions, or even use these systems well day to day, until you pull them apart.

The four things we keep conflating

1. The context window

The prompt you hand the model for a single turn. Not memory. More like RAM. It is working space, it evaporates the moment the call returns, and every LLM call rebuilds it from scratch.

When users complain the model "forgot" something from earlier in the chat, they are almost always noticing that the context window got too full and older turns fell out. That is not a memory failure. It is a capacity failure. Fixing it requires eviction policy, not persistence.
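A rough sketch of what "eviction policy" means here. Everything in it is an assumption for illustration: the `Turn` type, the `pinned` flag, and the word-count stand-in for a real tokenizer. It keeps pinned turns, then keeps the newest turns that still fit, so the oldest ones fall out first.

```python
from dataclasses import dataclass


@dataclass
class Turn:
    role: str
    text: str
    pinned: bool = False  # hypothetical flag, e.g. a system prompt that must stay


def rough_token_count(text: str) -> int:
    # Assumption: whitespace-separated words stand in for real tokens.
    return len(text.split())


def fit_to_budget(turns: list[Turn], budget: int) -> list[Turn]:
    """Keep pinned turns, then the newest unpinned turns that still fit."""
    used = sum(rough_token_count(t.text) for t in turns if t.pinned)
    keep = [t.pinned for t in turns]
    # Walk newest to oldest so the oldest turns are the first to be evicted.
    for i in range(len(turns) - 1, -1, -1):
        if turns[i].pinned:
            continue
        cost = rough_token_count(turns[i].text)
        if used + cost > budget:
            break  # everything older than this turn is dropped too
        used += cost
        keep[i] = True
    # Preserve the original chronological order of whatever survived.
    return [t for t, k in zip(turns, keep) if k]
```

Nothing here persists anything. It only decides what fits in one call, which is the point: this is capacity management, not memory.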

2. Session state

What lets a chat feel like a chat. The transcript. The record of what was said, what tools ran, what the model decided. It lives on disk or in a database somewhere, and the system reads enough of it back before each turn to reconstruct context.

ChatGPT's history sidebar is this. LangGraph's checkpoints are this. OpenAI's conversation-state model is this. Without it, every message is a cold start.

Session state is replay infrastructure. It is not the same as remembering things about you across conversations.
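As a minimal sketch of replay infrastructure, assuming an append-only event table keyed by thread: the table layout and the function names (`open_store`, `append_event`, `replay`) are made up for illustration and are not any vendor's API.

```python
import sqlite3


def open_store() -> sqlite3.Connection:
    # In-memory database for the sketch; a real system would use a file or server.
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE events ("
        " id INTEGER PRIMARY KEY AUTOINCREMENT,"
        " thread_id TEXT NOT NULL,"
        " role TEXT NOT NULL,"
        " content TEXT NOT NULL)"
    )
    return conn


def append_event(conn: sqlite3.Connection, thread_id: str, role: str, content: str) -> None:
    conn.execute(
        "INSERT INTO events (thread_id, role, content) VALUES (?, ?, ?)",
        (thread_id, role, content),
    )


def replay(conn: sqlite3.Connection, thread_id: str, last_n: int = 20) -> list[dict]:
    """Read back the most recent events for one thread, oldest first."""
    rows = conn.execute(
        "SELECT role, content FROM ("
        " SELECT id, role, content FROM events"
        " WHERE thread_id = ? ORDER BY id DESC LIMIT ?"
        ") ORDER BY id ASC",
        (thread_id, last_n),
    ).fetchall()
    return [{"role": role, "content": content} for role, content in rows]
```

Note what `replay` cannot do. It only ever sees one thread, so nothing it returns can count as remembering you across conversations.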

3. Compiled context

This is the in-between layer almost nobody names, but it is where most of the leverage is. Summaries, cached packs, handoff bundles, pinned project notes, policy briefs, distilled artifact digests. Anything pre-processed into a form the model can ingest efficiently belongs here.

Anthropic now makes prompt caching a first-class API primitive, and Google does the same with context caching. Claude Code's CLAUDE.md is compiled context. A project brief you hand a new agent is compiled context. A compaction summary is compiled context.

Compiled context is not raw history and is not durable typed memory. It is condensed reference material designed to slot into the prompt cheaply and repeatedly.

4. Durable memory

What people actually mean when they say "the agent should remember me." Cross-session facts and preferences. "The user prefers concise answers." "We decided last month to use Postgres, not Mongo." "This project's deadline is Friday." This layer outlives the conversation. It is what ChatGPT's memory feature is trying to do at the product level, even if OpenAI describes it in terms of saved memories and chat-history-derived personalization rather than a typed developer-facing store. It is also what Mem0, Letta, and Vertex AI Agent Engine Memory Bank are built around.

Durable memory is the only one of these four that actually has identity across time. It is also the one most systems get wrong, because they treat it as a bucket to dump summaries into, rather than a typed, scoped, revisioned store.

Why the conflation is expensive

If you believe memory is one thing, you build one system. You pick a vector database, call it memory, and move on. Then you discover:

Your "memory" makes the model slower because every turn re-embeds the whole transcript.
Your agent "remembers" things from conversations the user thought were private.
You cannot tell whether something the agent claims to recall came from the current thread, last week's conversation, a compacted summary, or a stray scratchpad.
When the user says "that's wrong, forget that," you have no idea which layer to correct.
When the user starts a new project, every memory from every other project floods in.

These are not retrieval problems. They are category errors dressed up as retrieval problems.

The tier model

The cleanest way to hold all four in your head is as tiers, with increasing durability and decreasing volume.

Information flows both ways. A preference observed in working context gets promoted to durable memory. Durable memory gets compiled into a project pack that gets pinned into working context at the start of the next session. A session gets compacted into a summary that may or may not deserve promotion to durable. Whether it does is a governance decision, not a storage decision.

This is MemGPT's big insight, reframed. Your LLM is a CPU, your context window is RAM, and everything else is storage at different tiers with different access patterns.

Two orthogonal dimensions: class and scope

Once you have tiers, you need two more distinctions.

Class, the shape of a memory

A user preference is not the same thing as a project decision, an artifact digest, or a speculative hypothesis. Each has different fields, lifecycle rules, and retrieval semantics.

A working taxonomy:

fact — a semantic claim about the world
preference — a user's standing choice
decision — a project ruling with rationale and alternatives considered
artifact — a reference to a canonical source with version
procedure — a how-to or runbook
policy — a rule that must be followed
hypothesis — a speculative claim, explicitly non-authoritative
episode — a time-bound event trace

Why this matters: retrieval should behave differently for each. A decision's rationale should be preserved in full. A hypothesis should never surface as if it were fact. A preference should supersede its predecessor instead of appending. An artifact should know its source version so you can invalidate it when the source changes.
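Here is one hedged way to make class change behavior. It is a sketch, not how any of the systems named here implement it. The `key` field, the `TypedStore` class, and the `[unverified hypothesis]` label are all assumptions. It shows two of the rules above: a preference supersedes its predecessor, and a hypothesis never renders as plain fact.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class MemoryClass(Enum):
    FACT = "fact"
    PREFERENCE = "preference"
    DECISION = "decision"
    ARTIFACT = "artifact"
    PROCEDURE = "procedure"
    POLICY = "policy"
    HYPOTHESIS = "hypothesis"
    EPISODE = "episode"


@dataclass
class Memory:
    id: int
    cls: MemoryClass
    key: str  # hypothetical: what the memory is about, e.g. "answer_length"
    text: str
    superseded_by: Optional[int] = None


class TypedStore:
    def __init__(self) -> None:
        self._items: list[Memory] = []

    def write(self, cls: MemoryClass, key: str, text: str) -> Memory:
        new = Memory(id=len(self._items) + 1, cls=cls, key=key, text=text)
        if cls is MemoryClass.PREFERENCE:
            # A new preference replaces the live one with the same key
            # instead of piling up next to it.
            for old in self._items:
                if old.cls is cls and old.key == key and old.superseded_by is None:
                    old.superseded_by = new.id
        self._items.append(new)
        return new

    def current(self) -> list[Memory]:
        # Superseded records stay in the store for history but are not retrieved.
        return [m for m in self._items if m.superseded_by is None]


def render(memory: Memory) -> str:
    # A hypothesis is always labeled so it cannot read as settled fact.
    if memory.cls is MemoryClass.HYPOTHESIS:
        return f"[unverified hypothesis] {memory.text}"
    return memory.text
```

The old preference is kept, not deleted. Supersession is a link, which is what lets you answer "what did the agent believe last month" later.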

Scope, the boundaries on visibility

Scope is orthogonal to class, but it is not a single linear ladder. It is better modeled as multiple intersecting dimensions, each with its own hierarchy. Any given memory belongs to a set of coordinates, and effective retrieval is the intersection across them.

Take an image-generation agent named Pixel. Her memory lives along several axes at once.

Craft/domain axis. Camera technique, composition rules, color theory. Stable across every job she ever takes.
Agent axis. Pixel's own operating style, tone, voice. Distinct from the other agents in the same workspace.
Project-type axis. Sports photography and portrait sessions have different conventions, lighting defaults, pacing. A memory good for one is wrong for the other.
Client/relationship axis. What this specific client likes, what they rejected last time, their brand guidelines.
Task/thread axis. What just happened in the current conversation.
Identity axis. User → team → workspace → org.

When Pixel retrieves memory for a turn, the relevant set is not "everything under a node in a tree." It is the intersection of her current coordinates along every axis. This agent + this project-type + this client + this task, plus the standing craft memory that applies to all of them.

Claude Code's precedence model is a simplified linear version of this, where more specific memory files override broader ones (for example, project-level guidance can override user-level guidance). Letta's attachable memory blocks are closer to the real shape. Each block is a coordinate and an agent's effective memory is the union of the blocks currently attached to it. Google Vertex AI Memory Bank formalizes the access side with IAM conditions over scope expressions.

The practical rule every serious system ends up at. Scope must fail closed. If you cannot resolve a retrieval's coordinates, refuse; do not widen. Most privacy leaks in memory systems are silent scope-widening in disguise. "Just one more fallback" becomes "why did the agent just tell me about my coworker's project?"
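A minimal sketch of intersection retrieval that fails closed. The axis names, the dict-shaped memories, and the `ScopeUnresolved` exception are assumptions picked to match the Pixel example, not a real library's interface. A memory that leaves an axis out is treated as standing memory that applies to every value on that axis.

```python
AXES = ("agent", "project_type", "client", "thread")


class ScopeUnresolved(Exception):
    """Raised when the caller cannot say where on every axis it currently is."""


def retrieve(memories: list[dict], coords: dict[str, str]) -> list[str]:
    # Fail closed: an unresolved axis is an error, never a reason to widen.
    missing = [axis for axis in AXES if not coords.get(axis)]
    if missing:
        raise ScopeUnresolved(f"unresolved axes: {missing}")

    hits = []
    for memory in memories:
        scope = memory["scope"]
        # The memory must match the current coordinate on every axis it names.
        if all(scope.get(axis) in (None, coords[axis]) for axis in AXES):
            hits.append(memory["text"])
    return hits
```

The tempting alternative is to skip any axis you could not resolve. That one line of leniency is the silent scope-widening described above.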

Classes and scopes interact: conditional memories

The class/scope split is useful, but in practice it is not fully orthogonal. Some of the most valuable memories are conditional. Their applicability depends on the current coordinates along other axes.

"My boss prefers green" is rarely universally true. It is "my boss prefers green when we're pitching an environmental client." The same human has different preferences for retail pitches, healthcare, legal. Surface that memory in the wrong context and it becomes noise or, worse, a confident wrong answer.

This shows up everywhere real agents work.

A decision that only applies when the project type is X.
A policy that only fires when handling a certain data category.
A procedure that varies by client.
An episodic pattern that only generalizes within a narrow condition.

Current systems handle this inconsistently. Mem0 attaches arbitrary metadata and lets you filter at retrieval time. LangGraph uses composite namespaces. Letta's attachable blocks are a rougher analogue: attach the "environmental-client" block to the right agents in the right projects, and its contents become active. Google Vertex AI Memory Bank supports scope expressions that can encode conditions directly.

The underlying insight. A memory's retrievability is often a predicate over the current context, not a single label. A cleanly typed preference can still need "applies-when: client.type = environmental" attached to it. Good systems make that predicate explicit and inspectable. Weak ones leave it implicit in how the retrieval query happens to be built. Which is exactly how you get a "preference" that is accidentally right most of the time and confidently wrong the rest.
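One way to make that predicate explicit, sketched under assumptions: the `applies_when` field and the dotted-path lookup are my own illustration, not a feature of any system named above. The condition is stored as data on the memory, so you can read it, log it, and test it.

```python
from dataclasses import dataclass, field
from typing import Any


@dataclass
class ConditionalMemory:
    text: str
    # Hypothetical field: dotted context path -> required value.
    applies_when: dict[str, str] = field(default_factory=dict)


def lookup(context: dict, dotted_path: str) -> Any:
    """Resolve a path like "client.type" in a nested dict, or None if absent."""
    node: Any = context
    for part in dotted_path.split("."):
        if not isinstance(node, dict) or part not in node:
            return None
        node = node[part]
    return node


def applies(memory: ConditionalMemory, context: dict) -> bool:
    # Every stated condition must hold. A memory with no conditions always applies.
    return all(
        lookup(context, path) == wanted
        for path, wanted in memory.applies_when.items()
    )
```

With this shape, "my boss prefers green" becomes `ConditionalMemory("Boss prefers green", {"client.type": "environmental"})`, and it stays out of a retail pitch because the predicate says so, not because the query happened to miss it.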

Pipelines are not tiers

One more distinction is worth naming. Tiers (working, session, compiled, durable) are where memory lives. Pipelines are the processes that move information between tiers, and they are orthogonal to the tier model.

Two pipelines worth knowing by name because they show up under different labels in almost every serious system.

Dreaming. A process that runs over raw session traces and recall evidence and produces compiled artifacts: summaries, reflections, thematic groupings, consolidated briefs. The output of dreaming is compiled context, not durable memory. Its job is to produce better inputs for the next tier up. The Generative Agents paper calls this "reflection." Vertex AI Agent Engine Memory Bank frames a similar idea as consolidation.

Habit formation. A separate process that decides what, if anything, from the compiled tier should become durable. This is where reinforcement curves, frequency thresholds, confidence gates, and conflict resolution live. When a preference is observed repeatedly, or a decision is confirmed, a promotion pipeline writes it into durable memory with provenance, class, and supersession links. Mem0's information-extraction, conflict-resolution, and storage pipeline is this. Claude Code's auto-memory is a lightweight version.
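A toy promotion gate, to make the idea concrete. The `Candidate` fields, the `"session"` and `"dream"` source labels, and the threshold of three are all assumptions. The one rule it tries to show is that repetition only counts when it comes from raw session evidence, so dreaming output cannot promote itself.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    text: str
    observations: int    # how many times the pattern was seen
    user_confirmed: bool  # did a person explicitly confirm it?
    source: str          # hypothetical labels: "session" or "dream"


def should_promote(candidate: Candidate, min_observations: int = 3) -> bool:
    """Decide whether a compiled-tier candidate may become durable memory."""
    if candidate.user_confirmed:
        # Explicit confirmation is enough regardless of where it came from.
        return True
    if candidate.source == "dream":
        # Dreaming output is compiled context. Frequency inside summaries
        # is not evidence, so it never promotes on repetition alone.
        return False
    return candidate.observations >= min_observations
```

A real pipeline would also attach provenance, class, and supersession links at write time. This sketch covers only the yes-or-no decision.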

Other common pipelines: compaction (session → compiled summary), extraction (raw transcript → typed durable memory), reflection (episodic traces → higher-level insight), invalidation (marking durable memories stale when source changes or supersession is detected), and revalidation (checking whether a memory's source version still matches).
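Revalidation is the easiest of these to sketch. Assuming each artifact memory records the source version it was built from (the `ArtifactMemory` type and the `current_versions` lookup are hypothetical), the check is a comparison plus a stale flag.

```python
from dataclasses import dataclass


@dataclass
class ArtifactMemory:
    digest: str
    source_id: str
    source_version: str
    stale: bool = False


def revalidate(
    memories: list[ArtifactMemory], current_versions: dict[str, str]
) -> list[ArtifactMemory]:
    """Mark memories stale when their source moved on. Returns the newly stale ones."""
    newly_stale = []
    for memory in memories:
        if memory.stale:
            continue
        # A source that no longer exists also fails the check, on purpose:
        # an unverifiable memory is treated as stale, not as current.
        if current_versions.get(memory.source_id) != memory.source_version:
            memory.stale = True
            newly_stale.append(memory)
    return newly_stale
```

Marking stale is deliberately separate from deleting. What happens to a stale memory (hide it, re-extract it, ask the user) is a policy question for a different pipeline.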

The important move is to keep tiers and pipelines cleanly separated in your head. A tier is a place. A pipeline is a policy. Most memory bugs are either a tier confusion (treating a compaction summary as durable truth) or a pipeline confusion (letting frequency alone justify durable promotion, which is how systems end up "remembering" their own dream output as fact).

Why this frame matters

Almost every serious problem in agent memory turns out to be one of four errors in disguise.

Tier error. Treating a compaction summary like durable truth. Letting the context window double as the memory model. Dumping raw transcripts into the durable store.
Class error. Storing a hypothesis next to a fact with equal retrieval weight. Treating all memories as flat text instead of typed records.
Scope error. Pulling the wrong project's decisions into the current thread. Letting a private scratchpad promote into shared memory.
Lifecycle error. Never invalidating. Never superseding. Never noticing the source changed. Never distinguishing stale from current.

These are the categories. Storage and retrieval are tactics you choose after you have made these decisions, not before.

Where this goes from here

The field agrees more than you might think on the shape of memory: multi-tier, typed, scoped, conditional. Where it diverges is on governance, promotion policy, and how much of the memory logic belongs in the runtime versus the store.

That is the thread I want to pull on next. What Letta, LangGraph, Mem0, Google Vertex AI Memory Bank, Claude Code, AutoGen, and OpenAI's Agents SDK all independently discovered about how to build these layers, and where they meaningfully diverge when it actually matters.

If you are building any of this, the single most useful thing you can do right now is stop saying "memory" and start saying which of the four you mean.

We’re Using AI to Replace Learning, Not Transform It

Mark Hayden — Sun, 19 Apr 2026 12:00:00 GMT

My default position on AI and those just getting their careers started has not exactly been optimistic.

I have spent a lot of time thinking it would flatten curiosity, replace effort with prompts, and train people to expect magic on demand. If every hard problem gets met with an instant answer, what happens to patience? What happens to struggle? What happens to the part of learning where you have to sit in confusion long enough for something to click?

That version of the future still worries me.

It is not hard to imagine a generation that becomes less capable because they have been taught to demand a response to every blank page, every equation, every essay, and every question, without ever understanding what it takes to get there. If the machine always rescues you, eventually you stop building the muscles required to rescue yourself.

That is the cynical view, and for a while I thought it was the only honest one. Lately I have been thinking there is another version of this story, and it is a lot more interesting. The best case for AI in education is not that it makes learning easier. It is that it makes learning fit.

The boring science nobody argues with

None of this is a new argument. It has just never been practical at scale.

In 1984, Benjamin Bloom published what became known as the "2-sigma problem." Students who got one-on-one tutoring with mastery checks along the way dramatically outperformed students in a standard classroom. That is not a small lift. Roughly the gap between an average student and one in the top 2%.

The follow-up research never fully reproduced that leap, but the direction held. VanLehn's 2011 review found that even intelligent tutoring software moved students from average toward roughly the top quarter of the class. Human tutors landed in the same neighborhood. Dunlosky's 2013 review catalogued what actually moves the needle and kept landing on the same boring winners. Retrieval practice. Spaced repetition. Interleaving. Worked examples. Not lectures. Not worksheets. Not personality quizzes.

The bottleneck was never knowing what works. The bottleneck has always been attention per student. One teacher cannot simultaneously tell thirty different stories about the same concept, pace thirty different timelines, and diagnose thirty specific misunderstandings in real time.

AI might be the first tool with a realistic shot at that kind of scale.

Nobody is a "visual learner"

One thing out of the way first, before this starts sounding like a BuzzFeed quiz.

The idea that some people are "visual learners" while others are "auditory" or "kinesthetic" is not supported by the evidence. The Association for Psychological Science put it plainly. There is no credible case that matching teaching style to a student's preference improves learning. A 2008 paper walked through the research and found the effect was close to zero. A 2017 follow-up put it even more bluntly. The title was "Stop Propagating the Learning Styles Myth."

Around 90 percent of teachers still believe in it anyway.

So when I say "fit," I do not mean matching some fixed sensory preference in a student's head. I mean something more grounded. Meeting the student at their actual prior knowledge. Pacing to their actual speed. Swapping the representation of an idea when the first version fails to land. Content determines the best representation, not the learner's self-reported type. Geometry needs diagrams. Poetry needs sound. History needs narrative. None of that is controversial, and none of it requires the learning-styles myth.

What "fit" actually looks like

For as long as school has existed in its modern form, students have mostly been asked to adapt themselves to the delivery mechanism. Same lecture. Same worksheet. Same pacing. Same explanation. Same tests. Some students happen to match the teacher and thrive. Some students survive. Some students quietly decide they are bad at a subject when really they were just handed it in the wrong language.

That part matters more than people admit.

A good AI-driven learning system does not just re-explain the material. It checks understanding constantly. It adapts difficulty in real time. When the first example does not land, it pulls up another one from a different angle. It drops the level when the student is lost and lifts it when they are bored. It catches the specific error, like "you forgot the negative sign in step two," instead of just marking the whole answer wrong.

That is the part that keeps sticking with me.

So much of education gets blamed on the student when the real failure is the delivery system. We treat mismatch like inability. We confuse boredom with incompetence. We let one bad fit with a teacher shape an entire relationship to a subject.

I’ve generally done well in school and beyond. Geometry was the one class that never really clicked. It wasn’t a lack of effort, and it wasn’t ability. The pace and the delivery just never lined up. Looking back, there was no mechanism for translating the material into a form that actually worked for me. No way to try the concept from a second angle. No way to check whether the first angle had even landed. No way to notice that I had been quietly lost since the first two weeks of the semester.

That is not a small thing.

There are probably millions of people walking around with fake conclusions about themselves because school delivered a subject badly and called the result merit.

The uncomfortable part

The current evidence on AI in education is not a clean story. It is also not all on my side.

A 2025 MIT Media Lab study put students through essay-writing tasks while measuring their brain activity with EEG. The group using ChatGPT showed the weakest neural connectivity in the study. Their sense of ownership over their own work dropped. By the final session, many were mostly copy-pasting. The paper describes the LLM users as "consistently underperforming at neural, linguistic, and behavioral levels."

Wharton researchers, led by Bastani in 2024, tested GPT-4 as a math tutor with Turkish high-school students. In-session performance jumped by roughly 48 percent with the AI available. Then they pulled the AI away and sat the students for a closed-book exam. Performance dropped about 17 percent compared to students who had never used it. Classic pattern. Scaffolding becomes a crutch.

In a 2025 survey of 319 knowledge workers, Microsoft Research and Carnegie Mellon found that the more confident people were in the AI, the less critical-thinking effort they applied, and the more homogenized their output became.

And Khan Academy's own 2024 efficacy report on Khanmigo and follow-up analyses in EdWeek show some engagement gains and mixed learning outcomes. Not the revolution the initial pitch suggested.

Read together, those studies do not kill the thesis. They sharpen it. The same thing that makes LLMs feel magical is the thing that undermines learning. Fluent, confident prose on demand. Instant answers. No friction. That is the opposite of what the research says actually works. If the AI hands the answer over too quickly, the student never builds the thing in their own head.

So the real question is not "does AI help learning." It is whether we can build products that enforce retrieval, productive struggle, and mastery instead of shortcutting them. Those are very different systems. Most of what is shipping today is the wrong kind.

The pattern is older than AI

Audrey Watters has been writing about this for years in Hack Education and Teaching Machines. Every wave of educational technology since B.F. Skinner's teaching machines in the 1950s has promised some version of "personalized learning at scale" and has mostly delivered worksheets in a new package. The warning in her work is not that the promise is fake. It is that the hype keeps running about a decade ahead of the reality, and the casualties in between are real kids.

Dan Meyer, who has spent a long time thinking about how people actually learn math, has made a related point about current AI tutors. They default to hint-giving and answer-shaping that short-circuits the productive struggle that is where math learning actually happens. A fluent, confident chatbot that helps a student past every moment of discomfort is not a tutor. It is an escape hatch.

None of that means the idea is fake. It means the first generation of products is mostly missing the point.

The garbage around AI is the bigger threat

Honestly, the bigger danger to younger generations is not AI in classrooms. It is the garbage surrounding AI everywhere else.

The real poison is clickbait. It is bullshit videos, fake gurus, thinly disguised vaporware, and an endless stream of people promising impossible outcomes because hype pays better than honesty. That ecosystem trains kids to chase shortcuts, confuse marketing with substance, and expect results without contact with reality. It is not education. It is monetized delusion.

And the effect of that stuff is not abstract. It changes plasticity. It changes what people believe work even is. If all you see is "build an app in one prompt," "make passive income with no skill," or "launch a company this weekend with AI," eventually your brain starts calibrating itself to fantasy instead of effort.

I see it up close too. Someone has an app idea. Might even scrape together a little bit of funding. Goes hard on mockups and designs. On paper, that all sounds great. In reality, it is window dressing. No grasp of what a real payment system involves. Security is an afterthought. No plan for maintaining the thing after launch. Just enough surface area to look real to someone who does not know what they are looking at. And if one of these does make it to production, the risk lands on actual customers. Some of these ideas can absolutely come to life. I just see far more "this is going to be a disaster" than "this is a real thing."

And the depressing part is that plenty of companies will still reward that kind of thing because they are also being trained by the same hype loop. The issue is not ambition. The issue is a culture that keeps teaching people presentation is the product.

That is where I think a lot of the real danger lives. Not in AI as a learning tool, but in a broader social willingness to lie, posture, and prey on the uninformed because there is always money on the other side of the click.

Where this leaves me

My hot take is not that AI will save education. It is that it might finally expose how bad we have always been at personalizing learning, and how much wasted human potential has been written off as laziness, disinterest, or low ability when the real issue was poor translation.

The risk is still real. The magic-thinking problem is still real. The temptation to use AI as a substitute for effort is still real. The current crop of products mostly makes those risks worse, not better. If the design goal stays "engagement and fluent answers," we will end up with a generation that is more entertained, less capable, and more convinced they understand things they do not.

But if we are serious and honest about this, there is another possibility on the table. A system that enforces retrieval, insists on productive struggle, gates progression on actual mastery, and adapts the representation of a concept until it lands. Not a chatbot that does homework for you. A translation layer between the curriculum and the student.

Maybe AI does not make people dumber than a brick.

Maybe, if we build it with any respect for the actual science of how people learn, it makes it a lot harder for broken systems to keep convincing smart people they are.