David Hillier
Insight
January 21, 2026

LLM Recall

Designing AI memory isn’t about collecting more data — it’s about structuring it for recall. Blend precise Q&A with narrative context to create avatars that feel coherent, reliable, and human.

Every conversation you’ve ever had depends on memory you didn’t consciously supply. When we meet someone new, we don’t hand them a file called “me.” Our identity emerges gradually, from fragments of memory, shared context, assessment and what we choose to reveal over time. Our brains are machines for assembling a person from incomplete data. Memory isn’t storage; it’s reconstruction, a working model built from context, habit and partial truths.

Capturing the essence of this in an app is daunting, but we’re working with a partner to do exactly that right now, building the system into a personalised avatar companion. One part of onboarding is capturing “memory”: the background knowledge you collect upfront that becomes the avatar’s identity, plus what the user shares in conversation. Our focus is practical UX: how to ask for that information in a way that people will actually finish, and will happily repeat when they want to create more avatars.

It sounds simple until you try to make it reliable, because capturing data and capturing data that the LLM can reliably retrieve later are two different problems. You can collect a lot of information and still end up with an avatar that answers inconsistently, or only answers correctly when the user asks in the exact phrasing you stored.

This is where design and engineering ended up meeting in the middle. UX wanted an onboarding flow that feels like a conversation, not a form. Engineering had two needs: prove the format holds up under recall testing, and make sure the format doesn’t paint us into a corner structurally. Q&A is straightforward: you naturally get searchable key/value pairs. Open-ended prompts are messier: they’re better at capturing “glue”, but they force an early decision about data structure (raw text vs extracted fields vs summaries) because you can’t rely on simple lookup semantics. The goal was to pick a path that preserves flexibility, not one that only works because we contorted the memory system around it.
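
To make that structural fork concrete, here’s a rough sketch of the three options for storing an open-ended response (illustrative types, not a decided schema):

```ts
// Three ways to store an open-ended onboarding response.
type StoredMemory =
  | { kind: "raw"; text: string }                          // keep the answer verbatim
  | { kind: "extracted"; fields: Record<string, string> }  // pull key/value facts out of it
  | { kind: "summary"; text: string };                     // compress it into a shorter narrative
```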

So we built a quick evaluation suite to pressure-test one UX decision: should we gather these “memories” through lots of discrete Q&A pairs, or through fewer open-ended prompts that encourage richer answers?

The two pathways we tested

We tested two onboarding formats for capturing the same underlying “memory” set.

Structured Q&A

This is the form-like approach: you ask a narrow question and store a narrow answer. It naturally produces a clean key/value representation, which is attractive from an implementation standpoint.

Example:

  • “What is your name?” → “Han Solo”
  • “What ship do you fly?” → “The Millennium Falcon”
  • “Who is your co-pilot?” → “Chewbacca”
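
In implementation terms, each answer lands as a narrow, searchable record. A minimal sketch (field names are ours, not a production schema):

```ts
// Variant A: every onboarding answer becomes a discrete, searchable fact.
type MemoryFact = {
  key: string;   // the question, normalised into a stable key
  value: string; // the short answer, stored verbatim
};

const variantA: MemoryFact[] = [
  { key: "name", value: "Han Solo" },
  { key: "ship", value: "The Millennium Falcon" },
  { key: "co_pilot", value: "Chewbacca" },
  // ...and so on for the remaining discrete pairs
];
```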

Open-ended prompts

Here we ask a small number of broader prompts designed to pull multiple facts in a single response, plus the context around them. The output is less obviously structured, but it captures the “glue” that tends to show up in real conversations: relationships, sequencing, motivations.

Example:

  • “Tell me about your ship and partner” → “I fly the Millennium Falcon, a modified Corellian YT-1300 I won from Lando in a sabacc game. My co-pilot is Chewbacca, a Wookiee from Kashyyyk. We’ve been running jobs together for years…”

UX-wise, the difference is immediate: thirty questions feels like work; five prompts feels like a conversation. Engineering-wise, the trade-off is also immediate: Q&A gives you searchable fields by default, while open-ended responses force you to decide how (or whether) to extract structure.

Our working assumption going in was that open-ended prompts would outperform on questions that require synthesis, because the connective tissue is present in the stored text. The concern was whether that would come at the cost of worse direct recall because some onboarding facts still need to come back perfectly.

Why we used Gemini

From earlier work with Google, we’d seen Gemini perform extremely well on retrieval from large contexts, situations where you push in a big dataset and it can still return very specific pieces of information. So for this experiment we built the suite around google/gemini-3-flash. We also used Vercel’s AI Gateway and a standard AI SDK setup so swapping models later would be straightforward.
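
As a rough sketch of that setup (assuming the AI SDK with the gateway provider; the exact wiring in our suite differs in detail):

```ts
import { generateText } from "ai";
import { gateway } from "@ai-sdk/gateway";

// The gateway resolves provider/model strings, so swapping models later
// is a one-line change to this identifier.
const model = gateway("google/gemini-3-flash");

const { text } = await generateText({
  model,
  prompt: "What is your name?",
});
```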

Temperature wasn’t fixed across runs; this suite was designed for directional signal rather than strict reproducibility.

The evaluation setup

We kept the setup deliberately straightforward because the goal wasn’t academic benchmarking; it was a product decision.

We created two “memory bases” that contained the same information, just formatted differently. Naturally we used Han Solo as a test persona because it’s easy to define a stable set of facts and then ask questions that target them.

Variant A was 30 discrete fact pairs in Q&A format. Things like name, ship, co-pilot, origin, who hired him, what happened at Alderaan, what Leia said, and so on.

Variant B took those same 30 facts and embedded them into five open-ended prompts. Each prompt was sculpted to elicit a chunk of coherent narrative: ship and partner, background and joining the Rebellion, relationships with Luke and Leia, dangerous experiences, and personal change. The answers were longer, but there were far fewer prompts.

For memory injection, we took the simplest route: we placed the variant content directly into the system instructions. This isn’t a production memory architecture, but it’s a fast way to test a core question: given the model has access to this information, how reliably does it retrieve it under questioning?
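
In code terms the injection really is that simple. A sketch, not a production memory architecture; `askAvatar` and `memoryBase` are illustrative names:

```ts
import { generateText } from "ai";
import { gateway } from "@ai-sdk/gateway";

// memoryBase holds the full Variant A or Variant B text for the persona.
async function askAvatar(memoryBase: string, question: string): Promise<string> {
  const { text } = await generateText({
    model: gateway("google/gemini-3-flash"),
    // The entire memory base rides along in the system instructions.
    system: `You are the persona described below. Answer in character.\n\n${memoryBase}`,
    prompt: question,
  });
  return text;
}
```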

Then we tested two types of recall.

The first type was direct recall: the simple “lookup” questions. If the user asks “What is your name?”, does the model answer correctly?

The second type was inferential recall, which is the more realistic conversational case, especially for our use case. These are questions where the model has to connect multiple facts or use the narrative context to answer correctly. A direct Q&A store might contain the pieces, but the question expects the model to join them into a coherent explanation. This is the gap you hit in real products when a user doesn’t ask in the exact structure you stored.

To score answers, we used an LLM-as-judge approach. The judge prompt was intentionally plain: you’re an impartial judge, compare the expected answer to the actual answer, and decide whether the meaning matches. The scoring scale was 0–3, with paraphrasing allowed because we cared about semantic correctness, not whether the model copied the exact same wording. A 3 meant the meaning was correct, a 0 meant contradiction or fabrication.
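
One way to wire that up with the same SDK is a structured-output call; the prompt wording below is paraphrased and the helper is a sketch, not our exact harness:

```ts
import { generateObject } from "ai";
import { gateway } from "@ai-sdk/gateway";
import { z } from "zod";

// Returns a 0–3 score: 3 = meaning matches, 0 = contradiction or fabrication.
async function judgeAnswer(question: string, expected: string, actual: string) {
  const { object } = await generateObject({
    model: gateway("google/gemini-3-flash"),
    schema: z.object({
      score: z.number().min(0).max(3),
      reason: z.string(),
    }),
    prompt:
      "You are an impartial judge. Compare the expected answer to the actual answer " +
      "and decide whether the meaning matches. Paraphrasing is allowed. Score 0-3.\n\n" +
      `Question: ${question}\nExpected: ${expected}\nActual: ${actual}`,
  });
  return object.score;
}
```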

Across all evaluations we used the same harness: 28 questions total (15 direct, 13 inferential). Each evaluation ran 3 runs per variant, so 3 runs × 2 variants = 6 model runs per file. That produces 84 question–answer pairs per variant (28 × 3), which we score and then average per run. All 11 result files used google/gemini-3-flash.
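
The bookkeeping is just averaging: score each of the 28 answers in a run, then summarise overall and by question type. An illustrative helper:

```ts
type Scored = { type: "direct" | "inferential"; score: number }; // 0–3 per question

// Average the 28 judged answers from a single run, overall and per category.
function summariseRun(scores: Scored[]) {
  const avg = (xs: number[]) => xs.reduce((sum, x) => sum + x, 0) / xs.length;
  return {
    overall: avg(scores.map((s) => s.score)),
    direct: avg(scores.filter((s) => s.type === "direct").map((s) => s.score)),
    inferential: avg(scores.filter((s) => s.type === "inferential").map((s) => s.score)),
  };
}
```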

What happened

Variant A (structured Q&A) consistently delivered perfect direct lookup. In the most recent run, Variant A scored 2.18/3 overall, with 3.00/3 on direct recall and 1.23/3 on inferential. This pattern repeats across files: direct recall stays perfect, but inferential questions drop sharply when the model has to connect facts stored as isolated pairs.

Variant B (open-ended narrative) flips that behaviour. In the same most recent run, Variant B scored 2.77/3 overall, with 2.73/3 on direct recall and 2.82/3 on inferential. Across all files, inferential recall stays high (~2.77–2.90, avg ~2.84), while direct recall is slightly lower (~2.67–2.80, avg ~2.73).

The implication is a real trade-off: narratives make the model much better at synthesis, but they can introduce small precision loss on single-fact lookups. Q&A is the reverse: great at exact retrieval, weak at connecting dots.

Why this matters for UX, not just evaluation scores

For this partner's use case, users are not going to interact with the avatar like it's a database. They won’t always ask “Who hired you to go to Alderaan?” They’ll ask “Why did you take that job?” or “How did you end up in the Rebellion?” or “What happened before you met Leia?” They ask sideways, and they expect the avatar to connect context.

If memories are stored as isolated atoms, the model tends to answer direct questions cleanly, but it struggles when the user asks sideways and expects reasoning across multiple facts. If memories are stored as narrative, inferential answers become far more reliable but single-fact lookups can lose a bit of exactness because the fact isn’t represented as an explicit key/value field.

And from a UX standpoint, there’s the practical reality: thirty questions isn’t a huge number, but it feels heavy in onboarding. Five open-ended prompts feel more like a conversation and less like work, even if the total volume written is similar, and it also matches how users will be using our platform in the end.

Bias and limitations worth calling out

This evaluation is intentionally pragmatic, and there are real limitations.

Temperature wasn’t fixed, which means sampling variance is in the mix. That can inflate or reduce performance in ways that aren’t purely attributable to memory format, especially on inferential questions.

There’s also a craftsmanship bias: good open-ended prompts effectively “teach” the relationships between facts. That’s the point, but it also means prompt quality is a major variable. It’s difficult to define a general rule for how many Q&A pairs one open-ended prompt can safely cover before it becomes too long or starts to drop details.

Narratives are not a free win: even with good prompts, we saw a consistent small drop in direct recall under the narrative format, which reinforces the case for keeping precision-critical facts as explicit fields.

The memory injection method is simplified too. Injecting the memory base into system instructions is not the same as a production memory architecture where you might summarise, store, retrieve via similarity search, and re-inject context dynamically. This test answers “format vs recall under available context”, not “format vs recall under real retrieval constraints”.

Finally, LLM-as-judge scoring is a practical tool, not a perfect one. It’s good enough to spot meaningful differences, but it can be inconsistent at boundaries.

What this changed for us

The experiment gave us a clear signal: Q&A is best for precision, while open-ended narrative is best for synthesis. The open-ended format meaningfully increased inferential recall, but it wasn’t perfect on direct recall: exactly the kind of small drop that matters when you need deterministic “profile” facts.

So our direction is a hybrid onboarding flow. We start with a short set of direct Q&A questions for the highest-signal, precision-critical fields (the keys we’ll search and rely on later). Then we follow with one crafted open-ended prompt to capture richer context: the “glue” that improves inferential answering and makes the avatar feel coherent in real conversation.
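
Concretely, the hybrid memory base ends up shaped roughly like this (illustrative field names and values):

```ts
// Hybrid memory: precision-critical facts as explicit fields,
// plus one narrative blob captured from a crafted open-ended prompt.
type AvatarMemory = {
  facts: Record<string, string>; // stable keys we can look up deterministically
  narrative: string;             // the "glue" that supports inferential answers
};

const hanSolo: AvatarMemory = {
  facts: {
    name: "Han Solo",
    ship: "The Millennium Falcon",
    co_pilot: "Chewbacca",
  },
  narrative:
    "I won the Falcon from Lando in a sabacc game. Chewie and I have been " +
    "running jobs together for years...",
};
```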

This avoids shoehorning the memory system into a single representation. We get stable keys where we need them, and narrative context where it pays off, without making onboarding feel like a 30-question form.