In 2020, OpenAI published a paper with a bold title: Language Models are Few-Shot Learners. GPT-3 could be shown a handful of examples in its prompt and perform tasks it was never trained on. Translation, arithmetic, code generation, all from a few demonstrations at inference time. It felt like a breakthrough.

It was. But the title was wrong.

What GPT-3 demonstrated was few-shot prompting, not few-shot learning. Show the model three examples of sentiment classification, and it classifies the fourth correctly. Close the chat window, and it has learned nothing. The "learning" lived in the context window, and the context window is a whiteboard that gets erased after every conversation.

Six years later, this is still true. Every personalisation feature in today's LLMs, from ChatGPT's memory to system prompts to RAG pipelines, is a workaround for the same limitation: the model cannot learn from experience. It can only be reminded of it.

The Few-Shot Illusion
Prompting and learning are not the same thing

What people think happens: three examples in the prompt → the model sees them → the model learns the task → new session → the model still knows it. ✓
What actually happens: three examples in context → stored in the KV cache (volatile, alive only in this context) → the model conditions on them → context cleared, KV cache discarded → new session → back to zero. ✗

Every “memory” feature in today's LLM products works around this: it re-injects the context, it doesn't persist the learning.

What would it take to make the title true? The pieces already exist. They just haven't been connected.

Attention is (almost) gradient descent

Von Oswald et al. (2023) proved that a single linear self-attention layer performs a computation equivalent to one step of gradient descent on a regression loss. Mahankali et al. (2023) showed this is provably optimal for one-layer linear transformers.

The intuition is simple. Standard learning updates weights via a gradient step: w_new = w_old - α · ∇L(w_old, data). In self-attention, KV cache entries from in-context examples create a temporary delta on the model's output, steering behaviour toward the demonstrated pattern. For linear attention, this delta is mathematically what you'd get from one gradient step on those examples.
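
The equivalence is easy to verify numerically. Below is a minimal numpy sketch of the linear case: toy dimensions, unnormalised attention with the in-context inputs as keys, the targets as values, and the test point as query. Everything here is illustrative, not code from the papers.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 4, 16
X = rng.normal(size=(n, d))      # in-context example inputs
w_true = rng.normal(size=d)
y = X @ w_true                   # in-context example targets
x_q = rng.normal(size=d)         # the query token
alpha = 0.1                      # learning rate / attention scale

# One gradient step on L(w) = 0.5 * ||X w - y||^2, starting from w = 0
grad = X.T @ (X @ np.zeros(d) - y)
w_after = np.zeros(d) - alpha * grad      # = alpha * X.T @ y
pred_gd = w_after @ x_q

# Unnormalised linear attention: the query attends over keys X, values y
pred_attn = alpha * np.sum(y * (X @ x_q))  # sum_i v_i * <k_i, q>

assert np.isclose(pred_gd, pred_attn)
```

For linear attention the match is exact; softmax attention only approximates it, which is the "almost" in the heading.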

The KV cache is a transient weight update. Every transformer already has the machinery for learning. It's just trapped in volatile memory.

Attention as a Transient Weight Update
Self-attention implements gradient descent — but only for the duration of the context
Gradient descent: compute the loss L(w, data), take the gradient ∇L(w, data), update the weights w ← w − α·∇L. The Δw is permanent.
Self-attention (ICL): in-context examples are encoded as K, V pairs; the query attends over them via Attention(Q, K, V); the behavioural delta ≈ one step of gradient descent. The Δw is ephemeral and dies with the context.
(Von Oswald et al., 2023; Mahankali et al., 2023. Proven for linear attention; the mental model holds for real transformers.)

The equivalence is exact for linear attention on regression tasks. Real transformers are messier. But the mental model holds, and it makes the bottleneck obvious: the learning mechanism exists, there's just no way to make it permanent.

The two memory systems

LLMs have two places to store information, with fundamentally different properties.

The KV cache is working memory. Every token gets encoded as key-value pairs that subsequent tokens attend over. Fast to write, but bounded by context length and gone when the conversation ends. You can compress it (MEMENTO gets 2.5x reduction through learned summarisation), but you're still squeezing a finite buffer.
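
To make "bounded by context length" concrete, here is the standard back-of-envelope KV cache size calculation. The 7B-class configuration below is hypothetical, chosen only to show the scale, not the numbers of any specific model.

```python
def kv_cache_bytes(seq_len, n_layers, n_kv_heads, head_dim, dtype_bytes=2):
    # 2x for keys and values; one entry per layer, per KV head, per position
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * dtype_bytes

# Illustrative 7B-class configuration (hypothetical, not a specific model)
gb = kv_cache_bytes(seq_len=128_000, n_layers=32, n_kv_heads=8, head_dim=128) / 1e9
print(f"{gb:.1f} GB of cache for a 128k-token context")
```

The cost grows linearly with sequence length; the weights, by contrast, are a fixed allocation no matter how much the model has "experienced."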

Model weights are long-term storage. Billions of parameters, persistent, massive capacity.

Both KV cache and weights do the same job: they modulate activations to steer the model's output. In SFT, you change weights so activations shift to the new distribution. In ICL, the KV cache shifts activations while the weights stay frozen. Think of it as software vs hardware. Context is software running on fixed hardware (the frozen weights). Weight updates redesign the hardware itself. A well-designed general-purpose processor (a well-pretrained model) can run a huge variety of programs (ICL tasks). But when the task requires computational primitives the ISA doesn't have, no amount of clever programming helps. Weight modification adds new instructions to the architecture.

In practice, this means: knowledge in weights is O(1) at inference (baked into the forward pass), while context-based knowledge is O(n) (every generation attends over the full context). Weights are also a vastly more compressed encoding: a LoRA adapter of a few thousand parameters can capture an adaptation that might require millions of context tokens. And weight updates compose multiplicatively across steps, each building on the last, while context additions all flow through the same frozen computation.
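
The compression and O(1) claims can be made concrete with a toy LoRA merge in numpy. The dimensions are illustrative; the point is that once the low-rank update is folded into the base matrix, the forward pass costs exactly what the unadapted model costs.

```python
import numpy as np

rng = np.random.default_rng(1)
d_out, d_in, r = 64, 64, 8
W = rng.normal(size=(d_out, d_in))       # frozen base weight
A = 0.01 * rng.normal(size=(r, d_in))    # LoRA down-projection
B = 0.01 * rng.normal(size=(d_out, r))   # LoRA up-projection
x = rng.normal(size=d_in)

# Adapter path: base plus low-rank update, r*(d_in + d_out) extra parameters
y_adapter = W @ x + B @ (A @ x)

# Merged path: fold the update into W once; inference is unchanged afterwards
W_merged = W + B @ A
y_merged = W_merged @ x

assert np.allclose(y_adapter, y_merged)  # same function, zero inference overhead
```

Here the adapter adds r·(d_in + d_out) = 1,024 parameters against the 4,096 in W; at realistic model sizes the ratio is far more lopsided.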

The Two Memory Systems
LLMs have two places to store information. They have very different properties.

                  KV cache (working memory)     Model weights (long-term memory)
  Capacity        Bounded by context length     Billions of parameters
  Persistence     Dies with the conversation    Permanent until retrained
  Write cost      Free: just more tokens        Gradient descent required
  Self-knowledge  None: must retrieve           Intrinsic: shapes every pass

The self-knowledge gap is the key asymmetry. A model cannot silently miss something in its own weights. It can miss a retrieval.

An obvious objection: why not just make context windows longer? I wrote about this in detail in Context Is Software, Weights Are Hardware. The short version: both KV cache and weights modulate activations, but the frozen weights are a fixed meta-learning algorithm with a ceiling at the pretraining distribution boundary. Even within that ceiling, weights win on efficiency (O(1) vs O(n)), compression, and composability.

The bottleneck isn't just that updating weights is slow. It's that weights are entangled. Billions of parameters jointly encode everything the model knows — language, reasoning, facts, style — all overlapping in the same high-dimensional space. Change one capability and you risk quietly corrupting another. This is why production systems are deliberately designed not to learn online. It's not a missing feature. It's a safety constraint.

Even if weight updates were instant and free, bounded interference would remain: the guarantee that new knowledge doesn't degrade existing competence. That's the real bottleneck.

Unless someone can make weight updates cheap, fast, and interference-safe.

The bridge exists

In February 2026, Sakana AI published "Doc-to-LoRA." It is, in my view, a proof of concept for something much bigger than the paper claims.

A 309-million parameter hypernetwork (Perceiver architecture) takes a document as input and outputs LoRA adapter weights for a target LLM:

  1. Feed the document through a frozen LLM to extract per-layer activations
  2. The hypernetwork reads those activations via cross-attention
  3. It outputs rank-8 LoRA matrices for each target layer
  4. Merge the LoRA matrices into the target LLM

All four steps run in under one second. After injection, the model answers questions about the document without the document in its context window. The knowledge lives in the weights.
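
A toy sketch of those four steps, with stand-in classes for every component. None of these names, shapes, or mechanics come from the paper; the real hypernetwork is a 309M-parameter Perceiver and the real target is a full LLM. The sketch only shows the data flow: activations in, low-rank factors out, merge at the end.

```python
import numpy as np

rng = np.random.default_rng(0)
D, RANK, N_LAYERS = 32, 8, 2   # toy sizes; the real hypernetwork is 309M params

class ToyLayer:
    def __init__(self):
        self.W = rng.normal(size=(D, D)) / np.sqrt(D)

class ToyLLM:
    def __init__(self):
        self.layers = [ToyLayer() for _ in range(N_LAYERS)]
    def activations(self, tokens):
        # step 1: stand-in for per-layer token activations of the frozen LLM
        return [rng.normal(size=(len(tokens), D)) for _ in self.layers]

def toy_hypernet(acts):
    # steps 2-3: stand-in for the Perceiver; reads activations and emits
    # rank-RANK LoRA factors (A, B) for each target layer
    loras = []
    for h in acts:
        pooled = h.mean(axis=0)                                # crude document read
        A = 0.01 * rng.normal(size=(RANK, D)) + 0.01 * pooled  # (RANK, D)
        B = 0.01 * rng.normal(size=(D, RANK))
        loras.append((A, B))
    return loras

llm = ToyLLM()
doc = ["tokens", "of", "the", "document"]
for layer, (A, B) in zip(llm.layers, toy_hypernet(llm.activations(doc))):
    layer.W = layer.W + B @ A   # step 4: merge; the knowledge now lives in W
```

The structural point survives the toy scale: the update applied to each layer is rank-bounded, so its size is independent of document length.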

Doc-to-LoRA Pipeline
A learned function that maps context to weight updates in under one second
Document (any length) → frozen LLM extracts per-layer token activations → Perceiver hypernetwork (309M params, cross-attention) → rank-8 LoRA matrices per layer (< 50 MB total) → merged into the target LLM. The entire pipeline runs in under one second.
Before: document in context. 12+ GB of KV cache, quadratic attention cost, fails beyond the native context window.
After: no document needed. Constant < 50 MB, works at 4x the context window, 83.5% of full-context accuracy.
Charakorn et al., 2026 · Sakana AI · arXiv:2602.15902

The numbers: near-perfect needle-in-haystack at 4x the model's native context length, 83.5% of full-context QA performance with sub-second update time (vs 40+ seconds for standard distillation), under 50 MB constant memory regardless of document length.

But the numbers aren't the point. What Doc-to-LoRA proves is more fundamental: there exists a learnable function f(context) → Δweights that preserves the information in context. A neural network can learn where and how to inject knowledge into an LLM's weights without breaking what's already there.

This is the Optimal Brain Surgeon (Hassibi & Stork, 1993) made constructive. OBS used second-order derivatives to find which weights to remove with minimal damage. Doc-to-LoRA's hypernetwork has learned the inverse: which weights to modify, and how, to add knowledge with minimal disruption. A learned surgeon that builds instead of cuts.

Three threads, one synthesis

Doc-to-LoRA solves single-shot injection but has no agency. What it doesn't show is whether sequential injections compose: inject ten documents, a hundred, and does the model degrade? That's an open question, and it's where the gap between "injection works" and "continual learning works" lives. To close that gap, three threads need to connect.

Thread 1: RL for memory management. Mem-α and Memory-R1 (2025) use RL to train LLM agents to manage their own memory: what to store, what to discard, optimised for downstream performance. It works. But both operate in token-space, storing text in external databases, which means the same retrieval bottleneck.

Thread 2: Learned injection. Doc-to-LoRA proves context → weight transfer is learnable and fast. But the "what to inject" decision is external.

Thread 3: Self-modifying networks. Irie et al. (ICML 2022) demonstrated self-referential weight matrices that modify themselves at runtime. Toy-scale so far, but the theoretical machinery exists.

Three Threads, One Synthesis
Each piece works individually. Nobody has connected them.
RL for memory: learns WHAT to remember. Mem-α, Memory-R1 (2025). Targets token-space, not weights.
Learned injection: learns HOW to inject into weights. Doc-to-LoRA (2026). No agency: injects everything.
Self-modifying networks: the model modifies its own weights. Irie et al. (ICML 2022). Demonstrated at toy scale only.
RL decides what to remember, injection decides how: self-directed continual learning into weights.
Each thread is proven in isolation. The synthesis is the open engineering challenge.

The synthesis: redirect the RL memory research at weight-space instead of token-space. Use Doc-to-LoRA style injection as the mechanism. And recognise that the hypernetwork is scaffolding, destined to be absorbed into the LLM itself.

The absorption trajectory

Three stages:

Stage 1 (past): Manual fine-tuning. A human decides what to train on. An engineer runs gradient descent offline. The model is a passive recipient.

Stage 2 (present): Learned injection. Doc-to-LoRA. A hypernetwork learns how to inject context into weights. Fast and automatic, but the what is still decided externally. The surgeon is skilled but takes orders from someone else.

Stage 3 (future): Self-directed learning. The LLM encounters information during inference, evaluates what's worth keeping via a learned policy, and triggers weight updates using an internal injection mechanism. The hypernetwork is absorbed. The model is both surgeon and patient.

The Absorption Trajectory
The hypernetwork starts outside the model. The endgame is for it to move inside.
Past: a human fine-tuning pipeline decides what the model learns.
Present: context flows through an external hypernetwork into the model; learned injection, external decision.
Future: streaming input, hypernetwork absorbed into the model; self-directed learning.
The key move: the learned injection mechanism shifts from an external component to an internal capability of the model. The model becomes both surgeon and patient.

Stage 3 is an RL problem. State: the model's current weights plus information in context. Action: what to consolidate, which layers, what rank. Reward: future task performance.

The temporal credit assignment is hard. The reward for remembering something today might not arrive for weeks. But this is exactly what RL is built for, and the reward signal is clean: downstream accuracy improves, or it doesn't.

Mem-α and Memory-R1 have already proven the loop works for token-space memory: RL policy → decides what to store → store operation → future tasks test the decision → reward updates the policy. Replace "store operation" with "generate and merge a LoRA adapter via learned injection," and you have the same loop targeting weight-space.
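
A deliberately minimal toy of that loop: information items arrive, a threshold policy decides which to "commit to weights," and a delayed reward tunes the threshold. A finite-difference update stands in for a real RL algorithm, and every name and number is illustrative; nothing here is taken from Mem-α, Memory-R1, or Doc-to-LoRA.

```python
import random
random.seed(0)

COMMIT_COST = 0.4   # each weight update costs something (compute, interference risk)
lr = 0.001

def episode(theta):
    # One deployment window: 50 pieces of information arrive, each with a
    # latent usefulness in [0, 1]; the policy keeps those scoring above theta.
    reward = 0.0
    for _ in range(50):
        u = random.random()
        if u > theta:                   # consolidate into weights
            reward += u - COMMIT_COST   # delayed payoff minus update cost
    return reward

theta = 0.0
for step in range(2000):
    # Finite-difference policy update; averaging a few episodes per side
    # tames the reward noise (the toy version of credit assignment)
    up = sum(episode(theta + 0.05) for _ in range(5)) / 5
    down = sum(episode(theta - 0.05) for _ in range(5)) / 5
    theta = min(max(theta + lr * (up - down) / 0.1, 0.0), 1.0)

# theta drifts toward COMMIT_COST: keep only what pays for its own update
```

The learned threshold settles near the cost of committing, which is the whole idea in miniature: a consolidation policy that stores exactly the information whose future value exceeds the price of the weight update.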

The RL Consolidation Loop
The model learns what is worth remembering through a reward signal tied to future performance
Model encounters information in context → consolidation policy (RL-trained: worth keeping?) → generate LoRA via learned injection → merge into weights (knowledge persists) → future task arrives days or weeks later → reward signal (did the stored information help?) → RL policy update. The temporal gap between storing and reward is the hard RL problem.
Mem-α and Memory-R1 have proven this loop works for token-space memory. Redirecting it at weight-space is the open problem.

Neuroscience got there first

The Complementary Learning Systems theory (McClelland et al., 1995) describes exactly this architecture in biological brains. The hippocampus does fast, ephemeral encoding. The neocortex does slow, persistent storage. During sleep, experiences are selectively replayed from hippocampus to neocortex. Not everything makes it. The brain has a consolidation policy.

The mapping is direct: hippocampus → KV cache, neocortex → weights, sleep consolidation → learned injection, the brain's consolidation policy → the RL-trained weight-update policy.

The Neuroscience Parallel
Complementary Learning Systems theory (McClelland et al., 1995) maps directly onto the LLM architecture

  Brain                                                        LLM
  Hippocampus: fast encoding, limited capacity, episodic       KV cache: fast writes, bounded by context length, volatile
  Neocortex: slow learning, massive capacity, semantic         Model weights: costly to update, billions of params, persistent
  Sleep consolidation: selective replay to neocortex           Learned injection: selective transfer from cache to weights
  Consolidation policy: what is kept, what decays overnight    RL-trained policy: what is committed to weights, what is discarded
  Result: the brain intrinsically knows what it knows          Result: the model knows what it knows, without retrieval

Recent work is making this concrete. SleepGate (2025) proposes sleep-inspired modules for LLMs: key decay, learned gating, consolidation. NeuroDream introduces explicit dream phases for replay. The pieces are moving.

The convergence isn't a coincidence. Evolution and ML research face the same constraint: fast ephemeral memory is cheap but bounded, slow persistent memory is expensive but vast. The optimal solution is a learned transfer policy. Evolution found it 500 million years ago. We're rediscovering it.

What this gets us

If this works, "few-shot learning" stops being a misnomer. A model that encounters information, decides what's worth keeping, commits it to its own weights, and accesses it intrinsically in all future interactions is a model that actually learns from experience. Not conditions on experience. Learns.

Personalisation stops being a hack. Not "prepend your preferences to every prompt," but the model genuinely knowing you, your style, your project context, woven into parameters with no retrieval step.

Knowledge stays current without retraining runs. Context windows become working memory, not the whole story. You don't need a million-token window if the model already knows what it needs to know.

Specialisation stops requiring infrastructure. Today, improving a model in a domain means building verifiers, curating tasks, defining reward functions, and running a training job. This works for broad, verifiable capabilities: math has ground-truth answers, code has test suites. It completely fails for long-tail expertise — the architectural conventions of a specific codebase, the regulatory nuances of a specific industry, how a specific user thinks about problems. You cannot build an RL environment for everything, so those things never make it into weights. A model that learns from its own deployment experience doesn't need any of that infrastructure. It accumulates expertise by doing: failures become training signal, successful strategies consolidate into weights. The RL environment isn't built by engineers. It emerges from the model's interaction with the world.


The open problems are real, but one dominates: bounded interference under sequential updates. Speed is tractable — Doc-to-LoRA already solved it. Decision policy is tractable — RL has the right structure. The hard problem is the guarantee that a model which has updated itself a thousand times still behaves coherently, that new knowledge doesn't silently corrupt old competence. Doc-to-LoRA shows one injection works. Nobody has shown a thousand do.

The other problems matter too: temporal credit assignment in the RL loop, the interpretability gap (token-space memories are human-readable, weight-space memories are opaque), the compute cost of online updates during serving.

But unlike five years ago, we can point at each component and say: this piece works. RL for memory management works. Learned injection works. Self-modifying networks work at small scale. The assembly is the engineering challenge. History suggests it won't stay unsolved for long: attention, LoRA, RLHF, each went from paper to production in under three years.

GPT-3's title was six years early. The few-shot learners are coming. They just need to learn how to sleep.