In 2020, OpenAI published a paper with a bold title: Language Models are Few-Shot Learners. GPT-3 could be shown a handful of examples in its prompt and perform tasks it was never trained on. Translation, arithmetic, code generation, all from a few demonstrations at inference time. It felt like a breakthrough.
It was. But the title was wrong.
What GPT-3 demonstrated was few-shot prompting, not few-shot learning. Show the model three examples of sentiment classification, and it classifies the fourth correctly. Close the chat window, and it has learned nothing. The "learning" lived in the context window, and the context window is a whiteboard that gets erased after every conversation.
Six years later, this is still true. Every personalisation feature in today's LLMs, from ChatGPT's memory to system prompts to RAG pipelines, is a workaround for the same limitation: the model cannot learn from experience. It can only be reminded of it.
What would it take to make the title true? The pieces already exist. They just haven't been connected.
Attention is (almost) gradient descent
Von Oswald et al. (2023) showed that a single linear self-attention layer can implement a computation equivalent to one step of gradient descent on a regression loss. Mahankali et al. (2023) then proved that this single step is optimal for one-layer linear transformers.
The intuition is simple. Standard learning updates weights via a gradient step: w_new = w_old - α · ∇L(w_old, data). In self-attention, KV cache entries from in-context examples create a temporary delta on the model's output, steering behaviour toward the demonstrated pattern. For linear attention, this delta is mathematically what you'd get from one gradient step on those examples.
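For the linear case, the equivalence can be checked in a few lines. A minimal numpy sketch with toy dimensions (everything here is illustrative): one gradient step from w = 0 on the in-context examples makes exactly the same prediction as a softmax-free attention layer whose keys are the example inputs and whose values are the scaled targets.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n, alpha = 4, 8, 0.1            # feature dim, examples, learning rate

X = rng.normal(size=(n, d))        # in-context inputs
y = rng.normal(size=n)             # in-context targets
x_q = rng.normal(size=d)           # query token

# One gradient step from w = 0 on the loss 0.5 * sum((X @ w - y)**2):
# the gradient at w = 0 is -X.T @ y, so w_1 = alpha * X.T @ y.
w_1 = alpha * X.T @ y
pred_gd = w_1 @ x_q

# Linear (softmax-free) attention: keys = inputs, values = alpha * targets,
# query = x_q. Output = sum_i v_i * (k_i . q).
pred_attn = (alpha * y) @ (X @ x_q)

assert np.isclose(pred_gd, pred_attn)   # identical predictions
```

The two predictions coincide because both reduce to α · Σᵢ yᵢ (xᵢ · x_q); the attention weights play the role of the learning rate times the example targets.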
The KV cache is a transient weight update. Every transformer already has the machinery for learning. It's just trapped in volatile memory.
The equivalence is exact for linear attention on regression tasks. Real transformers are messier. But the mental model holds, and it makes the bottleneck obvious: the learning mechanism exists, there's just no way to make it permanent.
The two memory systems
LLMs have two places to store information, with fundamentally different properties.
The KV cache is working memory. Every token gets encoded as key-value pairs that subsequent tokens attend over. Fast to write, but bounded by context length and gone when the conversation ends. You can compress it (MEMENTO gets 2.5x reduction through learned summarisation), but you're still squeezing a finite buffer.
Model weights are long-term storage. Billions of parameters, persistent, massive capacity.
Both KV cache and weights do the same job: they modulate activations to steer the model's output. In SFT, you change weights so activations shift to the new distribution. In ICL, the KV cache shifts activations while the weights stay frozen. Think of it as software vs hardware. Context is software running on fixed hardware (the frozen weights). Weight updates redesign the hardware itself. A well-designed general-purpose processor (a well-pretrained model) can run a huge variety of programs (ICL tasks). But when the task requires computational primitives the ISA doesn't have, no amount of clever programming helps. Weight modification adds new instructions to the architecture.
In practice, this means: knowledge in weights is O(1) at inference (baked into the forward pass), while context-based knowledge is O(n) (every generation attends over the full context). Weights are also a vastly more compressed encoding: a LoRA adapter of a few million parameters can capture an adaptation that might otherwise require millions of context tokens, each costing thousands of cached floats. And weight updates compose multiplicatively across steps, each building on the last, while context additions all flow through the same frozen computation.
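The compression gap is easy to make concrete. A back-of-envelope comparison, with illustrative sizes (the hidden width, layer count, rank, and context length are assumptions, not any particular model's):

```python
# Rank-r LoRA parameters vs. KV-cache floats for n context tokens.
# All sizes illustrative: a 4096-wide, 32-layer model, rank-8 adapters
# on two projections per layer, and 100k tokens of context.
d, r, n_layers, n_ctx = 4096, 8, 32, 100_000

lora_params = n_layers * 2 * (2 * r * d)   # A (r x d) + B (d x r), two per layer
kv_floats = n_layers * n_ctx * 2 * d       # keys + values, per token, per layer

print(f"LoRA params: {lora_params:,}")     # 4,194,304  (~4.2 million)
print(f"KV floats:   {kv_floats:,}")       # 26,214,400,000  (~26 billion)
print(f"ratio:       {kv_floats // lora_params:,}x")   # 6,250x
```

Even granting that the two encodings store different kinds of information, three to four orders of magnitude is the scale of the gap.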
An obvious objection: why not just make context windows longer? I wrote about this in detail in Context Is Software, Weights Are Hardware. The short version: both KV cache and weights modulate activations, but the frozen weights are a fixed meta-learning algorithm with a ceiling at the pretraining distribution boundary. Even within that ceiling, weights win on efficiency (O(1) vs O(n)), compression, and composability.
The bottleneck isn't just that updating weights is slow. It's that weights are entangled. Billions of parameters jointly encode everything the model knows — language, reasoning, facts, style — all overlapping in the same high-dimensional space. Change one capability and you risk quietly corrupting another. This is why production systems are deliberately designed not to learn online. It's not a missing feature. It's a safety constraint.
Even if weight updates were instant and free, bounded interference would remain: the guarantee that new knowledge doesn't degrade existing competence. That's the real bottleneck.
Unless someone can make weight updates cheap, fast, and interference-safe.
The bridge exists
In February 2026, Sakana AI published "Doc-to-LoRA." It is, in my view, a proof of concept for something much bigger than the paper claims.
A 309-million parameter hypernetwork (Perceiver architecture) takes a document as input and outputs LoRA adapter weights for a target LLM:
- Feed the document through a frozen LLM to extract per-layer activations
- The hypernetwork reads those activations via cross-attention
- It outputs rank-8 LoRA matrices for each target layer
- Merge the LoRA matrices into the target LLM
The whole pipeline runs in under one second. After injection, the model answers questions about the document without the document in its context window. The knowledge lives in the weights.
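The mechanical core of those four steps can be sketched in numpy. The hypernetwork below is a stub (the real one is a 309M-parameter Perceiver reading per-layer activations); the shapes, the mean-pooling, and the 0.01 scaling are all assumptions for illustration, not the paper's architecture.

```python
import numpy as np

# Illustrative sketch of the Doc-to-LoRA steps, with a stubbed hypernetwork.
d, r = 64, 8                            # toy hidden width, LoRA rank
rng = np.random.default_rng(1)

W = rng.normal(size=(d, d))             # one frozen target-layer weight matrix
doc_acts = rng.normal(size=(128, d))    # step 1: per-layer activations from
                                        #   the frozen LLM (stubbed as noise)

def hypernetwork(acts):                 # steps 2-3: read activations,
    h = acts.mean(axis=0)               #   emit rank-r LoRA factors
    A = 0.01 * np.outer(np.ones(r), h)  # (r, d) down-projection
    B = 0.01 * np.outer(h, np.ones(r))  # (d, r) up-projection
    return A, B

A, B = hypernetwork(doc_acts)
W_merged = W + B @ A                    # step 4: merge into the target LLM

assert W_merged.shape == W.shape                 # same layer, new knowledge
assert np.linalg.matrix_rank(B @ A) <= r         # update stays low-rank
```

The key property is in the last two lines: the merged layer has the same shape as the original, and the perturbation is low-rank, which is what keeps the surgery cheap and (in the single-injection case) non-disruptive.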
The numbers: near-perfect needle-in-haystack at 4x the model's native context length, 83.5% of full-context QA performance with sub-second update time (vs 40+ seconds for standard distillation), under 50 MB constant memory regardless of document length.
But the numbers aren't the point. What Doc-to-LoRA proves is more fundamental: there exists a learnable function f(context) → Δweights that preserves the information in context. A neural network can learn where and how to inject knowledge into an LLM's weights without breaking what's already there.
This is the Optimal Brain Surgeon (Hassibi & Stork, 1993) made constructive. OBS used second-order derivatives to find which weights to remove with minimal damage. Doc-to-LoRA's hypernetwork has learned the inverse: which weights to modify, and how, to add knowledge with minimal disruption. A learned surgeon that builds instead of cuts.
Three threads, one synthesis
Doc-to-LoRA solves single-shot injection but has no agency. What it doesn't show is whether sequential injections compose: inject ten documents, a hundred, and does the model degrade? That's an open question, and it's where the gap between "injection works" and "continual learning works" lives. To close that gap, three threads need to connect.
Thread 1: RL for memory management. Mem-α and Memory-R1 (2025) use RL to train LLM agents to manage their own memory: what to store, what to discard, optimised for downstream performance. It works. But both operate in token space, storing text in external databases, which leaves them with the same retrieval bottleneck.
Thread 2: Learned injection. Doc-to-LoRA proves context → weight transfer is learnable and fast. But the "what to inject" decision is external.
Thread 3: Self-modifying networks. Irie et al. (ICML 2022) demonstrated self-referential weight matrices that modify themselves at runtime. Toy-scale so far, but the theoretical machinery exists.
The synthesis: redirect the RL memory research at weight-space instead of token-space. Use Doc-to-LoRA style injection as the mechanism. And recognise that the hypernetwork is scaffolding, destined to be absorbed into the LLM itself.
The absorption trajectory
Three stages:
Stage 1 (past): Manual fine-tuning. A human decides what to train on. An engineer runs gradient descent offline. The model is a passive recipient.
Stage 2 (present): Learned injection. Doc-to-LoRA. A hypernetwork learns how to inject context into weights. Fast and automatic, but the what is still decided externally. The surgeon is skilled but takes orders from someone else.
Stage 3 (future): Self-directed learning. The LLM encounters information during inference, evaluates what's worth keeping via a learned policy, and triggers weight updates using an internal injection mechanism. The hypernetwork is absorbed. The model is both surgeon and patient.
Stage 3 is an RL problem. State: the model's current weights plus information in context. Action: what to consolidate, which layers, what rank. Reward: future task performance.
The temporal credit assignment is hard. The reward for remembering something today might not arrive for weeks. But this is exactly what RL is built for, and the reward signal is clean: downstream accuracy improves, or it doesn't.
Mem-α and Memory-R1 have already proven the loop works for token-space memory: RL policy → decides what to store → store operation → future tasks test the decision → reward updates the policy. Replace "store operation" with "generate and merge a LoRA adapter via learned injection," and you have the same loop targeting weight-space.
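The shape of that loop fits in a few lines. A toy REINFORCE sketch, where "consolidate" stands in for generating and merging an adapter: a Bernoulli policy decides whether to keep each incoming fact, reward arrives only when a consolidated fact is later queried, and the policy gradient pushes the store probability up. Every number here is illustrative and from no paper.

```python
import random

random.seed(0)
p_store, lr, baseline = 0.5, 0.05, 0.0   # policy param, step size, baseline

for episode in range(2000):
    store = random.random() < p_store    # action: consolidate or skip
    queried = random.random() < 0.3      # the fact is needed again later
    reward = 1.0 if (store and queried) else 0.0
    baseline += 0.05 * (reward - baseline)   # running-average baseline
    # REINFORCE: gradient of the log-probability of a Bernoulli action
    logp_grad = (1 - p_store) if store else -p_store
    p_store += lr * (reward - baseline) * logp_grad
    p_store = min(0.99, max(0.01, p_store))

print(round(p_store, 2))                 # drifts toward always storing
```

In this toy, storing always pays off eventually, so the policy saturates. The real problem is exactly the parts this sketch elides: delayed reward, a consolidation budget, and interference between updates.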
Neuroscience got there first
The Complementary Learning Systems theory (McClelland et al., 1995) describes exactly this architecture in biological brains. The hippocampus does fast, ephemeral encoding. The neocortex does slow, persistent storage. During sleep, experiences are selectively replayed from hippocampus to neocortex. Not everything makes it. The brain has a consolidation policy.
The mapping is direct: hippocampus → KV cache, neocortex → weights, sleep consolidation → learned injection, the brain's consolidation policy → the RL-trained weight-update policy.
Recent work is making this concrete. SleepGate (2025) proposes sleep-inspired modules for LLMs: key decay, learned gating, consolidation. NeuroDream introduces explicit dream phases for replay. The pieces are moving.
The convergence isn't a coincidence. Evolution and ML research face the same constraint: fast ephemeral memory is cheap but bounded, slow persistent memory is expensive but vast. The optimal solution is a learned transfer policy. Evolution found it 500 million years ago. We're rediscovering it.
What this gets us
If this works, "few-shot learning" stops being a misnomer. A model that encounters information, decides what's worth keeping, commits it to its own weights, and accesses it intrinsically in all future interactions is a model that actually learns from experience. Not conditions on experience. Learns.
Personalisation stops being a hack. Not "prepend your preferences to every prompt," but the model genuinely knowing you, your style, your project context, woven into parameters with no retrieval step.
Knowledge stays current without retraining runs. Context windows become working memory, not the whole story. You don't need a million-token window if the model already knows what it needs to know.
Specialisation stops requiring infrastructure. Today, improving a model in a domain means building verifiers, curating tasks, defining reward functions, and running a training job. This works for broad, verifiable capabilities: math has ground-truth answers, code has test suites. It completely fails for long-tail expertise — the architectural conventions of a specific codebase, the regulatory nuances of a specific industry, how a specific user thinks about problems. You cannot build an RL environment for everything, so those things never make it into weights. A model that learns from its own deployment experience doesn't need any of that infrastructure. It accumulates expertise by doing: failures become training signal, successful strategies consolidate into weights. The RL environment isn't built by engineers. It emerges from the model's interaction with the world.
The open problems are real, but one dominates: bounded interference under sequential updates. Speed is tractable — Doc-to-LoRA already solved it. Decision policy is tractable — RL has the right structure. The hard problem is the guarantee that a model which has updated itself a thousand times still behaves coherently, that new knowledge doesn't silently corrupt old competence. Doc-to-LoRA shows one injection works. Nobody has shown a thousand do.
The other problems matter too: temporal credit assignment in the RL loop, the interpretability gap (token-space memories are human-readable, weight-space memories are opaque), the compute cost of online updates during serving.
But unlike five years ago, we can point at each component and say: this piece works. RL for memory management works. Learned injection works. Self-modifying networks work at small scale. The assembly is the engineering challenge. History suggests it won't stay unsolved for long: attention, LoRA, RLHF, each went from paper to production in under three years.
GPT-3's title was six years early. The few-shot learners are coming. They just need to learn how to sleep.