This is a personal take on research we published at LatentForce. You can read the full study here.
There's something uncomfortable sitting at the center of AI coding benchmarks that nobody talks about much.

LLMs trained on public code don't just learn patterns from those repos. They effectively memorise them. They know where the files live, what the tricky functions do, which modules are load-bearing. It's baked into their weights. When you ask them to work on a repo they've seen in training, they're not reasoning from scratch. They're recalling.

This is why SWE-bench numbers look so good. A large chunk of those benchmark tasks comes from public repositories the model was trained on. It's an open-book test dressed up as an exam.

The closed book problem

Your enterprise codebase is a closed book. Private, internal, post-cutoff. The model has never seen it. So what happens?

Agents re-read the same files on every task. They miss implicit dependencies. They make confident wrong calls because they're pattern-matching against something superficially similar from training, not actually understanding what's in front of them.

This isn't a failure of intelligence. It's a failure of context. The model is capable. It just doesn't know your codebase.

We ran an experiment

We wanted to quantify this gap rather than just assert it. So we ran a controlled experiment: same tasks, same model, same prompting, but varying whether the repo was in the model's training data or not.

The results were clear: a 60 to 80% performance drop on repos the model hasn't seen versus ones it has.
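To make concrete what a drop of that size means, here's a back-of-the-envelope calculation. The pass rates below are hypothetical placeholders for illustration, not figures from the study:

```python
def relative_drop(seen_pass_rate: float, unseen_pass_rate: float) -> float:
    """Relative performance drop going from a repo the model has seen
    in training to one it hasn't."""
    return (seen_pass_rate - unseen_pass_rate) / seen_pass_rate

# Hypothetical pass rates, purely illustrative:
# 40% of tasks solved on a training-set repo, 10% on an unseen one.
print(f"{relative_drop(0.40, 0.10):.0%}")  # → 75%
```

A model that solves four in ten tasks on a repo it memorised and one in ten on yours sits squarely in that 60 to 80% band.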

That's not a small rounding error. That's the difference between a tool that works and one that doesn't.


What this means

The framing most people use for enterprise AI coding is wrong. The problem isn't model capability. Frontier models are genuinely impressive. It's not prompt engineering either.

It's simpler and harder: the model doesn't know your codebase.

Every task starts from zero. Every agent session is the model encountering your system for the first time, re-reading files it has read a hundred times in previous sessions, reconstructing context that was thrown away when the last session's context window filled up.

That's the real bottleneck. Not intelligence. Context.

What we're building

This is the problem we're working on at LatentForce. The idea is to give coding agents a persistent, structured understanding of your codebase so they're never starting from zero.

Not another RAG pipeline dumping chunks of code into a prompt. A proper semantic graph: architecture, business logic, dependencies, the implicit rules that experienced engineers carry in their heads. Something that keeps both humans and agents oriented as the code evolves.
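As a toy sketch of what such a map could look like (the names and structure here are my own illustration, not LatentForce's actual design), think of modules as nodes, "depends on" as edges, and a place to hang the tribal knowledge that never appears in the code itself:

```python
from collections import defaultdict

class CodebaseMap:
    """Toy semantic map of a codebase: modules as nodes,
    dependencies as edges, plus free-form notes for implicit rules."""

    def __init__(self):
        self.deps = defaultdict(set)   # module -> modules it depends on
        self.rdeps = defaultdict(set)  # module -> modules that depend on it
        self.notes = {}                # module -> tribal knowledge

    def add_dependency(self, module, dependency):
        self.deps[module].add(dependency)
        self.rdeps[dependency].add(module)

    def annotate(self, module, note):
        self.notes[module] = note

    def blast_radius(self, module):
        """Everything that could break if `module` changes
        (transitive reverse dependencies)."""
        seen, stack = set(), [module]
        while stack:
            for dependent in self.rdeps[stack.pop()]:
                if dependent not in seen:
                    seen.add(dependent)
                    stack.append(dependent)
        return seen

# Hypothetical modules, for illustration only:
m = CodebaseMap()
m.add_dependency("billing", "auth")
m.add_dependency("api", "billing")
m.annotate("auth", "Load-bearing: downstream code assumes tokens are valid.")
print(m.blast_radius("auth"))  # modules affected by an auth change
```

The point isn't this particular structure; it's that an agent can query something like `blast_radius` before editing a file instead of re-reading the whole repo to rediscover the answer.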

The model doesn't need to have seen your repo in training. It just needs a map.