Most organisations that come to us for AI delivery help have a working prototype. The model does what it's supposed to do. The demo went well. The exec team is excited.
Then it doesn't ship.
Or it ships, and something breaks in production that never showed up in testing. Or it ships and works technically, but nobody uses it because the workflow integration was an afterthought. Or it doesn't ship because legal review surfaces requirements that nobody thought about when the prototype was being built.
We've seen all of these. The common thread is that the prototype proved the model worked, but nobody was seriously thinking about delivery.
Why prototypes are misleading
A prototype answers one question: can the model do this thing? It doesn't answer the questions that actually determine whether a system ships.
Who owns model drift? How do you know when it's drifting? What's the feedback loop between production behaviour and retraining? How does a human override the model, and is that override auditable? What happens when the input distribution shifts in a way nobody anticipated? What does the rollback process look like?
None of these questions is intellectually difficult. But they're invisible when you're building a prototype, because prototypes are designed to make things look easy.
What we did for one client
A mid-size UK lender came to us with a document intelligence model they'd been building for eight months. It could classify documents, extract key fields, and flag anomalies with reasonable accuracy on their test set. Their legal team had been brought in late, and had surfaced a set of requirements that the model, as built, couldn't satisfy.
The requirements weren't unreasonable: full audit trail of every decision, a defined escalation path for low-confidence outputs, a testing framework that could demonstrate consistent performance across protected-characteristic inputs. Standard stuff for a financial services context. None of it had been in scope for the prototype.
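To make the first two requirements concrete, here is a minimal sketch of what an audit trail plus an escalation path can look like in code. It is an illustration only, not the client's implementation: the names `route_prediction` and `Decision`, the 0.85 threshold, and the in-memory list standing in for an append-only log store are all assumptions made for the example.

```python
import json
from dataclasses import dataclass, asdict
from datetime import datetime, timezone

# Hypothetical threshold; in practice this would be agreed with risk and legal.
REVIEW_THRESHOLD = 0.85


@dataclass
class Decision:
    document_id: str
    label: str
    confidence: float
    route: str  # "auto" or "human_review"
    model_version: str
    timestamp: str


def route_prediction(document_id, label, confidence, model_version, audit_log):
    """Accept a prediction automatically or escalate it, and record the decision."""
    # Low-confidence outputs follow the escalation path to a human reviewer.
    route = "auto" if confidence >= REVIEW_THRESHOLD else "human_review"
    decision = Decision(
        document_id=document_id,
        label=label,
        confidence=confidence,
        route=route,
        model_version=model_version,
        timestamp=datetime.now(timezone.utc).isoformat(),
    )
    # One JSON line per decision; a list stands in for an append-only store here.
    audit_log.append(json.dumps(asdict(decision)))
    return decision


if __name__ == "__main__":
    log = []
    print(route_prediction("doc-1", "payslip", 0.97, "v1", log).route)
    print(route_prediction("doc-2", "bank_statement", 0.40, "v1", log).route)
    print(len(log), "decisions recorded")
```

The point of the sketch is that every decision, including the automatic ones, leaves a record with the model version attached, so a reviewer or regulator can later reconstruct what the system did and why a given document was or wasn't escalated.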
We embedded for 14 weeks. The model itself barely changed. What changed was everything around it: the governance layer, the human review workflow, the testing infrastructure, and the documentation that legal needed to sign off.
The system shipped with zero compliance flags. Throughput improved by 3× over the manual process it replaced. The client's legal team, who had been the main obstacle, became advocates for the system internally.
The actual delivery challenge
The model is the easy part. The hard part is everything that makes it safe to run in production:
Governance — who is accountable for what the model does, and how do you demonstrate that accountability to a regulator?
Integration — how does the AI system connect to the existing workflow, and what happens when it's wrong?
Testing — how do you build confidence in a system whose outputs are non-deterministic? (This is harder than it sounds and deserves its own article.)
Observability — how do you know the system is behaving in production the way it behaved in testing?
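The observability question above can be made concrete with a small sketch. One common way to compare production behaviour with test behaviour is the Population Stability Index (PSI), computed here over predicted labels. This is one option among several, not a prescribed method; the function name, the sample data, and the floor value used to avoid division by zero are assumptions for illustration.

```python
import math
from collections import Counter


def population_stability_index(reference, production, floor=1e-6):
    """PSI between two samples of a categorical value, e.g. predicted labels."""
    if not reference or not production:
        raise ValueError("both samples must be non-empty")
    ref_counts = Counter(reference)
    prod_counts = Counter(production)
    psi = 0.0
    for category in set(reference) | set(production):
        # Floor the proportions so a category missing from one sample
        # does not produce log(0) or a division by zero.
        p_ref = max(ref_counts[category] / len(reference), floor)
        p_prod = max(prod_counts[category] / len(production), floor)
        psi += (p_prod - p_ref) * math.log(p_prod / p_ref)
    return psi


if __name__ == "__main__":
    # Hypothetical label mixes: what the model predicted in testing vs. in production.
    test_labels = ["payslip"] * 80 + ["bank_statement"] * 20
    prod_labels = ["payslip"] * 50 + ["bank_statement"] * 50
    print(round(population_stability_index(test_labels, prod_labels), 3))
```

A PSI of zero means the two label mixes are identical. A commonly quoted rule of thumb treats values above roughly 0.25 as a shift worth investigating, though the right alerting threshold depends on the system and should be agreed as part of governance rather than copied from an example.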
None of this is cutting-edge AI research. It's delivery discipline applied to a new class of system. The organisations that are shipping AI successfully are the ones that figured this out. The ones that aren't are still building prototypes.
If you're working through the prototype-to-production gap, we're happy to talk through where you are.

