An AI-powered invoice extraction pipeline for a multi-country utility portfolio: confidence-graded, human-in-the-loop, and honest about what it can and can’t read.
Crestfield runs a multi-country utility portfolio: every month, thousands of electricity, gas and water invoices arrive from five different countries, in five different formats, in five different dialects of the same financial language. Each one had to be opened, read, transcribed into the operations system and reconciled against meter data, by hand. The current process took two full-time operations consultants and still produced enough errors to undermine downstream reporting.
My role was discovery, UX lead and AI integration evaluation: interview the consultants who run the current process, model what an automated pipeline actually needs to do, and design the human-in-the-loop interface that surfaces what the AI can’t solve on its own. Engineering then built against the requirements that came out of the design phase.
I used Claude to structure the initial competitive comparison before primary research began, so discovery conversations could focus on what Crestfield specifically needed rather than what the market already offered.
Four problems tangled into one. None of them was “the AI can’t read PDFs.” All of them were about what happens when reading fails halfway.
Invoices arrived in five country-specific formats: different date conventions, different VAT vocabularies, different layouts, and in some cases different fields entirely. A solution had to handle five dialects of one document type without five separate pipelines.
Two full-time operations consultants read invoices by hand every month. The work was repetitive, error-prone and impossible to scale as the portfolio grew. Their attention was being spent on transcription instead of validation.
Off-the-shelf OCR produced confident wrong answers, extracting a value that looked plausible but was wrong. A wrong field that reads as correct is worse than a missed one, because it disappears into downstream reporting without being flagged.
Some invoices covered a hundred meters across a building portfolio. Existing tools assumed one-row-per-meter and broke silently on these, attributing consumption to the wrong asset or losing whole meters entirely.
A short discovery loop: competitive landscape first to focus the conversation, then primary interviews with the people who do the work today, then a structured engineering hackathon that pitted three extraction approaches against the same real invoices.
What does an operations consultant actually do when they sit down to process a batch of invoices, and which steps are the ones they don’t want a machine touching?
Which of the three obvious extraction approaches (OCR + rules, LLM parsing, or a hybrid) handles the messy edge cases best, and at what cost?
How does an extraction system tell its user “I’m not sure” in a way that’s actually useful, instead of just generating a confidence number nobody reads?
Six interviews across the team that currently processes invoices. The consistent signal was that the bottleneck isn’t reading; it’s checking. Consultants told us they trusted their own reading because they could see the page. They didn’t trust a tool that returned an answer without showing the source.
“I don’t want a system that does my job. I want a system that does the typing, points to the page when it’s not sure, and lets me decide. Anything else, I’ll just re-do its work to double-check it, which is worse than doing it myself.”
Operations consultant, Crestfield Energy
That sentence became the design principle: override, never overrule. The system extracts. The human confirms. The interface always shows the source alongside the answer.

Engineering ran a hackathon evaluating three approaches against the same test set: pure OCR with rule-based field extraction, an LLM-led parse, and a hybrid that uses OCR for the layout-stable fields and an LLM for the ambiguous ones. The same 120 invoices across five countries. The hybrid won at 86%, but more importantly, it was the only approach that knew when it was wrong.
The percentage matters less than the shape of the failures. OCR failed predictably (whole field unreadable). LLM failed by hallucinating plausibly. Hybrid failed by flagging, which is the only kind of failure a human-in-the-loop tool can actually handle.
During the hackathon we noticed extraction accuracy correlated tightly with how often we’d seen a given supplier’s invoice template. Northwind’s layout was machine-learned after eight invoices. A new supplier’s first invoice always scored lower. The implication: parse rates compound. A “vendor fingerprint” (a recognised supplier profile) became a first-class feature, surfaced in the UI so users could see why a field was rated the way it was.
Five iterations of the extraction surface, six iterations of the review queue, and a confidence-grading scheme that survived contact with engineering.

The hackathon had given me three working extraction approaches to design around: pure OCR with rules, an LLM-led parse, and a hybrid. On raw accuracy they were close enough that a percentage alone wouldn't have decided it. What decided it was the shape of the failures. OCR failed where you'd expect, on unreadable fields, but had no way to express doubt. The LLM was the most fluent reader and the most dangerous, because it failed by inventing a plausible answer with no signal that it had. The hybrid was the only concept that produced a failure a human could act on: it flagged. I designed for the hybrid not because it scored highest but because it was the only one honest enough to hand work back.

A confidence number is useless if nobody reads it, so I tied confidence to behaviour instead of display. Three bands, each with a destination: High lands silently in the operations system, Low routes to the review queue, Failed or empty routes to manual entry. The thresholds weren't mine to assert; they were a negotiation. Engineering pushed back hard on where the lines sat, because every percentage point you move the boundary trades reviewer time against the risk of a wrong number slipping through silently. We set the bands against the real distribution of scores from the test invoices rather than round numbers, then agreed they'd move as the model learned. The scheme held because it was a routing rule the pipeline could enforce, not a label the user had to interpret.
Every band maps to a destination: auto-accept, review, or manual. The score does something instead of just being shown.
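The routing rule described above can be sketched in a few lines. This is a minimal illustration, not the shipped logic: the two cut-off values and the names `route`, `Destination`, `HIGH_THRESHOLD` and `FAIL_THRESHOLD` are hypothetical, since the real thresholds were set against the test-set score distribution and expected to move.

```python
from enum import Enum
from typing import Optional


class Destination(Enum):
    AUTO_ACCEPT = "operations_system"
    REVIEW = "review_queue"
    MANUAL = "manual_entry"


# Hypothetical cut-offs for illustration only; the real bands were
# negotiated against the observed distribution of scores.
HIGH_THRESHOLD = 0.85
FAIL_THRESHOLD = 0.30


def route(confidence: Optional[float]) -> Destination:
    """Map a field's confidence to the place it goes next."""
    # Empty or failed extractions carry no usable score.
    if confidence is None or confidence < FAIL_THRESHOLD:
        return Destination.MANUAL
    if confidence >= HIGH_THRESHOLD:
        return Destination.AUTO_ACCEPT
    return Destination.REVIEW
```

The point of the shape is that the score is consumed by the pipeline: moving a band means changing one constant, not retraining the reviewer.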
The system is allowed to say it doesn't know. A designed failure state is safer than a confident wrong answer disappearing into a report.
The obvious goal was full automation, and it was the wrong one. The consultants had told us plainly that a tool they couldn't trust just doubles the work, because they re-check everything it touches. So instead of chasing the last fraction of accuracy I designed the pipeline to be deliberately incomplete: it resolves what it's sure of and routes the rest, around one in seven invoices, into a queue sorted by the cost of getting them wrong. That kept the consultants doing the part of the job they actually valued, validation and judgement, and let the machine take the typing. Being honest about what it can't read, by design, ended up being the feature, not the limitation. It also meant the schema had to generalise across the five country dialects without five separate pipelines, because the queue, not a per-country rulebook, absorbed whatever the shared model couldn't confidently place.
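One way to picture a queue "sorted by the cost of getting them wrong" is below. The cost formula is an assumption made for illustration (invoice amount weighted by how unsure the weakest field is); the source does not say how cost was actually scored, and `QueueItem` and `build_review_queue` are hypothetical names.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass
class QueueItem:
    invoice_id: str
    amount: float          # invoice total in a common currency
    min_confidence: float  # weakest field confidence, 0.0 to 1.0

    @property
    def risk(self) -> float:
        # Assumed proxy: money at stake times the chance the read is wrong.
        return self.amount * (1.0 - self.min_confidence)


def build_review_queue(items: Iterable[QueueItem]) -> List[QueueItem]:
    """Order flagged invoices so the costliest potential errors come first."""
    return sorted(items, key=lambda item: item.risk, reverse=True)
```

Under this assumption a large invoice with one slightly doubtful field can outrank a small invoice that is mostly unreadable, which matches the idea that each row in the queue is one decision ranked by consequence.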
How the upload, review queue and confidence-graded results came together before the final UI.
Finance teams were receiving hundreds of invoices a week, PDFs, scans, phone photos, supplier spreadsheets, and re-typing every number into an ERP by hand. The tools that "did OCR" either dumped raw text or, worse, returned a clean-looking answer that was wrong and never said so. Before any pixels, we mapped the volume, the hands, and where trust broke.
From shadowing AP teams across four finance orgs. Rather than chase three separate goals, we committed to a single model (a confidence score on every extracted value) and let three principles fall out of where a value lands on it.
The biggest shift wasn't accuracy; it was making uncertainty visible. Every extracted value carries a confidence score that drives the UI: high auto-accepts, low gets flagged, fail gets escalated.
That single idea reshaped the data model. An extracted field became {value, confidence, source-region}, not just a string, and everything downstream attaches to it.
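As a rough sketch of that shape, an extracted field might look like the following. Only the three parts (value, confidence, source region) come from the text; the field names, the page-plus-bounding-box representation of a region, and the example values are assumptions.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class SourceRegion:
    page: int                                # 1-based page number (assumed)
    bbox: Tuple[float, float, float, float]  # x0, y0, x1, y1 on that page


@dataclass(frozen=True)
class ExtractedField:
    name: str
    value: Optional[str]            # None when nothing could be read
    confidence: float               # 0.0 to 1.0
    source: Optional[SourceRegion]  # where on the page the value came from

    def __post_init__(self) -> None:
        if not 0.0 <= self.confidence <= 1.0:
            raise ValueError("confidence must be between 0.0 and 1.0")


# Illustrative instance; the numbers are made up for the example.
total = ExtractedField(
    name="total_charge",
    value="4109.48",
    confidence=0.96,
    source=SourceRegion(page=1, bbox=(412.0, 640.0, 520.0, 658.0)),
)
```

Because the source region travels with the value, a review screen can highlight the originating area of the page without a second lookup.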
A black-box "100% done ✓" looks great in a sales demo. Showing amber and red on the same screen looks less magical. We chose the honest version because the people who live in the tool stop trusting anything that overstates itself.
The compromise: lo-fi wireframes early, so stakeholders could feel the confidence model before engineering committed to scoring.
Four passes. The first matched the old tools: one PDF, one upload. Each pass collided with the same reality: invoices arrive in batches, in every format, and the moment you make someone sort or convert them first, you've lost. The shipped intake takes the lot and normalizes after.
Between v2 and v3 we kept solving the wrong problem: how to sort formats at intake. The reframe: the user should never see the format question at all.
The dropzone takes anything; a detection step downstream routes PDFs, scans, images and sheets to the right parser. The seam is invisible to the person dropping files.
Per-format zones would have been cheaper to build, since each parser would own its own entry. We took the harder route so the human-facing surface stayed a single, dumb, forgiving box.
Onboarding a new format now means adding a detector plus parser behind the seam, with zero change to the intake screen.
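The "detector plus parser behind the seam" idea can be sketched as a small registry. Everything here is illustrative: the `register` and `route_file` names are hypothetical, and the parsers are stand-ins that return placeholder strings rather than real extraction.

```python
from typing import Callable, List, Tuple

Detector = Callable[[str, bytes], bool]  # (filename, content) -> is it mine?
Parser = Callable[[bytes], str]          # content -> parsed text (stand-in)

_REGISTRY: List[Tuple[str, Detector, Parser]] = []


def register(name: str, detector: Detector, parser: Parser) -> None:
    """Onboard a format by adding a detector/parser pair."""
    _REGISTRY.append((name, detector, parser))


def route_file(filename: str, data: bytes) -> Tuple[str, str]:
    """Send a dropped file to the first parser whose detector claims it."""
    for name, detect, parse in _REGISTRY:
        if detect(filename, data):
            return name, parse(data)
    # Nothing matched: hand the file back for manual handling.
    return "unsupported", ""


# Detection by file signature where one exists, by extension otherwise.
register("pdf", lambda fn, d: d.startswith(b"%PDF-"), lambda d: "parsed-as-pdf")
register("png", lambda fn, d: d.startswith(b"\x89PNG\r\n\x1a\n"), lambda d: "parsed-as-image")
register("csv", lambda fn, d: fn.lower().endswith(".csv"), lambda d: d.decode("utf-8"))
```

Adding a format under this sketch is one more `register` call; the intake surface never learns that a new parser exists.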
The core of Plio. We had to show not just what was extracted but how sure the machine was, and let that drive what happens next. The fight was over how much uncertainty to surface: hide it for a clean demo, or put green / amber / red on the table where people can act on it.
Raw percentages put the cognitive load on the user. We baked the thresholds into three bands with fixed colours, so the screen answers "do I need to look at this?" before anyone reads a digit.
Replay was the trust unlock: being able to see where on the page a value came from turned "do I believe this number" into a one-glance check.
Sales wanted the green-only view. We held the line on surfacing amber and red because the daily users (the people whose name is on the posting) only trust a tool that admits what it can't read.
Comp: one scoring component renders the meter, band and replay link; every screen downstream reads the same {value, confidence, source} shape.
When the machine isn't sure, a human decides. The whole question of the review screen was: how do you let someone correct a field without losing sight of the document it came from? Three layouts hid the source in some way. The fourth never does.
The human is the authority, not a rubber stamp. Plio proposes; the reviewer confirms or corrects. The interface makes correction the fast path and never auto-posts a flagged field on the user's behalf.
The split-screen is non-negotiable: the document and the data are always co-present, because every correction is really a question about the source.
On a laptop the two panes are tight, and a phone can't hold both. We shipped desktop-first and deferred a mobile review mode rather than compromise the side-by-side on the surface where the work actually happens.
The queue does the heavy lifting: by routing only around 20% to a human, the expensive split-screen is only ever opened for the invoices that earn it.
Two halves of the same maturity story. Failure: make the limits legible, so a half-read invoice is a glance not an investigation. Scale: split one bill across a hundred meters, learn each vendor's layout, and normalize every country's dialect into one schema.
Failure anatomy and the honest fingerprint come from the same belief as the confidence model: the tool earns trust by being clear about what it doesn't know (a red cell, an amber vendor), not by hiding it.
Attribution and dialect are where that honesty meets scale: hard, unglamorous problems (100-meter splits, five tax regimes) treated as first-class, not afterthoughts.
Recognition that compounds is only powerful once a vendor has volume. The first invoice from a new supplier is genuinely the weakest, and we chose to show that (amber fingerprint) rather than fake confidence on day one.
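A cold-start rule like this could be sketched as a simple counter per vendor. The cut-off of eight invoices is an assumption borrowed from the Northwind anecdote, not a documented threshold, and the class and status names are hypothetical.

```python
from collections import Counter

# Assumed cut-off: treat a layout as learned after this many invoices.
LEARNED_AFTER = 8


class VendorFingerprints:
    """Tracks how many invoices have been seen per vendor."""

    def __init__(self) -> None:
        self._seen: Counter = Counter()

    def record(self, vendor: str) -> None:
        self._seen[vendor] += 1

    def status(self, vendor: str) -> str:
        count = self._seen[vendor]  # Counter returns 0 for unseen vendors
        if count == 0:
            return "unknown"
        # New suppliers read amber until enough volume has accumulated.
        return "green" if count >= LEARNED_AFTER else "amber"
```

A vendor with three recorded invoices reports `amber` under this sketch, which is the honest cold-start behaviour the text describes.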
Comp: the fingerprint, anatomy and dialect views all read the same per-field confidence shape; the scoring component pays off four screens later.
Where we held the line, where we bent, and what we cut from scope. Every row here represents a debate that's still defensible today.
| Decision | What we picked | What we gave up | Why |
|---|---|---|---|
| Intake | Kept: One dropzone, any format, whole batch, detect & route server-side. | Cut: PDF-only upload; per-format tabs that made users sort their own inbox. | Users think in "these invoices," never in formats. The door has to be dumb and forgiving. |
| Confidence | Kept: A score on every field, surfaced as three colour bands that drive the action. | Cut: The black-box "100% done ✓" that demos beautifully and posts silent errors. | A confident wrong answer is the worst answer. Honest uncertainty is the whole product. |
| Score display | Kept: Bands plus meters plus labels (High / Review / Failed). | Cut: Raw percentages that make the user threshold every row in their head. | The screen should answer "do I need to look at this?" before anyone reads a digit. |
| Review layout | Kept: Split-screen, source pinned left, fields right, always co-present. | Cut: Form-only, modal-over-doc, and vertical stacking; each hid the source somehow. | Every correction is a question about the document. You can't answer it if you can't see it. |
| Authority | Kept: Override, never overrule: Plio proposes, the human confirms or corrects. | Deferred: Fully auto-posting flagged fields without a person in the loop. | The numbers post to a ledger with someone's name on them. Judgment stays human for v1. |
| Queue split | Kept: Auto-clear the confident ~80%; route only exceptions to a reviewer. | Cut: Sending every invoice through human review "to be safe." | Reviewing what's already certain wastes the one scarce resource, human attention. |
| Failure display | Kept: Paint confidence onto the document itself, green/amber/red cells. | Cut: A separate "errors" report nobody opens. | A half-read invoice should be triaged in one glance, on the artifact people already read. |
| Vendor learning | Kept: Per-vendor fingerprints that compound; honest cold-start on new suppliers. | Cut: Faking high confidence on a vendor's first invoice. | A 3-invoice vendor that reads amber is telling the truth. That's the brand. |
| Locale | Kept: Five locales normalized to one ISO schema, raw value retained beside it. | Deferred: Long tail of smaller markets to a later release. | DE/FR/GB/NL/ES cover the bulk of European AP volume. Ship those solid, signal the rest. |
| Mobile review | Kept: Desktop-first split-screen where the work actually happens. | Deferred: A phone review mode that can't hold source plus fields together. | Better to do one surface honestly than two badly. The queue keeps desktop load small. |
Three things we'd front-load on the next engagement with the same shape of problem.
Confidence-graded extraction, a split-screen review queue, attribution across multi-meter invoices, and a vendor-fingerprint surface that explained why a field was rated the way it was.
| Meter ID | Site | Period | kWh | Charge | Match |
|---|---|---|---|---|---|
| M-A104 | Exchange House · Floor 1 | Mar 2025 | 4,820 | £684.44 | Auto · 98% |
| M-A105 | Exchange House · Floor 2 | Mar 2025 | 5,140 | £729.88 | Auto · 96% |
| M-A106 | Exchange House · Floor 3 | Mar 2025 | 4,990 | £708.58 | Auto · 94% |
| M-A107 | Exchange House · Basement | Mar 2025 | 3,210 | £455.82 | Auto · 87% |
| M-A108 | Exchange House · Plant room | Mar 2025 | 6,420 | £911.64 | Auto · 99% |
| M-A109 | Exchange House · Common | Mar 2025 | 4,360 | £619.12 | Manual |
| Total · 6 meters | | | 28,940 | £4,109.48 | |
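A multi-meter split like the one in the table is only safe if the per-meter rows add back up to the invoice total. The check below is a sketch of that reconciliation using the figures from the table; the `reconcile` function is a hypothetical name, and whether the shipped pipeline ran this exact check is an assumption.

```python
from decimal import Decimal
from typing import List, Tuple

# (meter id, kWh, charge in GBP), copied from the table above.
ROWS: List[Tuple[str, int, Decimal]] = [
    ("M-A104", 4820, Decimal("684.44")),
    ("M-A105", 5140, Decimal("729.88")),
    ("M-A106", 4990, Decimal("708.58")),
    ("M-A107", 3210, Decimal("455.82")),
    ("M-A108", 6420, Decimal("911.64")),
    ("M-A109", 4360, Decimal("619.12")),
]

STATED_KWH = 28940
STATED_CHARGE = Decimal("4109.48")


def reconcile(rows, stated_kwh: int, stated_charge: Decimal) -> bool:
    """True when the per-meter rows sum exactly to the invoice totals."""
    kwh = sum(row[1] for row in rows)
    # Decimal avoids float rounding drift on currency sums.
    charge = sum((row[2] for row in rows), Decimal("0"))
    return kwh == stated_kwh and charge == stated_charge
```

Dropping any one meter from `ROWS` makes the check fail, which is exactly the "lost a whole meter" case the older tools missed silently.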
Drag a month’s inbox onto the page; Plio fans the files out and starts parsing in parallel.
80% auto-cleared, 20% triaged. Each row is one decision, sorted by the cost of getting it wrong.
Throughput, queue depth and SLA in one strip, so ops see the system’s health without opening a dashboard.
Five locales map to one schema. Adding a sixth country is a profile drop-in, not a code change.
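A "profile drop-in" could look something like the sketch below, where each country is a small data record and the normalizing code never changes. The profile fields, the two example profiles, and the specific date and number conventions shown are illustrative assumptions, not the product's actual configuration.

```python
from dataclasses import dataclass
from datetime import datetime
from decimal import Decimal
from typing import Dict


@dataclass(frozen=True)
class LocaleProfile:
    date_format: str    # strptime pattern for invoice dates
    thousands_sep: str
    decimal_sep: str


# Illustrative profiles; real ones would carry far more (VAT terms, etc.).
PROFILES: Dict[str, LocaleProfile] = {
    "DE": LocaleProfile("%d.%m.%Y", ".", ","),
    "GB": LocaleProfile("%d/%m/%Y", ",", "."),
}


def normalize(country: str, raw_date: str, raw_amount: str) -> dict:
    """Convert locale-specific strings to one schema, keeping the raw values."""
    profile = PROFILES[country]
    iso_date = datetime.strptime(raw_date, profile.date_format).date().isoformat()
    cleaned = raw_amount.replace(profile.thousands_sep, "").replace(profile.decimal_sep, ".")
    return {
        "date": iso_date,
        "amount": Decimal(cleaned),
        "raw_date": raw_date,      # retained beside the normalized value
        "raw_amount": raw_amount,
    }
```

Under this sketch, supporting another country is a single new entry in `PROFILES`, for example a hypothetical `PROFILES["NL"] = LocaleProfile("%d-%m-%Y", ".", ",")`.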
The most important signal came from the engineering phase: the pre-processing filter (which routes unsupported invoices to manual handling rather than forcing them through automation) had been drawn in exactly the right place during design. The model degraded precisely at the boundary the design had already defined.