
Plio

An AI-powered invoice extraction pipeline for a multi-country utility portfolio: confidence-graded, human-in-the-loop, and honest about what it can and can’t read.

plio / ingestion

Client
Crestfield Energy
Pseudonym used for confidentiality
Timeline
2025
Discovery + UX lead, ongoing
Role
Product designer
Discovery & UX lead
Team
Design lead (me) · 2 full-stack developers · 1 product manager · 2 AI/ML engineers

The brief

Crestfield runs a multi-country utility portfolio: every month, thousands of electricity, gas and water invoices arrive from five different countries, in five different formats, in five different dialects of the same financial language. Each one had to be opened, read, transcribed into the operations system and reconciled against meter data, by hand. The current process took two full-time operations consultants and still produced enough errors to undermine downstream reporting.

My role was discovery, UX lead and AI integration evaluation: interview the consultants who run the current process, model what an automated pipeline actually needs to do, and design the human-in-the-loop interface that surfaces what the AI can’t solve on its own. Engineering then built against the requirements that came out of the design phase.

I used Claude to structure the initial competitive comparison before primary research began, so discovery conversations could focus on what Crestfield specifically needed rather than what the market already offered.

Plio is mid-build. The discovery work (research, problem framing, architecture, screens) is shipped. Engineering is building the pipeline against it now.
86%
Parsing accuracy of the hybrid, the approach chosen from the three evaluated
80/20
Split: 80% of invoices auto-cleared, 20% routed to human review
3
Extraction approaches evaluated before the pipeline design was finalised
5
Countries tested: UK, Germany, US, Italy, Spain
Problem

Invoices in five dialects, read by two people, every month, until the volume broke the process.

Four problems tangled into one. None of them was “the AI can’t read PDFs.” All of them were about what happens when reading fails halfway.

Where it hurt

Invoices arrived in five country-specific formats: different date conventions, different VAT vocabularies, different layouts, different fields entirely. A solution had to handle five dialects of one document type without five separate pipelines.

Who felt it

Two full-time operations consultants read invoices by hand every month. The work was repetitive, error-prone and impossible to scale as the portfolio grew. Their attention was being spent on transcription instead of validation.

The trap

Off-the-shelf OCR produced confident wrong answers, extracting a field that looked plausible but wasn’t. Worse than a missed field is one that’s wrong but reads as correct, because it disappears into downstream reporting without flagging.

The hardest case

Some invoices covered a hundred meters across a building portfolio. Existing tools assumed one row per meter and broke silently on these, attributing consumption to the wrong asset or losing whole meters entirely.

Research

Six conversations, three engineering experiments, one shared answer: be honest about confidence.

A short discovery loop: competitive landscape first to focus the conversation, then primary interviews with the people who do the work today, then a structured engineering hackathon that pitted three extraction approaches against the same real invoices.

Q1

What does an operations consultant actually do when they sit down to process a batch of invoices, and which steps are the ones they don’t want a machine touching?

Q2

Which of the three obvious extraction approaches (OCR + rules, LLM parsing, or a hybrid) handles the messy edge cases best, and at what cost?

Q3

How does an extraction system tell its user “I’m not sure” in a way that’s actually useful, instead of just generating a confidence number nobody reads?

What we heard, from operations consultants

Six interviews across the team that currently processes invoices. The consistent signal was that the bottleneck isn’t reading, it’s checking. Consultants told us they trusted their own reading because they could see the page. They didn’t trust a tool that returned an answer without showing the source.

“I don’t want a system that does my job. I want a system that does the typing, points to the page when it’s not sure, and lets me decide. Anything else, I’ll just re-do its work to double-check it, which is worse than doing it myself.”
Operations consultant, Crestfield Energy

That sentence became the design principle: override, never overrule. The system extracts. The human confirms. The interface always shows the source alongside the answer.

Three extraction approaches, one structured engineering hackathon

Engineering ran a hackathon evaluating three approaches against the same test set: pure OCR with rule-based field extraction, an LLM-led parse, and a hybrid that uses OCR for the layout-stable fields and an LLM for the ambiguous ones. Each approach ran against the same 120 invoices across five countries. The hybrid won at 86%, but more importantly, it was the only approach that knew when it was wrong.

The percentage matters less than the shape of the failures. OCR failed predictably (whole field unreadable). LLM failed by hallucinating plausibly. Hybrid failed by flagging, which is the only kind of failure a human-in-the-loop tool can actually handle.

OCR + rules · layout-driven
Blind: no doubt signal. Fails on an unreadable field, with no way to say it is unsure.
LLM parse · most fluent reader
Fluent, but invents answers. Fails by hallucinating a plausible answer with no signal it is wrong.
Hybrid · OCR + LLM, per field
86% parsing accuracy. Fails by flagging, the only failure a human-in-the-loop tool can act on.

The vendor fingerprint, an unplanned finding

During the hackathon we noticed extraction accuracy correlated tightly with how often we’d seen a given supplier’s invoice template. Northwind’s layout was machine-learned after eight invoices. A new supplier’s first invoice always scored lower. The implication: parse rates compound. A “vendor fingerprint”, a recognised supplier profile, became a first-class feature, surfaced in the UI so users could see why a field was rated the way it was.

Discovery

From competing extraction concepts to a shipped specification.

Five iterations of the extraction surface, six iterations of the review queue, and a confidence-grading scheme that survived contact with engineering.

Three competing concepts, and why the hybrid was the one worth designing for

The hackathon had given me three working extraction approaches to design around: pure OCR with rules, an LLM-led parse, and a hybrid. On raw accuracy they were close enough that a percentage alone wouldn't have decided it. What decided it was the shape of the failures. OCR failed where you'd expect, on unreadable fields, but had no way to express doubt. The LLM was the most fluent reader and the most dangerous, because it failed by inventing a plausible answer with no signal that it had. The hybrid was the only concept that produced a failure a human could act on: it flagged. I designed for the hybrid not because it scored highest but because it was the only one honest enough to hand work back.

The confidence-grading scheme, and how it survived engineering

A confidence number is useless if nobody reads it, so I tied confidence to behaviour instead of display. Three bands, each with a destination: High lands silently into the operations system, Low routes to the review queue, Failed or empty routes to manual entry. The thresholds weren't mine to assert, they were a negotiation. Engineering pushed back hard on where the lines sat, because every percentage point you move the boundary trades reviewer time against the risk of a wrong number slipping through silently. We set the bands against the real distribution of scores from the test invoices rather than round numbers, then agreed they'd move as the model learned. The scheme held because it was a routing rule the pipeline could enforce, not a label the user had to interpret.

High · above the upper band · lands silently in the operations system
Low · between the bands · routes to the review queue
Failed / empty · below the lower band · routes to manual entry
Confidence is a route, not a badge

Every band maps to a destination, auto-accept, review, or manual. The score does something instead of just being shown.

A flag beats a guess

The system is allowed to say it doesn't know. A designed failure state is safer than a confident wrong answer disappearing into a report.
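
The routing rule described above (every band maps to a destination) can be sketched in a few lines of Python. This is a minimal illustration, not Plio's actual code: the names `Route` and `route_field` are hypothetical, and the band boundaries are parameters because the case study says they were set against real score distributions and are expected to move.

```python
from enum import Enum
from typing import Optional


class Route(Enum):
    AUTO_ACCEPT = "lands silently in the operations system"
    REVIEW_QUEUE = "routes to the review queue"
    MANUAL_ENTRY = "routes to manual entry"


def route_field(score: Optional[float], lower: float, upper: float) -> Route:
    """Map a confidence score (0-100, or None for an empty read) to a destination.

    `lower` and `upper` are the band boundaries. Treating `upper` as inclusive
    for the High band is an assumption of this sketch.
    """
    if score is None or score < lower:
        return Route.MANUAL_ENTRY   # Failed / empty: below the lower band
    if score < upper:
        return Route.REVIEW_QUEUE   # Low: between the bands
    return Route.AUTO_ACCEPT        # High: at or above the upper band
```

With illustrative boundaries of 70 and 90, `route_field(97, 70, 90)` auto-accepts, `route_field(81, 70, 90)` goes to review, and `route_field(None, 70, 90)` goes to manual entry. The point of the shape is that the score is consumed by the pipeline, not left for the user to interpret.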

A review queue, not a chase for 100% automation

The obvious goal was full automation, and it was the wrong one. The consultants had told us plainly that a tool they couldn't trust just doubles the work, because they re-check everything it touches. So instead of chasing the last fraction of accuracy I designed the pipeline to be deliberately incomplete: it resolves what it's sure of and routes the rest, around one in seven invoices, into a queue sorted by the cost of getting them wrong. That kept the consultants doing the part of the job they actually valued, validation and judgement, and let the machine take the typing. Being honest by design about what it can't read ended up being the feature, not the limitation. It also meant the schema had to generalise across the five country dialects without five separate pipelines, because the queue, not a per-country rulebook, absorbed whatever the shared model couldn't confidently place.
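
The case study does not define "the cost of getting them wrong". One hedged reading, assuming cost grows with the invoice amount and with doubt in the weakest field, is sketched below. `QueuedInvoice`, `expected_cost` and the sample data are all hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class QueuedInvoice:
    invoice_id: str
    amount: float          # invoice total, assumed already in one currency
    min_confidence: float  # lowest field confidence on the invoice, 0.0-1.0


def expected_cost(item: QueuedInvoice) -> float:
    # Assumed proxy: money at stake times the chance the weakest field is wrong.
    return item.amount * (1.0 - item.min_confidence)


def sort_queue(items: List[QueuedInvoice]) -> List[QueuedInvoice]:
    # Highest expected cost first, so reviewer time goes where the risk is.
    return sorted(items, key=expected_cost, reverse=True)
```

Under this proxy a large invoice with one slightly doubtful field can outrank a small invoice that is mostly unreadable, which is the trade the sort order is meant to make visible.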

Wireframes

Wireframing the extraction pipeline

How the upload, review queue and confidence-graded results came together before the final UI.

Copy-paste, re-keying, silent OCR.

Finance teams were receiving hundreds of invoices a week (PDFs, scans, phone photos, supplier spreadsheets) and re-typing every number into an ERP by hand. The tools that "did OCR" either dumped raw text or, worse, returned a clean-looking answer that was wrong and never said so. Before any pixels, we mapped the volume, the hands, and where trust broke.

v0 · as-is
How invoices got processed
workflow audit
TWO WINDOWS, ONE HUMAN, ALL DAY
invoice_8812.pdf
Total £4,226.71
ERP · new entry
Supplier | typing…
Amount _____
VAT ref _____
Period _____
WHAT THE "OCR" TOOLS DID
A) raw text dump, re-key anyway
B) clean answer, silently wrong
A wrong total posts straight to the ledger. Nobody knows until reconciliation.
The source and the data never sat next to each other. Every check meant alt-tabbing.
Manual keying, roughly 3 to 4 min per invoice, hundreds a week, transcription errors baked in.
Existing OCR was all-or-nothing: no per-field signal, no "I'm not sure about this one".
User quote: "I trust it least when it looks most finished."
v0 · users
Who an invoice passes through
personas
Inside finance ops, the invoice passes through, in order
01 · AP clerk
Keys every field, owns the volume
02 · Reviewer
Spot-checks, catches the bad ones
03 · Finance manager
Approves, owns the numbers
04 · Auditor
Asks "where did this number come from?"
Outside · upstream
SUP · Suppliers
Every one a different layout & locale
PM · Property mgr
One bill, many meters
SYS · ERP / ledger
Wants a clean schema
One invoice, many checks. The old platform modelled extraction as a single step. In reality it crosses seven roles before it's trusted.
The clerk wants speed; the manager wants correctness; the auditor wants provenance. One screen has to serve all three.
Suppliers are the wild card, no two invoices alike, five-plus countries, comma-vs-dot decimals.

The whole product hangs off one idea: score every field.

This came from shadowing AP teams across four finance orgs. Rather than chase three separate goals, we committed to a single model, a confidence score on every extracted value, and let three principles fall out of where a value lands on it.

≥ 95% confidence · Confident · auto-clears, no human touch
70 – 95% · Low · flagged amber for a quick check
< 70% · Fail · routed to a person, source in view
PRINCIPLE 01 · Absorb the whole inbox. Any format in: PDFs, photos, scans. Never make AP sort or convert before the work starts.
PRINCIPLE 02 · Be honest about uncertainty. Every field carries its score. A confident wrong answer is worse than an admitted "I'm not sure."
PRINCIPLE 03 · Spend humans only on judgment. The confident majority clears itself; only the ambiguous tail ever reaches a person.

The as-is invoice journey: read it, key it, hope reconciliation catches the rest.

Stages: Receive → Read → Key in → Review → Post & reconcile
AP clerk: downloads from inbox, sorts by format → reads PDF / scan / photo by eye → re-types every field into the ERP → fixes whatever the reviewer flags → submits the batch
Reviewer / Mgr: re-opens the PDF to check key totals · eyeballs a sample, misses the rest · approves on trust
System: no record of the source doc · OCR returns text or a silent guess · errors surface weeks later at month-end

Decision: Confidence is a field, not a footnote.

The biggest shift wasn't accuracy, it was making uncertainty visible. Every extracted value carries a confidence score that drives the UI: high auto-accepts, low gets flagged, fail gets escalated.

That single idea reshaped the data model. An extracted field became {value, confidence, source-region}, not just a string, and everything downstream attaches to it.
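
The {value, confidence, source-region} shape can be written down directly. The sketch below is one possible Python rendering. The class names, the bounding-box layout of `SourceRegion`, and the 0-100 confidence scale are assumptions for illustration; the case study only names the three parts.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass(frozen=True)
class SourceRegion:
    """Where on the document a value was read from (assumed bounding-box layout)."""
    page: int
    x: float
    y: float
    width: float
    height: float


@dataclass(frozen=True)
class ExtractedField:
    value: Optional[str]                    # None when nothing could be read
    confidence: float                       # assumed 0-100 scale
    source_region: Optional[SourceRegion]   # None when no region was located


# An invoice becomes a mapping of field name to ExtractedField, not to a bare string.
invoice: Dict[str, ExtractedField] = {
    "total": ExtractedField("£4,226.71", 92.0, SourceRegion(1, 410.0, 655.0, 90.0, 14.0)),
    "vat_ref": ExtractedField(None, 0.0, None),
}
```

Because the region travels with the value, any downstream screen can point back at the page without a second lookup.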

Tradeoff: Honest uncertainty demos worse, and sells better.

A black-box "100% done ✓" looks great in a sales demo. Showing amber and red on the same screen looks less magical. We chose the honest version because the people who live in the tool stop trusting anything that overstates itself.

The compromise: lo-fi wireframes early, so stakeholders could feel the confidence model before engineering committed to scoring.

One file at a time, to the whole inbox.

Four passes. The first matched the old tools, one PDF, one upload. Each pass collided with the same reality: invoices arrive in batches, in every format, and the moment you make someone sort or convert them first, you've lost. The shipped intake takes the lot and normalizes after.

v1
Single-file upload
inherited
Drop a PDF here
.pdf only · max 1 file
Upload & extract
Real AP teams have 200+ invoices waiting. One-at-a-time is a non-starter.
PDF-only rejected scans, photos and supplier spreadsheets, i.e. half the inbox.
v2
Multi-select list
band-aid
PDF
PDF
JPG
rejected
XLS
rejected
"why won't it take my scan?"
Batches at last, multiple files in one go.
Still gatekept by format. Users had to convert scans/sheets to PDF first.
v3
A zone per format
over-engineered
PDFs
Scans
Images
Sheets
drop PDFs here
…then repeat for the other 3 tabs
Every format finally supported.
Four separate drops for one batch. Users don't think in formats, they think "these invoices."
final
One zone · any format · whole batch
shipped
PDF Scan Image XLSX
12 invoices · 4.2 MB
format auto-detected · no sorting needed
Parsing started, 12 queued, ~60 fields
Accept everything. Detect the format server-side, never at the door.
One dropzone swallows the whole inbox, mixed formats, one action.
Format detection moved off the user and onto the parser.
Live parse progress plus a count toast set the expectation: this is a batch tool.

The pivot: Stop asking what format. Accept it, then normalize.

Between v2 and v3 we kept solving the wrong problem: how to sort formats at intake. The reframe: the user should never see the format question at all.

The dropzone takes anything; a detection step downstream routes PDFs, scans, images and sheets to the right parser. The seam is invisible to the person dropping files.

Tradeoff: "Accept anything" pushes cost server-side.

Per-format zones would have been cheaper to build, each parser owns its own entry. We took the harder route so the human-facing surface stayed a single, dumb, forgiving box.

Onboarding a new format now means adding a detector plus parser behind the seam, with zero change to the intake screen.
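
The "detector plus parser behind the seam" idea can be sketched as a small registry. This is an illustration under stated assumptions, not the shipped intake: `register_format`, `ingest` and the stub parsers are hypothetical names, and detection here uses only well-known file signatures.

```python
from typing import Callable, List, Optional, Tuple

Detector = Callable[[bytes], bool]
Parser = Callable[[bytes], str]

# Ordered list of (format name, detector, parser). The intake screen never sees this.
REGISTRY: List[Tuple[str, Detector, Parser]] = []


def register_format(name: str, detector: Detector, parser: Parser) -> None:
    REGISTRY.append((name, detector, parser))


def ingest(blob: bytes) -> Tuple[str, Optional[str]]:
    """Route an uploaded file to the first parser whose detector matches."""
    for name, detector, parser in REGISTRY:
        if detector(blob):
            return name, parser(blob)
    # Unsupported input goes to manual handling rather than being forced through.
    return "unsupported", None


def stub_parser(label: str) -> Parser:
    # Placeholder standing in for a real parser.
    return lambda blob: f"{label}: {len(blob)} bytes"


register_format("pdf", lambda b: b.startswith(b"%PDF-"), stub_parser("pdf"))
register_format("png", lambda b: b.startswith(b"\x89PNG\r\n\x1a\n"), stub_parser("image"))
register_format("jpeg", lambda b: b.startswith(b"\xff\xd8\xff"), stub_parser("image"))
# XLSX is a ZIP container; a real detector would inspect the archive contents,
# because this signature alone also matches other ZIP-based files.
register_format("xlsx", lambda b: b.startswith(b"PK\x03\x04"), stub_parser("sheet"))
```

Adding a format is one more `register_format` call; `ingest` and everything in front of it stay unchanged.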

Never a confident wrong answer.

The core of Plio. We had to show not just what was extracted but how sure the machine was, and let that drive what happens next. The fight was over how much uncertainty to surface: hide it for a clean demo, or put green / amber / red on the table where people can act on it.

concept
The confidence model, drawn for the team first
three bands
EVERY FIELD GETS A SCORE 0–100
< 40 · fail · Couldn't read: region unreadable / missing → escalate, never guess. ESCALATE
40–85 · low · Needs review: a plausible read, not certain → route to a human. FLAG
> 85 · high · High confidence: cross-checks agree → auto-accept, post it. AUTO-ACCEPT
One score, three actions. The threshold is the product, not a setting buried in admin.
Score is a blend: OCR plus rules, an LLM parse, and historical agreement. The hybrid wins (see board v4).
v1
Black box
looks magic
Extraction complete
14 fields extracted · 100%
100% of what? It read the VAT ref as the account number and still said "done".
No per-field signal, the exact failure mode from the as-is OCR tools.
Demos beautifully, then quietly posts a wrong total to the ledger.
v2
Flat field dump
over-correct
Supplier
Consumption
Meter
Period
VAT ref
Every field shown, no more hidden black box.
All fields look equal. Which one do I check first? No triage signal at all.
v3
A wall of percentages
noisy
97%
95%
92%
58%
22%
is 92% good? is 58% bad? a number isn't a decision.
Confidence is finally per-field.
Raw numbers make the user do the thresholding in their head, every row.
final
Meters + bands + replay
shipped
Supplier
High
Consumption
High
Billing period
Review
VAT ref
Failed
3 auto-accepted · 2 flagged for review
The score is the UI. Colour plus band tell you where to look without reading a number.
Meter, band and label give triage at a glance; the eye lands on amber and red first.
"Replay" lets any field trace back to the exact region on the page it came from.
The summary line makes the hand-off explicit: most auto-clear, a few need you.

Decision: Bands over numbers. The threshold ships in the UI.

Raw percentages put the cognitive load on the user. We baked the thresholds into three bands with fixed colours, so the screen answers "do I need to look at this?" before anyone reads a digit.

Replay was the trust unlock: being able to see where on the page a value came from turned "do I believe this number" into a one-glance check.

Tradeoff: Showing failures looks worse, builds trust.

Sales wanted the green-only view. We held the line on surfacing amber and red because the daily users, the people whose name is on the posting, only trust a tool that admits what it can't read.

Comp: one scoring component renders the meter, band and replay link; every screen downstream reads the same {value, confidence, source} shape.

Invoice left, data right, always.

When the machine isn't sure, a human decides. The whole question of the review screen was: how do you let someone correct a field without losing sight of the document it came from? Three layouts hid the source in some way. The fourth never does.

v1
Edit in a form
no source
Supplier · Northwind…
Period · 01.07 to 31.07 ?
Total · £4,226.71
Save corrections
to check the period I have to go find the PDF in another tab.
Correcting a field with no document means guessing, or alt-tabbing to the source.
Recreates the exact two-window problem from the as-is.
v2
Modal over the doc
covers context
Fix: Billing period
01 Jul to 31 Jul 2024
the popup hides the line I need to read.
Source and field on one screen, at last.
The modal covers the very region the value came from. Move it and you cover something else.
v3
Stacked: doc over fields
endless scroll
document
fields
Period · 01 Jul to 31 Jul
Nothing is covered, both regions are present.
Vertical stacking means scrolling up to the doc, down to the field, repeat, for every fix.
final
Split-screen review
shipped
Plio · Review
Awaiting review
SOURCE
EXTRACTED
Supplier ✓
Period → fixed
Total ✓
Confirm field → state flips to ✓ Confirmed
fix it where you see it. tag flips amber to green, badge to Confirmed.
Source pinned left, fields right, never covered, never scrolled apart.
Inline correction: edit, confidence flips to high, header badge moves to Confirmed.
Highlight links the focused field to its exact region on the invoice.

And the queue that decides which invoices ever reach a human: 80% auto-clear, 20% route to judgment.

THIS BATCH · 47
44
auto-cleared
3
need you
YOUR QUEUE · 3 EXCEPTIONS
Meridian Logistics GmbH
RESOLVED
2
Voltex Industries SA · PO reference unclear
Approve
3
Nordic Power AS · multi-meter split
QUEUED
Work the exceptions one by one. The other 44 already posted.

Decision: Override, never overrule.

The human is the authority, not a rubber stamp. Plio proposes; the reviewer confirms or corrects. The interface makes correction the fast path and never auto-posts a flagged field on the user's behalf.

The split-screen is non-negotiable: the document and the data are always co-present, because every correction is really a question about the source.
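
"Override, never overrule" implies a small state machine: a flagged field only leaves that state through a reviewer, and nothing posts while a flag is open. A minimal sketch follows, with hypothetical names (`FieldState`, `ReviewField`, `confirm`, `can_post`) and an assumed rule that an invoice posts only when no field is still flagged.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List, Optional


class FieldState(Enum):
    HIGH = "high"            # auto-accepted by the pipeline
    FLAGGED = "flagged"      # waiting for a reviewer
    CONFIRMED = "confirmed"  # a reviewer accepted or corrected it


@dataclass
class ReviewField:
    name: str
    proposed: str                 # what the pipeline extracted
    state: FieldState
    final: Optional[str] = None   # what the reviewer settled on


def confirm(field: ReviewField, correction: Optional[str] = None) -> ReviewField:
    """Reviewer confirms the proposal as-is, or overrides it with a correction."""
    field.final = correction if correction is not None else field.proposed
    field.state = FieldState.CONFIRMED
    return field


def can_post(fields: List[ReviewField]) -> bool:
    """Assumed rule: an invoice posts only when no field is still flagged."""
    return all(f.state is not FieldState.FLAGGED for f in fields)
```

Note that `confirm` keeps `proposed` untouched, so the original extraction stays available next to the reviewer's value.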

Tradeoff: Split-screen costs horizontal space.

On a laptop the two panes are tight, and a phone can't hold both. We shipped desktop-first and deferred a mobile review mode rather than compromise the side-by-side on the surface where the work actually happens.

The queue does the heavy lifting: by routing only around 20% to a human, the expensive split-screen is only ever opened for the invoices that earn it.

Show the limits. Learn the rest.

Two halves of the same maturity story. Failure: make the limits legible, so a half-read invoice is a glance not an investigation. Scale: split one bill across a hundred meters, learn each vendor's layout, and normalize every country's dialect into one schema.

final
Failure anatomy, the doc is the dashboard
legible failure
Meridian Logistics GmbH
VAT DE812…
INV-0091
14. März 2024
EUR acct
PO-REF: [unclear]
Net €12.500
VAT €2,375
IBAN DE44 ████ 00
Total €14,875
9
clean
2
review
2
failed
Confidence is painted onto the invoice itself, failure is a shape, not a report.
A reviewer triages a half-read doc in one look: go straight to the red cells.
final
Meter attribution, one bill, many meters
utilities
INVOICE TOTAL
482,600 kWh
100 meters · auto-split
Tower A · Floors 1 to 4
38,200 kWh
Tower B · Retail units
27,910 kWh
+ 98 more · attributed by EAN match
Attribution treated as a first-class problem, not a CSV export.
EAN match plus historical pattern splits a single bill across every meter automatically.
final
Vendor fingerprint, recognition compounds
learns
Siemens AG
47 inv · DE · 98%
Nexport Ltd
3 inv · GB · 62%
SIEMENS · FIELD PARSE RATES
Amount
Date
PO ref
new vendor means cold start. by invoice 40, near-perfect.
Every supplier builds a profile from repeated invoices, parse rates climb with volume.
A weak fingerprint flags itself: a 3-invoice vendor is honestly amber, not falsely green.
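
A fingerprint that "flags itself" needs to track volume as well as parse rate. The sketch below is one way to express that; `VendorFingerprint` is a hypothetical name, and the `min_invoices` and `min_rate` thresholds are illustrative assumptions, not values from the project.

```python
from dataclasses import dataclass


@dataclass
class VendorFingerprint:
    vendor: str
    invoices_seen: int = 0
    fields_parsed: int = 0
    fields_total: int = 0

    def record(self, parsed: int, total: int) -> None:
        """Fold one processed invoice into the profile."""
        self.invoices_seen += 1
        self.fields_parsed += parsed
        self.fields_total += total

    @property
    def parse_rate(self) -> float:
        return self.fields_parsed / self.fields_total if self.fields_total else 0.0

    def band(self, min_invoices: int = 8, min_rate: float = 0.9) -> str:
        # Too little history is amber regardless of how well those invoices parsed.
        if self.invoices_seen < min_invoices:
            return "amber"
        return "green" if self.parse_rate >= min_rate else "amber"
```

The volume check comes first on purpose: three perfectly parsed invoices still read amber, which is the cold-start honesty described above.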
final
Country dialect, five locales, one schema
multi-locale
DE FR GB NL ES
RAW FROM DOC
14. März 2024
€ 18.450,00
MwSt. 19%
MAPPED BY PLIO
2024-03-14
18450.00 EUR
VAT_DE 19%
Comma decimal · German month name · Mehrwertsteuer, one ISO schema
Comma-vs-dot, DD/MM-vs-MM/DD, MwSt/TVA/IVA, all normalized to one system-ready schema.
The raw value is kept beside the mapped one, so an auditor can always see the original.
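
The German example above can be reproduced with two small parsers. This is a hedged sketch covering only the German date and amount conventions shown on the board; the function names are hypothetical, and returning `None` on anything unrecognised is an assumption in the spirit of "escalate, never guess".

```python
import re
from datetime import date
from decimal import Decimal
from typing import Optional

GERMAN_MONTHS = {
    "Januar": 1, "Februar": 2, "März": 3, "April": 4, "Mai": 5, "Juni": 6,
    "Juli": 7, "August": 8, "September": 9, "Oktober": 10, "November": 11,
    "Dezember": 12,
}


def parse_german_date(raw: str) -> Optional[date]:
    """Parse dates like '14. März 2024'. Returns None instead of guessing."""
    m = re.fullmatch(r"\s*(\d{1,2})\.\s*([A-Za-zäöüÄÖÜ]+)\s+(\d{4})\s*", raw)
    if not m or m.group(2) not in GERMAN_MONTHS:
        return None
    try:
        return date(int(m.group(3)), GERMAN_MONTHS[m.group(2)], int(m.group(1)))
    except ValueError:  # e.g. day 31 in a 30-day month
        return None


def parse_german_amount(raw: str) -> Optional[Decimal]:
    """Parse amounts like '€ 18.450,00' (dot thousands, comma decimals)."""
    m = re.fullmatch(r"\s*€?\s*(\d{1,3}(?:\.\d{3})*),(\d{2})\s*", raw)
    if not m:
        return None
    return Decimal(m.group(1).replace(".", "") + "." + m.group(2))


def normalize(raw_date: str, raw_amount: str) -> dict:
    # Keep the raw value beside the mapped one so the original stays auditable.
    return {
        "date": {"raw": raw_date, "mapped": parse_german_date(raw_date)},
        "amount": {"raw": raw_amount, "mapped": parse_german_amount(raw_amount)},
    }
```

Using `Decimal` rather than `float` keeps the mapped amount exact, which matters once the value is reconciled against a ledger.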

Decision: Make the limits as visible as the wins.

Failure anatomy and the honest fingerprint come from the same belief as the confidence model: the tool earns trust by being clear about what it doesn't know, a red cell, an amber vendor, not by hiding it.

Attribution and dialect are where that honesty meets scale: hard, unglamorous problems (100-meter splits, five tax regimes) treated as first-class, not afterthoughts.

Tradeoff: Per-vendor learning means a real cold start.

Recognition that compounds is only powerful once a vendor has volume. The first invoice from a new supplier is genuinely the weakest, and we chose to show that (amber fingerprint) rather than fake confidence on day one.

Comp: the fingerprint, anatomy and dialect views all read the same per-field confidence shape, the scoring component pays off four screens later.

The decisions that shaped the rest.

Where we held the line, where we bent, and what we cut from scope. Every row here represents a debate that's still defensible today.

Decision · What we picked · What we gave up · Why

Intake
Kept: One dropzone, any format, whole batch, detect & route server-side.
Cut: PDF-only upload; per-format tabs that made users sort their own inbox.
Why: Users think in "these invoices," never in formats. The door has to be dumb and forgiving.

Confidence
Kept: A score on every field, surfaced as three colour bands that drive the action.
Cut: The black-box "100% done ✓" that demos beautifully and posts silent errors.
Why: A confident wrong answer is the worst answer. Honest uncertainty is the whole product.

Score display
Kept: Bands plus meters plus labels (High / Review / Failed).
Cut: Raw percentages that make the user threshold every row in their head.
Why: The screen should answer "do I need to look at this?" before anyone reads a digit.

Review layout
Kept: Split-screen, source pinned left, fields right, always co-present.
Cut: Form-only, modal-over-doc, and vertical stacking, each of which hid the source somehow.
Why: Every correction is a question about the document. You can't answer it if you can't see it.

Authority
Kept: Override, never overrule. Plio proposes, the human confirms or corrects.
Deferred: Fully auto-posting flagged fields without a person in the loop.
Why: The numbers post to a ledger with someone's name on them. Judgment stays human for v1.

Queue split
Kept: Auto-clear the confident ~80%; route only exceptions to a reviewer.
Cut: Sending every invoice through human review "to be safe."
Why: Reviewing what's already certain wastes the one scarce resource, human attention.

Failure display
Kept: Paint confidence onto the document itself, green/amber/red cells.
Cut: A separate "errors" report nobody opens.
Why: A half-read invoice should be triaged in one glance, on the artifact people already read.

Vendor learning
Kept: Per-vendor fingerprints that compound; honest cold-start on new suppliers.
Cut: Faking high confidence on a vendor's first invoice.
Why: A 3-invoice vendor that reads amber is telling the truth. That's the brand.

Locale
Kept: Five locales normalized to one ISO schema, raw value retained beside it.
Deferred: Long tail of smaller markets to a later release.
Why: DE/FR/GB/NL/ES cover the bulk of European AP volume. Ship those solid, signal the rest.

Mobile review
Kept: Desktop-first split-screen where the work actually happens.
Deferred: A phone review mode that can't hold source plus fields together.
Why: Better to do one surface honestly than two badly. The queue keeps desktop load small.

If we did it again.

Three things we'd front-load on the next engagement with the same shape of problem.

retro · 01
Score before you style.
The {value, confidence, source} field shape paid off on every screen. We'd model that on day one, not after the first UI.
retro · 02
Honesty is the feature.
The hardest argument was keeping amber and red on screen. We'd write "never a confident wrong answer" as a principle before scoping anything.
retro · 03
Spend humans on judgment.
The queue's 80/20 split was the unlock. We'd design the auto-clear threshold before the review screen next time, not after.
Final UI

Four moments that did the work, once the AI was honest about its uncertainty.

Confidence-graded extraction, a split-screen review queue, attribution across multi-meter invoices, and a vendor-fingerprint surface that explained why a field was rated the way it was.

Extraction confidence
Every field carries the confidence behind it
Each extracted value sits next to a confidence score and the rule that produced it. The page reads as a document the user can audit, not a black box.
plio / extract / INV-9017
Northern Power Ltd · 12 Trinity Way · Leeds LS1 4DJ
INVOICE NP-2025-04-1283
Billing period
01 Mar – 31 Mar 2025
Charges
Energy supply (18,420 kWh @ 0.142) 2,615.64
Standing charge (31 days) 14.88
Climate change levy 76.84
VAT ? -
Total due £2,847.92
Extracted fields
6 of 7
Vendor
Northern Power Ltd
99% High
Invoice no.
NP-2025-04-1283
97% High
Period
01 Mar – 31 Mar 2025
94% High
Total kWh
18,420
86% High
Total amount
£2,847.92
81% Low
VAT %
- not found
0% Failed
4 auto · 1 review · 1 manual · routed
What you’re looking at
1. Three-tier confidence. High (≥90%) lands silently. Low (70–89%) routes to review. Failed (<70% or empty) routes to manual.
2. Provenance per field. Hover any field and the source rule + the original character range in the invoice surface together.
3. Failures are first-class. A red or empty field is a designed state, not an error. The number on the report can survive a question from finance.
Review queue
Humans touch only the exceptions
The queue lists the 14% of invoices the pipeline can’t fully resolve, sorted by the cost of getting them wrong. One screen, one decision, one click.
plio / queue / needs review
14% routed for review
93 invoices · 8 reviewers active
INV-9018
Iberdrola · ES
Low confidence on rate band, three candidate tariffs match.
Maria Lopez · acct exec
INV-9019
Vattenfall · DE
Currency missing, EUR or SEK ambiguous on page 1.
Jonas Weber · acct exec
INV-9020
British Gas · UK
Two meter readings detected in a single table layout.
Sarah Chen · acct exec
What you’re looking at
1. Sorted by stakes. Highest-amount invoices and lowest-confidence fields surface first. Time goes where the risk is.
2. One screen, one decision. Accept writes the value to the report. Edit opens an inline field. Reject sends to manual handling.
3. Audit-grade trail. Every accept, edit and reject logs the user, the original extracted value, the chosen value and the reason code.
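
The audit trail described in the last point maps onto an append-only log with one record per reviewer action. The sketch below is illustrative: `AuditEntry` and `AuditLog` are hypothetical names, and the UTC timestamp is an added assumption, since the case study lists only user, original value, chosen value and reason code.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import List, Optional, Tuple

ALLOWED_ACTIONS = {"accept", "edit", "reject"}


@dataclass(frozen=True)
class AuditEntry:
    user: str
    action: str                   # "accept", "edit" or "reject"
    original_value: Optional[str]
    chosen_value: Optional[str]   # None on reject, when no value is chosen
    reason_code: str
    at: datetime                  # assumption: recorded in UTC


class AuditLog:
    def __init__(self) -> None:
        self._entries: List[AuditEntry] = []

    def record(self, user: str, action: str, original_value: Optional[str],
               chosen_value: Optional[str], reason_code: str) -> AuditEntry:
        if action not in ALLOWED_ACTIONS:
            raise ValueError(f"unknown action: {action}")
        entry = AuditEntry(user, action, original_value, chosen_value,
                           reason_code, datetime.now(timezone.utc))
        self._entries.append(entry)
        return entry

    def entries(self) -> Tuple[AuditEntry, ...]:
        # Return an immutable copy so callers cannot rewrite history.
        return tuple(self._entries)
```

Frozen entries and a read-only view are what make the trail usable as evidence: a reviewer can add to the log but not alter what is already in it.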
Meter attribution
One invoice. A hundred meters.
Multi-meter invoices broke every per-row assumption in the legacy pipeline. Plio resolves them into a table where each meter gets its own row, traceable back to its line in the source.
plio / extract / INV-9027 / meters
Source: Page 3, line 14 · British Gas · multi-site supply summary
Meter ID | Site | Period | kWh | Charge | Match
M-A104 | Exchange House · Floor 1 | Mar 2025 | 4,820 | £684.44 | Auto · 98%
M-A105 | Exchange House · Floor 2 | Mar 2025 | 5,140 | £729.88 | Auto · 96%
M-A106 | Exchange House · Floor 3 | Mar 2025 | 4,990 | £708.58 | Auto · 94%
M-A107 | Exchange House · Basement | Mar 2025 | 3,210 | £455.82 | Auto · 87%
M-A108 | Exchange House · Plant room | Mar 2025 | 6,420 | £911.64 | Auto · 99%
M-A109 | Exchange House · Common | Mar 2025 | 4,360 | £619.12 | Manual
Total · 6 meters | | | 28,940 | £4,109.48 |
What you’re looking at
1. Built for bulk. A single 12-page invoice can fan out into dozens of rows, each anchored to its source line.
2. Confidence per meter. A clean overall invoice can still have one ambiguous meter. The chip exposes it so the report doesn’t average ambiguity away.
3. Reconciliation lives here. This view is also the reconciliation surface for the finance team, same data, same source rules, no second tool.
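
Because this view doubles as the reconciliation surface, one check it implies is that the per-meter rows add back up to the invoice totals. A minimal Python sketch using the six rows shown in the table; the `reconcile` function and the tuple layout are illustrative, not Plio's actual code.

```python
from decimal import Decimal
from typing import List, Tuple

# (meter id, kWh, charge in GBP), copied from the INV-9027 meter table.
METER_ROWS: List[Tuple[str, int, Decimal]] = [
    ("M-A104", 4820, Decimal("684.44")),
    ("M-A105", 5140, Decimal("729.88")),
    ("M-A106", 4990, Decimal("708.58")),
    ("M-A107", 3210, Decimal("455.82")),
    ("M-A108", 6420, Decimal("911.64")),
    ("M-A109", 4360, Decimal("619.12")),
]


def reconcile(rows: List[Tuple[str, int, Decimal]],
              total_kwh: int, total_charge: Decimal) -> bool:
    """True when the attributed rows sum exactly to the invoice totals."""
    kwh = sum(row[1] for row in rows)
    charge = sum((row[2] for row in rows), Decimal("0"))
    return kwh == total_kwh and charge == total_charge


# The six rows sum to the stated totals: 28,940 kWh and £4,109.48.
balanced = reconcile(METER_ROWS, 28940, Decimal("4109.48"))
```

A failed check here is a signal that a meter was dropped or double-counted during the split, which is exactly the silent failure the legacy tools produced.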
Impact

Design constraints that held through engineering.

86%
Parsing accuracy of the hybrid, the approach chosen from the three evaluated
80/20
Split: 80% of invoices auto-cleared, 20% routed to human review
3
Extraction approaches evaluated before the pipeline design was finalised
5
Countries tested: UK, Germany, US, Italy, Spain

The most important signal came from the engineering phase: the pre-processing filter, routing unsupported invoices to manual handling rather than forcing them through automation, had been drawn in exactly the right place during design. The model degraded precisely at the boundary the design had already defined.

More case studies

Three other projects, two of them sharing the same product family: the foundation layer Plio plugs into, the analytics surface that sits above it, and a reporting workflow built on the same hierarchy.