Thomas Mbanefo

Decision	What we picked	What we gave up	Why
Intake	Kept: One dropzone, any format, whole batch; detect & route server-side.	Cut: PDF-only upload, and per-format tabs that made users sort their own inbox.	Users think in "these invoices," never in formats. The door has to be dumb and forgiving.
Confidence	Kept: A score on every field, surfaced as three colour bands that drive the action.	Cut: The black-box "100% done ✓" that demos beautifully and posts silent errors.	A confident wrong answer is the worst answer. Honest uncertainty is the whole product.
Score display	Kept: Bands plus meters plus labels (High / Review / Failed).	Cut: Raw percentages that make the user threshold every row in their head.	The screen should answer "do I need to look at this?" before anyone reads a digit.
Review layout	Kept: Split-screen, with the source pinned left and the fields right, always co-present.	Cut: Form-only, modal-over-doc, and vertical stacking; each hid the source somehow.	Every correction is a question about the document. You can't answer it if you can't see it.
Authority	Kept: Override, never overrule. Plio proposes; the human confirms or corrects.	Deferred: Fully auto-posting flagged fields without a person in the loop.	The numbers post to a ledger with someone's name on them. Judgment stays human for v1.
Queue split	Kept: Auto-clear the confident ~80%; route only exceptions to a reviewer.	Cut: Sending every invoice through human review "to be safe."	Reviewing what's already certain wastes the one scarce resource: human attention.
Failure display	Kept: Paint confidence onto the document itself, as green/amber/red cells.	Cut: A separate "errors" report nobody opens.	A half-read invoice should be triaged in one glance, on the artifact people already read.
Vendor learning	Kept: Per-vendor fingerprints that compound; honest cold-start on new suppliers.	Cut: Faking high confidence on a vendor's first invoice.	A 3-invoice vendor that reads amber is telling the truth. That's the brand.
Locale	Kept: Five locales normalized to one ISO schema, with the raw value retained beside it.	Deferred: Long tail of smaller markets to a later release.	DE/FR/GB/NL/ES cover the bulk of European AP volume. Ship those solid, signal the rest.
Mobile review	Kept: Desktop-first split-screen where the work actually happens.	Deferred: A phone review mode that can't hold source plus fields together.	Better to do one surface honestly than two badly. The queue keeps desktop load small.
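
The Locale row describes normalizing locale-specific values into one schema while keeping the raw value beside it. A minimal Python sketch of that idea for amounts only might look like the following. The separator table, the `NormalizedAmount` type and the `normalize_amount` function are assumptions made for illustration; they are not Plio's actual rules or API.

```python
from dataclasses import dataclass
from decimal import Decimal

# Assumed (thousands, decimal) separators per locale; illustrative only.
SEPARATORS = {
    "DE": (".", ","),
    "FR": (" ", ","),
    "GB": (",", "."),
    "NL": (".", ","),
    "ES": (".", ","),
}


@dataclass(frozen=True)
class NormalizedAmount:
    raw: str        # the value exactly as printed on the invoice
    value: Decimal  # the locale-independent amount
    locale: str


def normalize_amount(raw: str, locale: str) -> NormalizedAmount:
    """Parse a printed amount using the locale's separators, keeping the raw text."""
    thousands, decimal_mark = SEPARATORS[locale]
    # Treat no-break spaces as ordinary spaces, then drop currency symbols and padding.
    text = raw.replace("\u00a0", " ").replace("\u202f", " ").strip("£€ ")
    text = text.replace(thousands, "").replace(decimal_mark, ".")
    return NormalizedAmount(raw=raw, value=Decimal(text), locale=locale)


gb = normalize_amount("£2,847.92", "GB")
de = normalize_amount("2.847,92 €", "DE")
print(gb.value == de.value)  # both parse to the same Decimal
print(de.raw)                # the printed form is still available for audit
```

Keeping `raw` next to `value` is what lets a reviewer check the normalized number against what the supplier actually printed.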

Extraction confidence

Every field carries the confidence behind it

Each extracted value sits next to a confidence score and the rule that produced it. The page reads as a document the user can audit, not a black box.

plio / extract / INV-9017

Northern Power Ltd · 12 Trinity Way · Leeds LS1 4DJ

INVOICE NP-2025-04-1283

Billing period

01 Mar – 31 Mar 2025

Charges

Energy supply (18,420 kWh @ 0.142)	2,615.64
Standing charge (31 days)	14.88
Climate change levy	76.84
VAT ?	-
Total due	£2,847.92

Extracted fields

6 of 7

Field	Value	Confidence	Band
Vendor	Northern Power Ltd	99%	High
Invoice no.	NP-2025-04-1283	97%	High
Period	01 Mar – 31 Mar 2025	94%	High
Total kWh	18,420	86%	High
Total amount	£2,847.92	81%	Low
VAT %	- not found	0%	Failed

Routed: 4 auto · 1 review · 1 manual

What you’re looking at

Three-tier confidence. High (≥90%) lands silently. Low (70–89%) routes to review. Failed (<70% or empty) routes to manual.

Provenance per field. Hover any field to see the source rule together with the original character range in the invoice.

Failures are first-class. A red or empty field is a designed state, not an error. The number on the report can survive a question from finance.
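
The three tiers above reduce to a small banding function. The sketch below uses the thresholds stated in the text (High at 90% and above, Low at 70–89%, Failed below 70% or empty); the names `band`, `route` and `ROUTES` are hypothetical and chosen only for this example.

```python
from typing import Optional

HIGH_THRESHOLD = 90  # from the text: High is >= 90%
LOW_THRESHOLD = 70   # from the text: Low is 70-89%

# Where each band sends the field, per the description above.
ROUTES = {"High": "auto", "Low": "review", "Failed": "manual"}


def band(value: Optional[str], confidence: int) -> str:
    """Map an extracted value and its confidence (0-100) to a band."""
    # An empty field fails regardless of the score attached to it.
    if value is None or value.strip() == "" or confidence < LOW_THRESHOLD:
        return "Failed"
    if confidence >= HIGH_THRESHOLD:
        return "High"
    return "Low"


def route(value: Optional[str], confidence: int) -> str:
    """Return the queue a field lands in: auto, review, or manual."""
    return ROUTES[band(value, confidence)]


print(route("Northern Power Ltd", 99))  # auto
print(route("£2,847.92", 81))           # review
print(route(None, 0))                   # manual
```

Treating an empty value as Failed before looking at the score keeps "not found" a designed state, not an exception path.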

Review queue

Humans touch only the exceptions

The queue lists the 14% of invoices the pipeline can’t fully resolve, sorted by the cost of getting them wrong. One screen, one decision, one click.

plio / queue / needs review

14% routed for review

93 invoices · 8 reviewers active

INV-9018

Iberdrola · ES

Low confidence on rate band: three candidate tariffs match.

Maria Lopez · acct exec

INV-9019

Vattenfall · DE

Currency missing: EUR or SEK ambiguous on page 1.

Jonas Weber · acct exec

INV-9020

British Gas · UK

Two meter readings detected in a single table layout.

Sarah Chen · acct exec

What you’re looking at

Sorted by stakes. Highest-amount invoices and lowest-confidence fields surface first. Time goes where the risk is.

One screen, one decision. Accept writes the value to the report. Edit opens an inline field. Reject sends to manual handling.

Audit-grade trail. Every accept, edit and reject logs the user, the original extracted value, the chosen value and the reason code.
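
The ordering and the audit trail described above can be sketched as follows. How amount and confidence are combined into one sort order is not specified in the text, so the key below (amount first, confidence as tie-breaker) is an assumption. The type names, the `record_decision` helper, the user id, the reason code and the amounts in the usage example are all hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from decimal import Decimal
from typing import List, Optional

ACTIONS = {"accept", "edit", "reject"}


@dataclass(frozen=True)
class QueueItem:
    invoice_id: str
    amount: Decimal
    lowest_confidence: int  # weakest field score on the invoice, 0-100


@dataclass(frozen=True)
class AuditEntry:
    invoice_id: str
    field: str
    action: str
    user: str
    original_value: Optional[str]
    chosen_value: Optional[str]
    reason_code: str
    at: datetime


def sort_by_stakes(items: List[QueueItem]) -> List[QueueItem]:
    # Assumed ordering: larger amounts first, then the least confident invoice first.
    return sorted(items, key=lambda i: (-i.amount, i.lowest_confidence))


def record_decision(invoice_id: str, field: str, action: str, user: str,
                    original_value: Optional[str], chosen_value: Optional[str],
                    reason_code: str) -> AuditEntry:
    """Build the log entry for one accept, edit or reject."""
    if action not in ACTIONS:
        raise ValueError(f"unknown action: {action}")
    return AuditEntry(invoice_id, field, action, user, original_value,
                      chosen_value, reason_code, datetime.now(timezone.utc))


# Amounts and scores here are invented for the example.
queue = [
    QueueItem("INV-9018", Decimal("1200.00"), 74),
    QueueItem("INV-9019", Decimal("5400.00"), 0),
    QueueItem("INV-9020", Decimal("5400.00"), 72),
]
print([i.invoice_id for i in sort_by_stakes(queue)])
entry = record_decision("INV-9019", "currency", "edit", "reviewer-1", None, "EUR", "AMBIGUOUS_CURRENCY")
print(entry.action, entry.chosen_value)
```

The entry stores the original extracted value alongside the chosen one, so an edit never erases what the pipeline first proposed.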

Meter attribution

One invoice. A hundred meters.

Multi-meter invoices broke every per-row assumption in the legacy pipeline. Plio resolves them into a table where each meter gets its own row, traceable back to its line in the source.

plio / extract / INV-9027 / meters

Source: Page 3, line 14 · British Gas · multi-site supply summary

Meter ID	Site	Period	kWh	Charge	Match
M-A104	Exchange House · Floor 1	Mar 2025	4,820	£684.44	Auto · 98%
M-A105	Exchange House · Floor 2	Mar 2025	5,140	£729.88	Auto · 96%
M-A106	Exchange House · Floor 3	Mar 2025	4,990	£708.58	Auto · 94%
M-A107	Exchange House · Basement	Mar 2025	3,210	£455.82	Auto · 87%
M-A108	Exchange House · Plant room	Mar 2025	6,420	£911.64	Auto · 99%
M-A109	Exchange House · Common	Mar 2025	4,360	£619.12	Manual
Total · 6 meters			28,940	£4,109.48

What you’re looking at

Built for bulk. A single 12-page invoice can fan out into dozens of rows, each anchored to its source line.

Confidence per meter. A clean overall invoice can still have one ambiguous meter. The chip exposes it so the report doesn’t average ambiguity away.

Reconciliation lives here. This view is also the reconciliation surface for the finance team: same data, same source rules, no second tool.
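
As a rough illustration of the reconciliation check this view supports, the sketch below re-adds the six meter rows from the table and compares them with the printed totals, then lists the meters that still need a person. The `MeterRow` type and the `reconcile` and `needs_attention` functions are hypothetical names; treating a "Manual" match as `None` is an assumption made for the example.

```python
from decimal import Decimal
from typing import List, NamedTuple, Optional


class MeterRow(NamedTuple):
    meter_id: str
    kwh: int
    charge: Decimal
    match_confidence: Optional[int]  # None stands in for a "Manual" match


# Rows copied from the INV-9027 table above.
ROWS = [
    MeterRow("M-A104", 4820, Decimal("684.44"), 98),
    MeterRow("M-A105", 5140, Decimal("729.88"), 96),
    MeterRow("M-A106", 4990, Decimal("708.58"), 94),
    MeterRow("M-A107", 3210, Decimal("455.82"), 87),
    MeterRow("M-A108", 6420, Decimal("911.64"), 99),
    MeterRow("M-A109", 4360, Decimal("619.12"), None),
]


def reconcile(rows: List[MeterRow], total_kwh: int, total_charge: Decimal) -> bool:
    """True when the per-meter rows add up to the invoice totals."""
    kwh_ok = sum(r.kwh for r in rows) == total_kwh
    charge_ok = sum((r.charge for r in rows), Decimal("0")) == total_charge
    return kwh_ok and charge_ok


def needs_attention(rows: List[MeterRow]) -> List[str]:
    """Meters with no automatic match, reported individually instead of averaged."""
    return [r.meter_id for r in rows if r.match_confidence is None]


print(reconcile(ROWS, 28940, Decimal("4109.48")))  # True
print(needs_attention(ROWS))                       # ['M-A109']
```

Using `Decimal` for charges avoids floating-point drift when many rows are summed, which matters once a single invoice fans out into dozens of meters.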

Plio

The brief

Invoices in five dialects, read by two people, every month, until the volume broke the process.

Six conversations, three engineering experiments, one shared answer: be honest about confidence.

What we heard, from operations consultants

Three extraction approaches, one structured engineering hackathon

The vendor fingerprint, an unplanned finding

From competing extraction concepts to a shipped specification.

Three competing concepts, and why the hybrid was the one worth designing for

The confidence-grading scheme, and how it survived engineering

A review queue, not a chase for 100% automation

Wireframing the extraction pipeline

Copy-paste, re-keying, silent OCR.

The whole product hangs off one idea: score every field.

The as-is invoice journey: read it, key it, hope reconciliation catches the rest.

Decision: Confidence is a field, not a footnote.

Tradeoff: Honest uncertainty demos worse, and sells better.

One file at a time, to the whole inbox.

The pivot: Stop asking what format. Accept it, then normalize.

Tradeoff: "Accept anything" pushes cost server-side.

Never a confident wrong answer.

Decision: Bands over numbers. The threshold ships in the UI.

Tradeoff: Showing failures looks worse, but builds trust.

Invoice left, data right, always.

And the queue that decides which invoices even see a human: 80% auto-clears, 20% routes to judgment.

Decision: Override, never overrule.

Tradeoff: Split-screen costs horizontal space.

Show the limits. Learn the rest.

Decision: Make the limits as visible as the wins.

Tradeoff: Per-vendor learning means a real cold start.

The decisions that shaped the rest.

If we did it again.

Score before you style.

Honesty is the feature.

Spend humans on judgment.

Four moments that did the work, once the AI was honest about its uncertainty.

Design constraints that held through engineering.
