
99.9% accuracy explained: how we measure it and why it matters

An engineering post on what 99.9% accuracy actually means: which fields, which document types, which errors remain correctable with human-in-the-loop review, and how we measure it continuously.

Published 20 April 2026 · 4 min read · Tags: accuracy, engineering, quality measurement

“99.9% accuracy” is a number that gets thrown around too casually in the market for AI purchase-order tooling. In this article we explain how OrderPilot’s number is measured, on which document types, and where the limits lie. This is an engineering post, not a marketing pitch.

What we actually measure

Accuracy is often presented as one number but has multiple layers:

  1. Field-level accuracy — per field (supplier, PO number, item code, quantity, price, VAT code, etc.) we compare the AI extraction against the ground truth from our own labelled dataset or human review.
  2. Document-level accuracy — a document counts as “correct” only if every field is correct. This number is by definition lower than field-level.
  3. End-to-end accuracy — the percentage of documents that reach the ERP correctly without human intervention. This is the most honest metric because it also factors in confidence thresholds and automation rules.

When we publish “99.9%” we mean field-level accuracy on established suppliers (after ~30 days of traffic and a handful of corrective reviews). At document level that number sits lower, and at end-to-end it depends on your confidence thresholds.
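
To make the distinction concrete, here is a minimal sketch of how the three levels relate, assuming per-document ground truth is available. The dictionary layout and the auto_posted flag are illustrative, not our actual data model.

```python
# Minimal sketch: the same labelled documents yield three different
# accuracy numbers. Field names and the auto_posted flag are illustrative.

def accuracy_levels(documents):
    """documents: list of dicts with 'extracted' and 'truth' (field -> value)
    plus 'auto_posted' (True if the PO reached the ERP without human touch)."""
    field_total = field_correct = 0
    docs_all_fields_ok = docs_auto_ok = 0

    for doc in documents:
        all_ok = True
        for field, true_value in doc["truth"].items():
            field_total += 1
            if doc["extracted"].get(field) == true_value:
                field_correct += 1
            else:
                all_ok = False
        docs_all_fields_ok += all_ok
        # End-to-end: posted automatically AND every field was right.
        docs_auto_ok += all_ok and doc["auto_posted"]

    n_docs = len(documents)
    return {
        "field_level": field_correct / field_total,
        "document_level": docs_all_fields_ok / n_docs,
        "end_to_end": docs_auto_ok / n_docs,
    }
```

Document-level accuracy is the stricter of the first two by construction: one wrong VAT code out of twenty correct fields still marks the whole document as wrong.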

The breakdown

Based on rolling-month telemetry across our customers:

Scenario                                  | Header fields | Line-item fields | Edge-case fields
Day 1, new supplier, no prior data        | ~97%          | ~94%             | ~88%
After 30 days (20-50 POs per supplier)    | 99.8%         | 99.5%            | 98.5%
Established suppliers, 90+ days           | 99.9%+        | 99.7%            | 99.0%

“Edge-case fields” = VAT code, cost center, delivery address, payment terms, unit of measure, etc.

Note: our day-1 numbers are honest. A tool that promises “99% on day 1” for a never-seen supplier is not realistic unless billions of similar documents are already in the model’s training corpus.

Why not 100%

A few classes of errors that don’t go away without human review:

  • Handwritten fields on poor-quality scans. OCR on a skewed photo of a crumpled order form will never be 100%.
  • Inconsistent source information. The supplier states a total that doesn’t match the sum of lines. OrderPilot flags this — but determining the “correct” number requires human judgement.
  • Ambiguous item codes. “Bolt M12x80” could map to two different internal codes in your master data (galvanized vs stainless). Without context it’s not deterministically solvable.

For these cases there is no “perfect” AI — only “AI + good human-in-the-loop UI”. Our goal is that your team only reviews the 0.1% that needs attention, not the 100%.

How we measure it

This is the technical part. We run four tracks in parallel:

1. Gold-set benchmark (weekly)

An internal dataset of ~2,000 hand-labelled documents, kept stable across releases. On every model upgrade we check for regressions across all fields. A release can’t ship if any field drops below its baseline from the previous release.
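
A minimal sketch of that gate, assuming per-field gold-set scores are computed for every candidate release; the field names, values and zero tolerance are illustrative:

```python
# Hypothetical release gate: block shipping if any field's gold-set
# accuracy drops below the previous release's baseline.

def regressions(baseline: dict, candidate: dict) -> list:
    """Return (field, baseline_acc, candidate_acc) for every field that got worse."""
    return [
        (field, base_acc, candidate.get(field, 0.0))
        for field, base_acc in baseline.items()
        if candidate.get(field, 0.0) < base_acc
    ]

baseline  = {"po_number": 0.999, "quantity": 0.997, "vat_code": 0.990}   # illustrative
candidate = {"po_number": 0.999, "quantity": 0.998, "vat_code": 0.985}

if regressions(baseline, candidate):
    raise SystemExit("Release blocked: field-level regression on the gold set")
```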

2. Live confidence calibration

Every extracted field gets a confidence score (0.0-1.0). We plot the calibration curve weekly: is a field extracted at 90% confidence actually correct 90% of the time? If the curve diverges towards overconfidence, that blocks deploys.
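
A minimal sketch of that check, assuming we have a week’s worth of field confidences and whether each extraction turned out to be correct; the bin count and the overconfidence budget are illustrative:

```python
import numpy as np

def max_overconfidence(confidences, correct, n_bins: int = 10) -> float:
    """Largest (mean confidence - observed accuracy) across confidence bins.
    A large positive value means the model claims more certainty than it earns."""
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    # Bin index 0..n_bins-1; clip so a confidence of exactly 1.0 lands in the top bin.
    bin_idx = np.minimum((confidences * n_bins).astype(int), n_bins - 1)
    gaps = [
        confidences[bin_idx == b].mean() - correct[bin_idx == b].mean()
        for b in range(n_bins)
        if (bin_idx == b).any()
    ]
    return max(gaps) if gaps else 0.0

# Illustrative deploy gate: tolerate at most two points of overconfidence.
weekly_conf    = [0.92, 0.97, 0.61, 0.99, 0.88]
weekly_correct = [1,    1,    1,    1,    1]
if max_overconfidence(weekly_conf, weekly_correct) > 0.02:
    raise SystemExit("Deploy blocked: confidence scores are overconfident")
```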

3. Customer-correction signal

When a customer fixes a field in the review UI, we log it. Per customer, per supplier, per field. This is the fastest feedback loop: if a specific supplier’s PO format consistently gets a field wrong, we see it within 20-50 documents.
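
A simplified sketch of how such a correction log can be aggregated per customer, supplier and field; the counters and flagging thresholds are illustrative, not our production values:

```python
from collections import defaultdict

# Hypothetical correction log: every reviewer fix is counted against
# (customer, supplier, field), next to the total documents seen per supplier.
corrections = defaultdict(int)   # (customer, supplier, field) -> times corrected
documents   = defaultdict(int)   # (customer, supplier)        -> documents reviewed

def log_review(customer: str, supplier: str, corrected_fields: list) -> None:
    documents[(customer, supplier)] += 1
    for field in corrected_fields:
        corrections[(customer, supplier, field)] += 1

def problem_fields(min_docs: int = 20, min_rate: float = 0.10) -> list:
    """Fields corrected in at least min_rate of documents, once a supplier has enough traffic."""
    return [
        (customer, supplier, field, count / documents[(customer, supplier)])
        for (customer, supplier, field), count in corrections.items()
        if documents[(customer, supplier)] >= min_docs
        and count / documents[(customer, supplier)] >= min_rate
    ]
```

With a minimum of a few dozen documents per supplier, a field that is structurally misread for one supplier’s PO format shows up within the 20-50 documents mentioned above.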

4. Shadow evaluation in production

For every PO flowing through our pipeline we (optionally) run a second “shadow” model in parallel. Discrepancies between main and shadow get sent to a reviewer. This catches regressions the gold set misses, because production documents are always more diverse.
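
Conceptually the shadow track is a field-by-field diff between two extractions. A minimal sketch, where extract() and send_to_review() stand in for the real interfaces:

```python
# Sketch of shadow evaluation: run a second model on the same document and
# queue any field-level disagreement for human review. The model objects and
# send_to_review() are assumed interfaces, not our actual API.

def shadow_check(document, main_model, shadow_model, send_to_review) -> dict:
    main_fields   = main_model.extract(document)
    shadow_fields = shadow_model.extract(document)

    disagreements = {
        field: (main_fields.get(field), shadow_fields.get(field))
        for field in set(main_fields) | set(shadow_fields)
        if main_fields.get(field) != shadow_fields.get(field)
    }
    if disagreements:
        # A reviewer decides which value (if either) is right.
        send_to_review(document, disagreements)
    return disagreements
```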

What’s NOT in the number

Being upfront:

  • Detection errors (wrong document-type classification). If an invoice is accidentally processed as a PO, that doesn’t count in field accuracy. It’s a separate metric (document-classification accuracy, currently 99.7%).
  • At-source errors. If the supplier writes a wrong PO number in their document, we will faithfully extract the number exactly as it appears, wrong as it is. That’s accurate extraction, not a correct PO.
  • Downstream ERP-push errors. Our AI can read a PO perfectly; if AFAS then rejects the push due to a master-data mismatch, that’s not an extraction error but a mapping issue. We measure ERP-push success separately.

Comparison with competitors

For context, numbers competitors publish:

  • OrderPilot: 99.9% field-level, established suppliers
  • Workist: 80-90% error reduction (not field-level accuracy; different metric)
  • OrderEase: 98% reduction in order entry errors
  • DocumentPro: 98% extraction accuracy
  • Hyperfox: no public accuracy claim

These numbers are not 1-to-1 comparable because everyone measures differently. When evaluating, we recommend testing yourself with a sample of your own documents — that’s the only accuracy that actually counts.

What you can do to maximize your accuracy

  1. Clean master data. Make sure your vendor and item masters are up to date. 40% of “AI errors” are actually matching problems.
  2. Correct in the review UI, not post-hoc in the ERP. Every correction in OrderPilot’s UI trains the matching memory for that specific supplier.
  3. Set confidence thresholds based on risk. For high-risk suppliers or large amounts you can require a higher confidence threshold, forcing human review (see the sketch after this list).
  4. Read the accuracy reports we email per workspace each month. They show you per-field and per-supplier how it’s going.
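
For point 3, a minimal sketch of what risk-based thresholds can look like; the rule shapes, amounts and confidence values are illustrative, not OrderPilot defaults:

```python
# Illustrative risk-based review routing: higher-risk suppliers and larger
# order totals demand higher confidence before a PO may skip human review.

DEFAULT_THRESHOLD = 0.95  # illustrative

def required_confidence(supplier_risk: str, order_total: float) -> float:
    if supplier_risk == "high" or order_total >= 50_000:
        return 0.995   # in practice this routes almost everything to review
    if order_total >= 10_000:
        return 0.98
    return DEFAULT_THRESHOLD

def needs_review(field_confidences: dict, supplier_risk: str, order_total: float) -> bool:
    threshold = required_confidence(supplier_risk, order_total)
    return any(conf < threshold for conf in field_confidences.values())
```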

Further reading