The extraction took a day. The edge cases took a month.

Extracting data from a PDF invoice takes about a day to build.

Getting it to work on all your invoices takes much longer.

The happy path is easy: clean PDF, standard layout, vendor name top left, total bottom right, date somewhere in the middle. A few hours to map the fields, a few more to test. Done.

Then you hit the real invoices.

The one where “total” means something different because it’s a progress billing. The credit note formatted identically to a regular invoice except for a minus sign buried in a footnote. The vendor who sends an Excel file and calls it an invoice. The supplier whose PDF is actually a scanned image and the OCR returns garbage.

I’ve built invoice processing workflows for Exact. Every time, the extraction model handles the clean cases well. The time goes into the interpretation layer: what does this field mean in context? Is this final or partial? Does this amount include VAT?
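One of those questions, whether an amount includes VAT, can sometimes be answered by arithmetic once the net and VAT amounts have also been extracted. A minimal sketch, assuming all three amounts are available as decimals; the function name, labels, and one-cent tolerance are hypothetical, not taken from any particular system:

```python
from decimal import Decimal


def classify_total(total: Decimal, net: Decimal, vat: Decimal,
                   tolerance: Decimal = Decimal("0.01")) -> str:
    """Guess whether the extracted 'total' is the gross or the net amount."""
    # Total matches net + VAT: the amount includes VAT.
    # (With zero VAT, gross and net coincide and this branch wins.)
    if abs(total - (net + vat)) <= tolerance:
        return "includes_vat"
    # Total matches the net amount alone: VAT is not in it.
    if abs(total - net) <= tolerance:
        return "excludes_vat"
    # Neither fits: possibly a partial or progress billing, so do not guess.
    return "unclear"
```

An "unclear" result is a reason to send the invoice to a person rather than pick an interpretation.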

What ended up working: build a confidence score into the extraction. High confidence: process automatically. Low confidence: route to a short review queue with the fields pre-filled, so the reviewer only has to verify rather than re-enter.
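A minimal sketch of that routing, assuming the extraction step reports a confidence between 0 and 1 for each field. The `Field` class, the `route_invoice` function, the 0.9 threshold, and the sample values are all illustrative assumptions, not a specific product's API. The weakest field decides the route here, since one doubtful total is enough to make the whole invoice doubtful.

```python
from dataclasses import dataclass

# Hypothetical cut-off; tune it against your own review outcomes.
AUTO_THRESHOLD = 0.9


@dataclass
class Field:
    name: str
    value: str
    confidence: float  # 0.0 to 1.0, as reported by the extraction step


def route_invoice(fields: list[Field], threshold: float = AUTO_THRESHOLD) -> dict:
    """Send an invoice to automatic processing or to the review queue."""
    # Nothing extracted at all: never process automatically.
    if not fields:
        return {"route": "review", "prefilled": {}, "check_first": []}

    prefilled = {f.name: f.value for f in fields}
    weakest = min(f.confidence for f in fields)
    if weakest >= threshold:
        return {"route": "auto", "prefilled": prefilled, "check_first": []}

    # Keep every value pre-filled, and list the doubtful fields
    # lowest confidence first so the reviewer starts with those.
    doubtful = sorted(
        (f for f in fields if f.confidence < threshold),
        key=lambda f: f.confidence,
    )
    return {
        "route": "review",
        "prefilled": prefilled,
        "check_first": [f.name for f in doubtful],
    }


# Made-up example values, for illustration only.
invoice = [
    Field("vendor", "Acme BV", 0.98),
    Field("total", "1210.00", 0.62),
    Field("date", "2024-03-01", 0.95),
]
decision = route_invoice(invoice)
print(decision["route"], decision["check_first"])
```

The review route still carries every extracted value, so the person confirms or corrects instead of typing the invoice in from scratch.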

The goal isn’t 100% automation. It’s that the 20% of invoices that need a human get there fast, with the right information, instead of failing silently and turning into a problem three weeks later.

Budget a day for the extraction.

Budget a month for the edge cases. They’re worth it.
