
The extraction took a day. The edge cases took a month.

Stanislav Kapustin · May 6, 2026 · invoices · automation · ocr · accounting · n8n · ai

Extracting data from a PDF invoice takes about a day to build.

Getting it to work on all your invoices takes much longer.

The happy path is easy: clean PDF, standard layout, vendor name top left, total bottom right, date somewhere in the middle. A few hours to map the fields, a few more to test. Done.

Then you hit the real invoices.

The one where “total” means something different because it’s a progress billing. The credit note formatted identically to a regular invoice except for a minus sign buried in a footnote. The vendor who sends an Excel file and calls it an invoice. The supplier whose PDF is actually a scanned image and the OCR returns garbage.

I’ve built invoice processing workflows for Exact. Every time, the extraction model handles the clean cases well. The time goes into the interpretation layer: what does this field mean in context? Is this final or partial? Does this amount include VAT?
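One of those questions, whether an amount includes VAT, can sometimes be settled with plain arithmetic before anyone has to look at the document. A minimal sketch, assuming the extraction step returns a net amount, a VAT amount, and a total as decimal strings. The function name, the argument names, and the one-cent tolerance are hypothetical, not taken from any particular tool:

```python
from decimal import Decimal
from typing import Optional


def total_includes_vat(
    net: str,
    vat: str,
    total: str,
    tolerance: Decimal = Decimal("0.01"),  # assumed rounding allowance
) -> Optional[bool]:
    """Return True if total matches net + vat, False if it matches net alone,
    and None when neither reading fits."""
    net_d, vat_d, total_d = Decimal(net), Decimal(vat), Decimal(total)

    # Reading 1: the total is the gross amount.
    if abs(total_d - (net_d + vat_d)) <= tolerance:
        return True

    # Reading 2: the "total" is actually the net amount.
    if abs(total_d - net_d) <= tolerance:
        return False

    # Neither reading fits (progress billing, discounts, bad OCR, ...).
    return None
```

A `None` result is the interesting one: it means the numbers do not add up under either reading, which is a reason to hand the invoice to a person rather than guess.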

What ended up working: build a confidence score into the extraction. High confidence: process automatically. Low confidence: route to a short review queue with the fields pre-filled, so the reviewer only has to verify the values rather than re-enter them.
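The routing rule can be sketched in a few lines. This is an illustration under assumptions, not the exact workflow: it assumes the extraction step reports a confidence between 0 and 1 for each field, and the `ExtractedField` type, the `route_invoice` function, and the 0.90 threshold are all hypothetical names and values.

```python
from dataclasses import dataclass
from typing import Dict, List

# Hypothetical cut-off; tune it against invoices that were already reviewed.
AUTO_THRESHOLD = 0.90


@dataclass
class ExtractedField:
    name: str
    value: str
    confidence: float  # 0.0 to 1.0, as reported by the extraction step


def route_invoice(
    fields: List[ExtractedField], threshold: float = AUTO_THRESHOLD
) -> Dict[str, object]:
    """Decide whether an invoice is processed automatically or sent to review."""
    # An invoice is only as trustworthy as its least certain field.
    doubtful = [f.name for f in fields if f.confidence < threshold]

    # An empty extraction is never safe to process automatically.
    needs_review = bool(doubtful) or not fields

    return {
        "route": "review" if needs_review else "auto",
        # Pre-filled values, so the reviewer verifies instead of re-entering.
        "prefilled": {f.name: f.value for f in fields},
        "fields_to_check": doubtful,
    }
```

For example, an invoice whose total was read with confidence 0.98 but whose date came back at 0.62 goes to review with only the date flagged, while the total is still pre-filled for the reviewer.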

The goal isn’t 100% automation. It’s that the 20% of invoices that need a human get there fast, with the right information, instead of failing silently and turning into a problem three weeks later.

Budget a day for the extraction.

Budget a month for the edge cases. They’re worth it.


