AI project ROI: stop promising it, start measuring it

An investment committee signs off €400,000 for an AI project. On the decision slide, one line: “+30% productivity.” Nobody asks on which task, measured how, or against what starting point. The project ships. Eighteen months later, no one can say whether it earned a cent, because no one ever defined what they were watching. We see this in almost every audit we run. The problem isn’t the AI. It’s that the ROI was promised instead of being designed to be measured.

Let’s be blunt: most AI ROI numbers shown to committees are fiction. Not lies. Fiction. A round, attractive figure, drawn from an ideal use case and never tested against production. A case that holds up requires the opposite: a metric tied to real money, a baseline taken before, and the honesty to count every cost.

“+30% productivity” means nothing

A productivity gain isn’t value until it converts into a cost avoided or revenue earned. Thirty percent of time saved on a task that nobody bills, and that isn’t the team’s bottleneck, is worth zero euros on the P&L. Time “saved” that dissolves into meetings and coffee breaks has never lowered an invoice.

So the first decision is the metric. It has to tie the project to a cost or a revenue line your finance team already recognises. A few solid candidates:

  • Hours saved, valued: only if they turn into roles not backfilled, fewer temps, or capacity redirected to billable work.
  • Straight-through processing rate (the share of cases handled end to end without human touch): directly tied to cost per case.
  • Error rate: when an error carries a known price, like a credit note, a late payment, a penalty, a rework cycle.
  • Cycle time: if it unlocks revenue or reduces working capital.
  • Churn: if you can put a number on the value of a retained customer.

One metric, not five. Tied to money, not comfort. If you can’t write “one point of this metric = X euros,” you don’t have an ROI case yet. You have an intention.
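
To make the “one point = X euros” test concrete, here is a minimal Python sketch for a straight-through processing metric. The function name and the input figures are illustrative assumptions, and the result only holds if every hour freed really turns into a cost avoided.

```python
def euros_per_point(annual_cases: int, minutes_per_case: float, hourly_cost: float) -> float:
    """Yearly value, in euros, of one percentage point of straight-through processing.

    Assumes (hypothetically) that a case handled end to end removes all of its
    human handling time, and that the time removed is a real cost avoided.
    """
    annual_handling_cost = annual_cases * minutes_per_case / 60 * hourly_cost
    return annual_handling_cost / 100  # one point = 1% of the handling cost


# Illustrative inputs: 30,000 cases a year, 8 minutes each, EUR 35 per loaded hour.
value = euros_per_point(30_000, 8, 35.0)
print(f"One point of straight-through rate = EUR {value:,.0f} a year")
```

If you can’t fill in the three inputs from measured data, the case isn’t ready for the committee.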

No baseline before, no ROI after

This is the most expensive mistake, and the most common. Teams ship first, then go looking for proof of the gain. But with no measurement taken before, the “after” has nothing to compare against, and the retrospective estimate always leans in favour of whoever championed the project.

A baseline is measured in the field, not in a process document. How many invoices a month, really? How many go to exception handling? How much time does an accountant actually spend on them, rework and chasing included? Take it over several weeks to absorb seasonality, before the first line of code. A baseline measured after the project starts is a reconstructed memory.

The cost checklist everyone underestimates

The denominator of ROI is total cost. And that’s where the numbers quietly deflate. Build is almost never the main line item over time. Here’s what an honest case has to include:

  • Build: design, development, integration into your systems. The one line everyone counts.
  • Tokens and inference: cost per call, multiplied by real volume. Negligible in a demo, material at scale.
  • Data: collection, cleaning, labelling, quality work. Often the heaviest line, and the most forgotten.
  • MLOps / LLMOps: hosting, monitoring, evaluation sets, deployment pipelines.
  • Maintenance: models drift, a vendor changes an API, a prompt breaks. Budget a recurring load, not a one-off cost.
  • Change management: training, support, process redesign. Without it, adoption stalls and the gain stays theoretical.
  • Hidden cost of human correction: time spent reviewing, fixing, and catching the system’s output. On its own, it can be enough to halve the headline gain.

That last line is the one that decides everything. A system that automates 80% of cases but whose output you still have to check one by one does not save you 80% of the work. Quality control has a cost, and it doesn’t vanish because an AI is in the loop.
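
A rough way to see the effect, sketched in Python. The function and its parameters are hypothetical, and `review_effort` (checking time as a fraction of the original handling time) is an assumption you would have to measure in the field.

```python
def net_work_saved(automation_rate: float, review_share: float, review_effort: float) -> float:
    """Share of the original workload actually removed.

    automation_rate: share of cases the system handles (0..1)
    review_share:    share of automated cases a human still checks (0..1)
    review_effort:   checking time as a fraction of the original handling time
                     (assumption, to be measured)
    """
    return automation_rate * (1 - review_share * review_effort)


# 80% automated, every output checked, and a check assumed to take half the
# time of handling the case by hand.
saved = net_work_saved(automation_rate=0.80, review_share=1.0, review_effort=0.5)
print(f"Headline: 80% automated. Net work saved: {saved:.0%}")
```

Under those assumptions, the 80% on the slide becomes 40% of the work actually removed.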

Watch the unit cost at scale too. A pilot on 200 cases tells you almost nothing about cost at 50,000: variable costs (inference, human review, support) rise linearly, sometimes faster as edge cases multiply. Reason in cost per unit processed, at target volume. This is also where the trade-offs that move ROI get made: a smaller model, caching or batch processing can cut inference cost without touching value. None of it shows up in a POC; all of it decides profitability in production.
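
Reasoning in cost per unit can be sketched as follows. This is a minimal Python illustration, not a pricing model: the function, its parameters and every figure passed to it are hypothetical placeholders to be replaced with your own measurements.

```python
def cost_per_unit(volume: int, fixed_annual: float, inference_per_case: float,
                  review_rate: float, review_minutes: float, hourly_cost: float) -> float:
    """Annual cost per case processed at a given volume.

    fixed_annual:       hosting, monitoring, maintenance (does not scale with volume)
    inference_per_case: model cost per case
    review_rate:        share of cases a human still reviews (0..1)
    review_minutes:     human minutes per reviewed case
    """
    variable = inference_per_case + review_rate * review_minutes / 60 * hourly_cost
    return fixed_annual / volume + variable


# Hypothetical figures. The review rate is raised at target volume to reflect
# edge cases that a 200-case pilot never meets.
pilot = cost_per_unit(200, 30_000.0, 0.02, review_rate=0.05, review_minutes=3, hourly_cost=35.0)
target = cost_per_unit(50_000, 30_000.0, 0.02, review_rate=0.10, review_minutes=3, hourly_cost=35.0)
print(f"Pilot: EUR {pilot:.2f} per case, target volume: EUR {target:.2f} per case")
```

The pilot figure is dominated by fixed costs and the target figure by variable ones, which is why neither can be extrapolated from the other.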

Theoretical gain versus realised gain

Between the gain you calculate in a spreadsheet and the gain that reaches the P&L, two filters sit in the way. Adoption: a tool used by 40% of the team doesn’t deliver 100% of the benefit, and that’s almost always the picture in year one. Human takeover: the share of cases the system hands back to a person, out of caution or inability. A 70% automation rate that drops to 55% once the guardrails are switched on means more than a fifth of the expected value is gone.

An honest case therefore shows a range, not a point. Theoretical gain at the top, expected realised gain at the bottom, after applying adoption and takeover assumptions you’re willing to own. Then you update it with real figures once the system is live. ROI isn’t a promise made at launch. It’s a measurement tracked over time.
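
A minimal Python sketch of such a range, assuming the two filters combine multiplicatively. That is a simplification, and the function name and the 80% adoption figure are hypothetical.

```python
def gain_range(theoretical: float, adoption: float,
               target_rate: float, real_rate: float) -> tuple[float, float]:
    """Return (expected realised gain, theoretical gain).

    adoption:    share of the team actually using the tool (0..1)
    target_rate: automation rate assumed in the business case
    real_rate:   automation rate observed once guardrails are on
    Assumption: adoption and takeover scale the gain independently.
    """
    realised = theoretical * adoption * (real_rate / target_rate)
    return realised, theoretical


low, high = gain_range(theoretical=100_000, adoption=0.8, target_rate=0.70, real_rate=0.55)
print(f"Annual gain: between EUR {low:,.0f} and EUR {high:,.0f}")
```

Once the system is live, replace the assumed adoption and automation rates with observed ones and recompute.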

A worked example: automating supplier invoices

Orders of magnitude to show the mechanics, not a case study. A company processes around 30,000 supplier invoices a year. Measured baseline: 8 minutes of human work per invoice on average, roughly 4,000 hours a year, valued at say €35 per loaded hour, somewhere around €140,000 of processing cost annually.

The goal: automate extraction and matching, target a straight-through rate of 70%. On paper, 70% of €140,000 is close to €100,000 in annual savings. The theoretical gain. Now run it through the filters of reality.

  • The real rate settles closer to 55% in the early months (edge cases, suppliers with exotic formats).
  • The remaining 45% still need a human, and around 10% of the “automated” cases get reviewed for safety.
  • Estimated realised gain: closer to €60,000–70,000 a year, not €100,000.

Against that, the costs. Build and integration: say €120,000 in year one. Recurring annual cost (inference, hosting, monitoring, maintenance, a share of change management): on the order of €30,000. Year one: roughly €65,000 of gain against €150,000 of cost, so negative. No surprise there. Over three years: about €195,000 in cumulative gains against €210,000 in costs. The project lands about €15,000 short of break-even, not the triumph on the slide. And that’s exactly what you need to decide, far more than the “+30%.”
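
The same arithmetic, written out in Python so the assumptions can be changed and re-run. The figures are the ones used in this example. Taking €65,000 as the midpoint of the realised range and holding it flat over three years is a simplifying assumption.

```python
INVOICES_PER_YEAR = 30_000
MINUTES_PER_INVOICE = 8
LOADED_HOURLY_COST = 35      # EUR
REALISED_GAIN = 65_000       # EUR per year, midpoint of the 60,000-70,000 range (assumed flat)
BUILD_COST = 120_000         # EUR, paid once in year one
RECURRING_COST = 30_000      # EUR per year

baseline_hours = INVOICES_PER_YEAR * MINUTES_PER_INVOICE / 60
baseline_cost = baseline_hours * LOADED_HOURLY_COST
theoretical_gain = 0.70 * baseline_cost


def cumulative(years: int) -> list[tuple[int, int, int, int]]:
    """Return (year, cumulative gain, cumulative cost, net) for each year."""
    rows = []
    for year in range(1, years + 1):
        gain = REALISED_GAIN * year
        cost = BUILD_COST + RECURRING_COST * year
        rows.append((year, gain, cost, gain - cost))
    return rows


print(f"Baseline: {baseline_hours:,.0f} hours, EUR {baseline_cost:,.0f} a year")
print(f"Theoretical gain at 70%: EUR {theoretical_gain:,.0f}")
for year, gain, cost, net in cumulative(3):
    print(f"Year {year}: gain {gain:,} / cost {cost:,} / net {net:,}")
```

Extending the loop to a fourth year under the same flat assumptions is what tips the net positive, which is why the horizon has to be fixed before the numbers are known.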

Horizon, decision threshold, and when to kill a project

Set the horizon before you start, not after you’ve seen the numbers. For most enterprise AI use cases, 2 to 3 years is reasonable: long enough to absorb the build, short enough that a vendor or a model won’t make the solution obsolete. Beyond 4 or 5 years, your inference-cost assumptions aren’t worth much.

Define the decision threshold up front too: what level of realised gain, by what date, justifies continuing, expanding, or stopping. And give yourself permission to kill. A project that plateaus well below its target, whose adoption won’t lift off, or whose human-correction cost cancels the gain, should be stopped. Killing it early isn’t failure. It’s investment discipline. The real waste is the project kept on budgetary life support because no one dares say out loud that it earns nothing.
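
One way to pin the threshold down before launch is to write it as a rule. A minimal Python sketch follows; the function, the 50% stop line and the 100% expand line are hypothetical values to be agreed with the committee, not recommendations.

```python
def decide(realised_gain: float, target_gain: float,
           stop_below: float = 0.5, expand_above: float = 1.0) -> str:
    """Return 'stop', 'continue' or 'expand' at the agreed review date.

    stop_below / expand_above are fractions of the target gain (assumed values).
    """
    ratio = realised_gain / target_gain
    if ratio < stop_below:
        return "stop"
    if ratio >= expand_above:
        return "expand"
    return "continue"


# Measured at the review date against a target of EUR 100,000 a year.
print(decide(realised_gain=65_000, target_gain=100_000))
```

The value of the exercise is less in the code than in having the thresholds written down before anyone has seen the results.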

A defensible ROI is never a round number announced to the committee. It’s a metric tied to money, a baseline taken before, every cost counted, and a range you adjust against reality. If you’re scoping a project on those terms, our diagnostic & scoping offer lays out the assessment and the calculation in a few days, and you’ll find the prioritisation method in our guides & resources.

