The POC was flawless. Three weeks of work, a demo that pulled gasps out of the steering committee, an assistant answering correctly on the sample of contracts we had carefully prepared for it. Green light for production. Then, wired up to the real records, it fell apart: half the contracts were unreadable scans, customer numbers were written one way in the CRM and another in the ERP, and nobody could say who was allowed to use the “risk segment” field. The model hadn’t changed. The data had.
From where we sit, the cause is nearly always the same. In the field, roughly 80% of what sinks an AI project happens on the data side, upstream of the model. That figure is not from a sourced study; it is a pattern we see project after project. The model itself is rarely the problem anymore; the material you feed it, almost always. Here are the six checks that decide whether your data holds up or quietly wrecks the project. Running them takes a few hours and can save months spent heading the wrong way.
- Access: silos, permissions, formats.
- Quality: completeness, duplicates, freshness, outliers.
- Documentation: a data dictionary and shared business definitions.
- Lineage and compliance: traceability, GDPR, personal data.
- Volume and representativeness: bias, rare classes.
- Serving data in production: latency, freshness, pipelines.
1. Access: can you even reach the data?
The first question isn’t “is the data good?” but “can you get to it, legally and technically?” In most companies we work with, the useful data is scattered: some in the ERP, some in the CRM, the rest in Excel files and inboxes. Each silo has its own owner, its own permissions, its own format. The blockers show up every single time:
- A business-critical database reachable only through a manual export, once a month, by a single person.
- Access rights that need three sign-offs and six weeks before a project can so much as read a table.
- Formats that mix the structured (clean order lines) with the unstructured (scanned PDFs, emails, free-text notes).
To score this point, list your sources and note, for each: where it lives, who is allowed to do what with it, and what format it comes out in. That list is already a data map. Most organisations don’t have one, and that’s where projects burn their first few weeks.
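That inventory can start as a few lines of script rather than a tool purchase. Here is a minimal sketch, assuming made-up source names, owners and blocker rules; none of them come from a real system, and the rules are ours to adjust.

```python
from dataclasses import dataclass


@dataclass
class Source:
    """One row of the data map. All values below are illustrative."""
    name: str
    location: str      # where it lives
    owner: str         # who decides on access ("" if nobody is named)
    access: str        # "api", "sql" or "manual_export"
    data_format: str   # "structured" or "unstructured"


def blockers(source: Source) -> list[str]:
    """Return the access blockers for one source (hypothetical rules)."""
    found = []
    if not source.owner:
        found.append("no named owner")
    if source.access == "manual_export":
        found.append("manual export only")
    if source.data_format == "unstructured":
        found.append("needs extraction before use")
    return found


data_map = [
    Source("orders", "ERP", "finance", "sql", "structured"),
    Source("contracts", "shared drive", "", "manual_export", "unstructured"),
]

for src in data_map:
    print(src.name, blockers(src) or "ready")
```

The point of keeping it this small is that the list gets written at all; a spreadsheet with the same five columns does the same job.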
2. Quality: what hides behind a tidy table
Accessible data is not usable data. Quality has several dimensions, and each one sabotages a model in its own way.
Completeness
The postcode field filled in 60% of the time, a cancellation date missing for customers who have already left. A model learns your gaps beautifully: if a variable is only captured for large accounts, it will infer wrong rules for everyone else.
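A global fill rate hides exactly the case described here, where a field is captured for one segment and not the others. Here is a minimal sketch that measures completeness per segment, assuming rows arrive as dictionaries and that the field names `segment` and `postcode` are illustrative.

```python
from collections import defaultdict


def fill_rate_by_segment(rows, field, segment_field):
    """Share of rows where `field` is filled, per value of `segment_field`."""
    filled = defaultdict(int)
    total = defaultdict(int)
    for row in rows:
        segment = row[segment_field]
        total[segment] += 1
        if row.get(field) not in (None, ""):
            filled[segment] += 1
    return {segment: filled[segment] / total[segment] for segment in total}


# Illustrative rows: the postcode is mostly captured for large accounts.
customers = [
    {"segment": "large", "postcode": "75001"},
    {"segment": "large", "postcode": "69002"},
    {"segment": "small", "postcode": ""},
    {"segment": "small", "postcode": None},
    {"segment": "small", "postcode": "13001"},
]

rates = fill_rate_by_segment(customers, "postcode", "segment")
for segment, rate in rates.items():
    print(f"{segment}: {rate:.0%} filled")
```

A large gap between segments is the signal to look for, more than the overall percentage.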
Duplicates
“Acme Ltd”, “Acme Limited”, “ACME LTD.”: three rows, one customer. In analysis, your counts are wrong. In training, the same case carries three times the weight it should.
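A first pass at catching these does not need fuzzy matching. Here is a minimal sketch, assuming a small hand-written table of legal-form variants (`LEGAL_FORMS` is hypothetical and would need extending for real data); it only catches spelling and punctuation variants, not typos.

```python
import re

# Hypothetical mapping of legal-form variants to one canonical token.
LEGAL_FORMS = {"limited": "ltd", "ltd": "ltd", "incorporated": "inc", "inc": "inc"}


def normalise_name(name: str) -> str:
    """Lowercase, strip punctuation, and unify legal-form suffixes."""
    tokens = re.sub(r"[^\w\s]", "", name.lower()).split()
    return " ".join(LEGAL_FORMS.get(token, token) for token in tokens)


def group_duplicates(names):
    """Group raw names by normalised key; keep only keys with several rows."""
    groups = {}
    for name in names:
        groups.setdefault(normalise_name(name), []).append(name)
    return {key: rows for key, rows in groups.items() if len(rows) > 1}


names = ["Acme Ltd", "Acme Limited", "ACME LTD.", "Globex Inc"]
duplicates = group_duplicates(names)
print(duplicates)
```

Whatever the groups look like, a human should confirm them before any merge: two distinct companies can share a normalised name.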
Freshness and outliers
A product reference last touched eighteen months ago, still driving decisions about today’s catalogue. An order amount of 9,999,999 that is really a placeholder for “unknown”. These slip past the human eye and distort everything downstream.
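Both problems can be flagged mechanically once someone has written the rules down. Here is a minimal sketch, assuming a known list of placeholder amounts and a one-year freshness threshold; both are assumptions to replace with your own business rules.

```python
from datetime import date, timedelta

SENTINEL_AMOUNTS = {9_999_999}   # placeholder values, assumed list
MAX_AGE = timedelta(days=365)    # assumed freshness threshold


def flag_record(record, today):
    """Return the list of quality flags for one record."""
    flags = []
    if record["amount"] in SENTINEL_AMOUNTS:
        flags.append("sentinel_amount")
    if today - record["updated_on"] > MAX_AGE:
        flags.append("stale")
    return flags


today = date(2025, 1, 1)  # fixed date so the example is reproducible
records = [
    {"id": 1, "amount": 120.0, "updated_on": date(2024, 11, 3)},
    {"id": 2, "amount": 9_999_999, "updated_on": date(2024, 12, 1)},
    {"id": 3, "amount": 45.5, "updated_on": date(2023, 6, 30)},
]
flagged = {r["id"]: flag_record(r, today) for r in records}
print(flagged)
```

Flagging rather than deleting keeps the decision visible: the row stays, and downstream code chooses whether to use it.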
The tooling for this exists and it’s mature. With dbt, you put quality tests straight into your transformations: uniqueness, not-null, accepted values. Great Expectations goes further: you declare explicit expectations about your datasets, and you can halt the pipeline the moment one is violated. The goal isn’t perfect quality, it’s quality that is visible and tested rather than assumed.
3. Documentation: is everyone talking about the same thing?
This is the point everyone underrates, and the one that costs the most in meetings. Ask three people to define an “active customer”: marketing counts sign-ups, finance counts billed accounts, sales counts recent contacts. None of them is wrong, and that’s precisely the trouble.
Without a data dictionary and shared definitions, every figure your AI produces is open to dispute. An assistant that reports “12,400 active customers” triggers an hour of debate about methodology instead of a decision. A modern data catalogue answers this: each table and field, its business definition, its owner, its freshness. Add data contracts between the team producing the data and the one consuming it, and you lock down an explicit commitment on schema, types and semantics. Less exciting than a chatbot demo, far more load-bearing.
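In its simplest form, a data contract is a schema both teams agree on plus a check that runs on every delivery. Here is a minimal sketch, assuming a hypothetical `active_customers` feed with three fields; it checks presence and types only, while the semantics still have to live in the written definition.

```python
# Hypothetical contract for an "active_customers" feed: field name -> type.
CONTRACT = {
    "customer_id": str,
    "is_active": bool,
    "last_invoice_date": str,  # ISO date string, by assumption
}


def contract_violations(row: dict, contract: dict) -> list[str]:
    """List every way `row` departs from `contract`."""
    problems = []
    for field, expected_type in contract.items():
        if field not in row:
            problems.append(f"missing field: {field}")
        elif not isinstance(row[field], expected_type):
            problems.append(f"wrong type for {field}: {type(row[field]).__name__}")
    for field in row:
        if field not in contract:
            problems.append(f"unexpected field: {field}")
    return problems


good = {"customer_id": "C-001", "is_active": True, "last_invoice_date": "2024-10-31"}
bad = {"customer_id": 1001, "is_active": "yes"}

print(contract_violations(good, CONTRACT))
print(contract_violations(bad, CONTRACT))
```

Run on the producer side before publishing, a check like this turns a silent schema change into a failed delivery.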
4. Lineage and compliance: where did this come from, and are you allowed?
When a figure comes out wrong, the first question is: where did it come from, and what transformations did it pass through? That’s lineage. Without it, debugging a number means groping blindly through a chain of scripts nobody owns anymore. For the models it manages, dbt builds that dependency graph automatically, and you trace back to the source in a few clicks.
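Underneath, lineage is a graph walk. Here is a minimal sketch of tracing a figure back to its raw sources, assuming a hand-written dependency map; the model names are hypothetical, and this is not how dbt stores its graph.

```python
# Hypothetical dependency graph: each model lists the models it reads from.
PARENTS = {
    "revenue_dashboard": ["orders_clean", "customers_clean"],
    "orders_clean": ["erp_orders_raw"],
    "customers_clean": ["crm_customers_raw", "erp_customers_raw"],
    "erp_orders_raw": [],
    "crm_customers_raw": [],
    "erp_customers_raw": [],
}


def upstream_sources(node: str, parents: dict) -> set[str]:
    """Walk the graph upwards and return the raw sources feeding `node`."""
    sources, stack, seen = set(), [node], set()
    while stack:
        current = stack.pop()
        if current in seen:
            continue
        seen.add(current)
        deps = parents.get(current, [])
        if not deps:
            # No parents: this is a raw source.
            sources.add(current)
        stack.extend(deps)
    return sources


print(sorted(upstream_sources("revenue_dashboard", PARENTS)))
```

If nobody on the team can write that map for a given figure, the lineage problem is organisational before it is technical.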
Then comes compliance. If your data holds personal information, GDPR is not optional, and the EU AI Act adds another layer. Three questions to settle beforehand, never after:
- What is the legal basis for using this data for AI, and does it hold up against the original purpose it was collected for?
- Do you know exactly which fields are personal data, and where they propagate across your pipelines?
- If a customer exercises their right to erasure, can you also remove them from training sets and indexes?
These questions feel tedious. Far less tedious than a whole project being challenged six months after launch because the legal basis was never checked.
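The third question, erasure, is the one that can be rehearsed in code before launch. Here is a minimal sketch, assuming a training set held as a list of dictionaries and a retrieval index keyed by document id; both shapes and the `customer_id` field are assumptions.

```python
def erase_customer(customer_id, training_rows, index):
    """Remove one customer from a training set and a retrieval index.

    `training_rows` is a list of dicts, `index` maps document ids to
    metadata dicts. Both shapes are assumptions for this sketch.
    """
    kept_rows = [r for r in training_rows if r["customer_id"] != customer_id]
    kept_index = {
        doc_id: meta for doc_id, meta in index.items()
        if meta["customer_id"] != customer_id
    }
    removed = (len(training_rows) - len(kept_rows)) + (len(index) - len(kept_index))
    return kept_rows, kept_index, removed


training_rows = [
    {"customer_id": "C-1", "text": "contract A"},
    {"customer_id": "C-2", "text": "contract B"},
    {"customer_id": "C-1", "text": "contract C"},
]
index = {
    "doc-10": {"customer_id": "C-1"},
    "doc-11": {"customer_id": "C-3"},
}

rows, idx, removed = erase_customer("C-1", training_rows, index)
print(removed, "items removed")
```

This only works if every row and every indexed document carries the customer identifier, which is the real test. It also covers stored data only; a model already trained on those rows is a separate question.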
5. Volume and representativeness: enough data, and the right kind?
“We have millions of rows” is falsely reassuring. The real question isn’t raw volume, it’s representativeness. A fraud-detection model trained on a history where 99.8% of transactions are legitimate will mostly learn to say “no fraud”, and it’ll be right 99.8% of the time while missing the whole point. The rare classes are exactly the ones that matter. Same trap with bias: if your history reflects past practice (a customer type you never targeted, a region that’s underrepresented), the model carries those blind spots forward and dresses them up as facts. It invents nothing; it photographs your data faithfully, flaws and all.
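The fraud example is easy to reproduce with arithmetic alone. Here is a minimal sketch on synthetic labels, assuming 1,000 transactions of which 2 are fraudulent, and a "model" that never flags anything.

```python
def accuracy_and_recall(y_true, y_pred):
    """Accuracy over all cases, recall over the positive (fraud) class."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    true_pos = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    positives = sum(y_true)
    accuracy = correct / len(y_true)
    recall = true_pos / positives if positives else 0.0
    return accuracy, recall


# 1,000 synthetic transactions, 2 of them fraudulent (0.2%).
y_true = [1, 1] + [0] * 998
always_legit = [0] * 1000  # a "model" that never flags anything

accuracy, recall = accuracy_and_recall(y_true, always_legit)
print(f"accuracy={accuracy:.3f} recall={recall:.3f}")
```

Accuracy comes out at 99.8% and recall on fraud at zero, which is why the rare class needs its own metric from day one.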
And here’s the contrarian bit: for plenty of enterprise use cases, you have more than enough data and you’re waiting for the wrong reasons. Teams postpone a project “until we’ve gathered more data” when a few thousand well-labelled examples would do the job. Volume is only a wall if you’re building a generic model from scratch. For a specific business case, representativeness and labelling quality matter far more than sheer count.
6. Serving data in production: an Excel export is not a pipeline
This is the point that separates a POC from a product, and the most common blind spot. In a demo, you load a file by hand and everything works. In production, the model needs fresh data, at the right moment, automatically, thousands of times a day. A monthly Excel export won’t honour that contract for a second. The questions that matter then:
- Latency: can your case live with data recomputed overnight, or does it need near real time?
- Freshness: at what age does a record become dangerous for the decision it feeds?
- Pipelines: who gets paged when ingestion breaks at 3 a.m., and does the model keep serving answers on stale data without telling anyone?
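The last point, serving on stale data without telling anyone, can be closed with a guard in front of the model. Here is a minimal sketch, assuming the pipeline records a last-load timestamp and that the maximum acceptable age is a business decision; `StaleDataError` and `check_freshness` are hypothetical names.

```python
from datetime import datetime, timedelta, timezone


class StaleDataError(RuntimeError):
    """Raised when the data behind an answer is older than allowed."""


def check_freshness(last_loaded_at, max_age, now=None):
    """Return the data age, or raise if it exceeds `max_age`."""
    now = now or datetime.now(timezone.utc)
    age = now - last_loaded_at
    if age > max_age:
        raise StaleDataError(f"data is {age} old, limit is {max_age}")
    return age


# Fixed clock so the example is reproducible.
now = datetime(2025, 1, 1, 12, 0, tzinfo=timezone.utc)
limit = timedelta(hours=24)

print(check_freshness(now - timedelta(hours=2), limit, now=now))
try:
    check_freshness(now - timedelta(days=3), limit, now=now)
except StaleDataError as err:
    print("refusing to answer:", err)
```

Failing loudly is a design choice; the alternative is to answer with an explicit "data as of" warning, but never to answer silently.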
A warehouse like Snowflake or BigQuery gives you the foundation to store and recompute at scale; dbt runs the transformations in dependency order, versioned and tested. For an LLM case with document retrieval, you need a vector store (pgvector, say, if you’re already on PostgreSQL) and above all a pipeline that re-indexes when documents change. Otherwise your assistant will confidently cite a version of a procedure you deleted six months ago.
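The re-indexing step comes down to knowing what changed since the last run. Here is a minimal sketch using content hashes, assuming documents are available as text keyed by id and that the index keeps the hash it last saw; it plans the work only and makes no calls to pgvector or any other store.

```python
import hashlib


def content_hash(text: str) -> str:
    """Stable fingerprint of a document's content."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def plan_reindex(documents: dict, indexed_hashes: dict):
    """Compare current documents with what the index last saw.

    Returns (to_upsert, to_delete): ids to embed again and ids to remove.
    """
    to_upsert = [
        doc_id for doc_id, text in documents.items()
        if indexed_hashes.get(doc_id) != content_hash(text)
    ]
    to_delete = [doc_id for doc_id in indexed_hashes if doc_id not in documents]
    return to_upsert, to_delete


# Illustrative state: proc-1 was edited, proc-2 was deleted, proc-3 is new.
documents = {"proc-1": "procedure v2", "proc-3": "new procedure"}
indexed_hashes = {
    "proc-1": content_hash("procedure v1"),
    "proc-2": content_hash("old procedure"),
}

to_upsert, to_delete = plan_reindex(documents, indexed_hashes)
print("re-embed:", to_upsert, "remove:", to_delete)
```

The deletion list is the part teams forget, and it is the one that stops the assistant citing a procedure that no longer exists.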
Start small and honest, not with a two-year programme
The temptation after this diagnostic is to launch a grand “data” programme: catalogue everything, clean everything, govern everything before a single AI use case. That’s the opposite mistake, and just as expensive as the reckless POC; a two-year project usually dies before producing anything useful.
Our bias runs the other way: narrow and honest. Take one use case that matters. Run its data, and only its data, through these six checks. Fix what blocks that case, ship it, repeat. You build your data foundation one case at a time, delivering value at each step. Perfect data doesn’t exist; data that’s good enough for a specific case does.
If you want to measure this picture rather than guess at it, that is what our modelling & data mapping offer is for: it scores each source for AI readiness. For a method you can run in-house, browse our guides & resources.