The AI Data Readiness Checklist — 28 Questions Before You Train, Fine-Tune, or RAG
A pragmatic checklist across data quality, lineage, access, governance, and freshness — built for teams that need a defensible answer before their first AI project ships.
Most organizations discover data problems three weeks after a project has started. A data engineer flags missing fields. Legal raises a question about personal data in the training set. Nobody can trace where a specific record came from.
This checklist surfaces those problems before they become project risks. It covers seven dimensions of AI readiness that separate organizations that ship reliable AI from those that don't. Work through the 28 questions, score each dimension, and you have a defensible baseline to share with your AI lead, legal team, and, if necessary, a regulator.
For the broader organizational picture, see the full AI readiness assessment.
What "data readiness for AI" actually means (and where most checklists go wrong)
Most checklists ask whether your data is "clean". That question is necessary but not sufficient. A dataset can pass every quality check and still be unfit for AI deployment — no lineage documentation, access controls not designed for model training pipelines, personal-data fields that create GDPR exposure the moment they enter a training loop, schema drift undetected for six months. None of these show up on a quality report.
The AI readiness pillars framework treats data readiness as one of five organizational pillars. Within the data dimension, seven sub-dimensions matter:
- Data quality (completeness, consistency, accuracy, timeliness)
- Data lineage
- Access controls and identity
- Labeling and ground truth
- Governance metadata
- Freshness and pipeline reliability
- Vector and unstructured-data readiness
The fifth, governance metadata, is where most vendor checklists fail. They treat governance as a downstream concern. The EU AI Act treats it as a precondition: Article 10 requires data governance, including category classification and bias assessment, before high-risk systems are trained. This checklist is built around that sequence.
Dimension 1 — Data quality
Completeness, consistency, accuracy, timeliness — with numeric thresholds
Poor data quality is the most commonly cited cause of AI project failure, but "poor quality" is not a useful diagnosis. These four attributes each require a threshold, not just a description.
- Completeness: Critical fields have a null rate below 5%. Fields at 5–20% null have documented imputation logic. Fields above 20% null are excluded or flagged.
- Consistency: Cross-system comparison shows fewer than 2% conflicting values for shared entities.
- Accuracy: A sample validation of at least 500 records against a trusted reference source has been completed in the last 90 days. Accuracy rate is documented.
- Timeliness: Maximum acceptable data lag between source event and pipeline availability is defined, monitored, and within threshold.
Schema drift detection
Schema drift — columns added, renamed, or removed without notification — is one of the most common causes of silent model degradation.
- A schema change detection mechanism covers all upstream data sources.
- Schema changes alert the AI pipeline team, not just the data team.
- The last 12 months of schema change history is documented and reviewed.
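As an illustration of what a detection mechanism might compare, the sketch below diffs two schema snapshots. It assumes each snapshot is a simple column-to-type mapping; the `diff_schema` helper and the column names are hypothetical, and a rename shows up as one removal plus one addition.

```python
from typing import Dict, List


def diff_schema(old: Dict[str, str], new: Dict[str, str]) -> Dict[str, List[str]]:
    """Compare two column->type snapshots and list what changed."""
    added = sorted(new.keys() - old.keys())
    removed = sorted(old.keys() - new.keys())
    retyped = sorted(c for c in old.keys() & new.keys() if old[c] != new[c])
    return {"added": added, "removed": removed, "retyped": retyped}


# Hypothetical snapshots taken a day apart.
yesterday = {"customer_id": "int", "signup_date": "date", "region": "str"}
today = {"customer_id": "str", "signup_dt": "date", "region": "str"}

drift = diff_schema(yesterday, today)
has_drift = any(drift.values())  # True when any category is non-empty
```

In practice the `has_drift` flag would feed the alert that reaches the AI pipeline team, and each `drift` result would be appended to the schema change history.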
Dimension 2 — Data lineage
From source system to model input
Lineage answers: can you trace any record in your training or inference dataset back to its origin? Without it, a hallucination post-mortem is guesswork.
- Every dataset used in AI training or inference has a documented source system, extraction method, and transformation history.
- Lineage is machine-readable, not just a slide diagram. A query on any record returns its full provenance chain.
- Source system changes trigger automatic assessment of impact on downstream AI datasets.
- Lineage documentation is stored with the model artefact, not in a separate wiki that may drift out of sync.
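To make "machine-readable" concrete, here is a minimal sketch of a provenance lookup. It assumes dataset-level lineage where each dataset has a single upstream parent (real lineage is usually a graph with multiple parents); `LineageStep`, `provenance_chain`, and the dataset names are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass(frozen=True)
class LineageStep:
    dataset: str
    operation: str          # e.g. "extract", "deduplicate", "derive features"
    parent: Optional[str]   # upstream dataset; None marks a source system


def provenance_chain(dataset: str, steps: Dict[str, LineageStep]) -> List[LineageStep]:
    """Walk from a dataset back to its source, failing loudly on gaps."""
    chain: List[LineageStep] = []
    seen = set()
    current: Optional[str] = dataset
    while current is not None:
        if current in seen:
            raise ValueError(f"lineage cycle detected at {current!r}")
        if current not in steps:
            raise KeyError(f"no lineage recorded for {current!r}")
        seen.add(current)
        step = steps[current]
        chain.append(step)
        current = step.parent
    return chain


steps = {
    "crm_export": LineageStep("crm_export", "extract", None),
    "customers_clean": LineageStep("customers_clean", "deduplicate", "crm_export"),
    "training_set_v3": LineageStep("training_set_v3", "derive features", "customers_clean"),
}

chain = provenance_chain("training_set_v3", steps)
path = [s.dataset for s in chain]
```

The useful property is the failure mode: a dataset with no recorded step raises an error instead of silently returning a partial chain, which is the gap a slide diagram hides.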
Why lineage is now an AI Act requirement, not a nice-to-have
Under the EU AI Act, providers of high-risk systems must document the origin, collection method, and processing applied to training data (Article 10). Without lineage tooling, this requires manual reconstruction — expensive and unreliable. If your system is challenged post-deployment, lineage is the primary evidence that governance met the standard. The AI compliance checklist covers Article 10 alongside the full set of high-risk obligations.
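Lineage also matters once a system falls into an AI Act high-risk category. Classification itself turns on the system's intended purpose, but a high-risk system without clean data provenance cannot demonstrate that its training data met the Article 10 standard.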
- Article 10 data documentation requirements are assigned to a named owner.
- Lineage covers not just raw data but enrichments, joins, and derived features added during preprocessing.
Dimension 3 — Access controls and identity
Access readiness belongs in this checklist because access gaps create two distinct AI risks: training on data the model should never have seen, and inference-time leakage through retrieval.
- Role-based access controls on data sources used for AI are documented and current. Access rights reflect least privilege.
- Service accounts used by AI pipelines have dedicated identities with scoped permissions — not shared credentials or admin accounts.
- Access to sensitive data fields requires explicit approval and is logged for audit.
- When employees leave or change roles, access rights are reviewed within 5 business days.
- For RAG systems: the retrieval layer enforces the same access controls as the underlying data store. A user cannot retrieve a document through the AI that they could not access directly.
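The RAG check above can be sketched as a filter applied to retrieval candidates before anything reaches the prompt. This is an illustrative sketch, assuming each chunk carries the groups allowed to read its source document; `Chunk`, `authorized_results`, and the group names are hypothetical and not tied to any particular vector store.

```python
from dataclasses import dataclass
from typing import FrozenSet, List


@dataclass(frozen=True)
class Chunk:
    doc_id: str
    text: str
    allowed_groups: FrozenSet[str]  # copied from the source document's ACL
    score: float                    # similarity score from the retriever


def authorized_results(
    candidates: List[Chunk], user_groups: FrozenSet[str], top_k: int = 3
) -> List[Chunk]:
    """Drop chunks the user cannot read, then keep the best top_k."""
    visible = [c for c in candidates if c.allowed_groups & user_groups]
    return sorted(visible, key=lambda c: c.score, reverse=True)[:top_k]


candidates = [
    Chunk("hr-salaries", "Salary bands for 2024", frozenset({"hr"}), 0.93),
    Chunk("eng-runbook", "How to restart the ingest job", frozenset({"engineering"}), 0.81),
    Chunk("handbook", "Company holiday policy", frozenset({"hr", "engineering"}), 0.64),
]

results = authorized_results(candidates, frozenset({"engineering"}))
visible_ids = [c.doc_id for c in results]
```

Filtering happens before the `top_k` cut so that restricted documents neither reach the model nor use up result slots that permitted documents could have filled.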
Dimension 4 — Labeling and ground truth
Labeling readiness applies to supervised learning and fine-tuning. For RAG systems, ground truth means the accuracy and currency of the document corpus.
- The labeling process is documented: who labeled, what guidelines were used, what inter-annotator agreement was achieved.
- Label quality has been audited on a sample. Error rate is below the acceptable threshold for the task (3% is a reasonable baseline for classification).
- For fine-tuning: training, validation, and test sets are drawn from non-overlapping time windows or sources to prevent leakage.
- Ground-truth labels are versioned. The label version used to train each model version is recorded.
- For RAG: the document corpus has a defined refresh cadence. Stale documents are removed or flagged.
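For the fine-tuning check on non-overlapping time windows, a split by event date might look like the sketch below. It assumes every example carries a date; `split_by_time`, the cutoff dates, and the sample rows are hypothetical.

```python
from datetime import date
from typing import Dict, List, Tuple

Row = Tuple[date, str]  # (event date, example payload)


def split_by_time(rows: List[Row], train_end: date, val_end: date) -> Dict[str, List[Row]]:
    """Assign rows to train / validation / test by strictly ordered date windows."""
    if not train_end < val_end:
        raise ValueError("train_end must be earlier than val_end")
    splits: Dict[str, List[Row]] = {"train": [], "validation": [], "test": []}
    for row in rows:
        event_date = row[0]
        if event_date < train_end:
            splits["train"].append(row)
        elif event_date < val_end:
            splits["validation"].append(row)
        else:
            splits["test"].append(row)
    return splits


# Hypothetical data: one example per month of 2024.
rows = [(date(2024, m, 1), f"example-{m}") for m in range(1, 13)]
splits = split_by_time(rows, train_end=date(2024, 9, 1), val_end=date(2024, 11, 1))
sizes = {name: len(part) for name, part in splits.items()}
```

Because each row falls into exactly one window, no example can appear in two sets, and the model is always evaluated on data later than anything it was trained on.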
Dimension 5 — Governance metadata
Governance metadata separates a dataset that passes a quality check from one that passes a regulatory audit. The questions below are shaped by Article 10 of the AI Act and standard GDPR obligations.
Data category classification
- Every field in datasets used for AI is classified: personal data, special category personal data (Article 9 GDPR), commercially sensitive, publicly available, or non-sensitive.
- Classification is stored as metadata on the dataset, not only in documentation.
- A process exists to re-classify fields when sensitivity changes (e.g., a derived field found to be re-identifiable).
Personal-data flags and DPIA hooks
- Personal data fields are flagged in the schema. AI pipelines consume this flag and apply appropriate handling — pseudonymization, exclusion, or legal-basis check.
- For systems involving large-scale personal data processing, systematic profiling, or special category data: a DPIA has been completed and is linked to the dataset record.
- Legal basis for processing personal data in each AI use case is documented (consent, legitimate interest, contractual necessity, legal obligation).
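One way a pipeline could consume classification flags is sketched below. It assumes the classification lives in a field-to-label mapping next to the schema and that keyed hashing is an acceptable pseudonymization method for the use case; the labels, `prepare_for_training`, and the sample record are hypothetical. Keyed hashing is pseudonymization, not anonymization, so the output is still personal data.

```python
import hashlib
import hmac
from typing import Any, Dict

# Hypothetical field-level classification stored alongside the schema.
CLASSIFICATION = {
    "customer_id": "personal",
    "health_note": "special_category",
    "basket_value": "non_sensitive",
}


def pseudonymize(value: str, key: bytes) -> str:
    """Keyed hash (HMAC-SHA256) so the mapping cannot be rebuilt without the key."""
    return hmac.new(key, value.encode("utf-8"), hashlib.sha256).hexdigest()


def prepare_for_training(
    record: Dict[str, Any], classification: Dict[str, str], key: bytes
) -> Dict[str, Any]:
    """Apply per-field handling; refuse records with unclassified fields."""
    out: Dict[str, Any] = {}
    for field, value in record.items():
        label = classification.get(field)
        if label is None:
            raise ValueError(f"unclassified field: {field}")
        if label == "special_category":
            continue  # excluded from training data entirely
        if label == "personal":
            out[field] = pseudonymize(str(value), key)
        else:
            out[field] = value
    return out


record = {"customer_id": "C-1001", "health_note": "asthma", "basket_value": 42.5}
cleaned = prepare_for_training(record, CLASSIFICATION, key=b"demo-key-not-for-production")
```

Rejecting unclassified fields is the design choice that matters here: a new column added upstream stops the pipeline until someone classifies it, instead of flowing into training data by default.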
Retention and right-to-erasure
- Retention schedules are defined for all AI training datasets. Data that has exceeded its retention period is deleted or anonymized.
- A process exists to assess and respond to erasure requests that affect training data — including model retraining where required.
- Logs from AI inference pipelines (including RAG retrieval logs) are subject to the same retention controls as other personal data.
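A retention schedule only helps if something checks it. The sketch below flags datasets past their retention period, assuming each dataset records a creation date and a retention length in days; `expired_datasets` and the dataset names are hypothetical.

```python
from datetime import date, timedelta
from typing import Dict, List, Tuple


def expired_datasets(
    datasets: Dict[str, Tuple[date, int]], today: date
) -> List[str]:
    """Return names of datasets whose retention period ended before `today`."""
    return sorted(
        name
        for name, (created_on, retention_days) in datasets.items()
        if created_on + timedelta(days=retention_days) < today
    )


# Hypothetical inventory: (creation date, retention period in days).
datasets = {
    "support_tickets_2022": (date(2022, 1, 1), 730),
    "rag_retrieval_logs": (date(2025, 1, 1), 90),
    "product_catalog": (date(2024, 6, 1), 3650),
}

overdue = expired_datasets(datasets, today=date(2025, 3, 1))
```

Note that the inference log dataset sits in the same inventory as the training data, which is how the third check above (logs under the same retention controls) gets enforced rather than assumed.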
Dimension 6 — Freshness and pipeline reliability
A model trained on accurate, well-governed data degrades rapidly if the pipeline feeding it becomes unreliable after deployment.
- Freshness SLAs are defined per dataset and per AI use case. A batch analytics workload does not need the same freshness threshold as a real-time recommendation system.
- Pipeline failures are detected automatically. An alert fires within 15 minutes of an outage that would cause the system to operate on stale data.
- The AI system has a defined behavior for stale-data conditions: fallback response, confidence flag, or graceful degradation. This behavior is tested.
- Observability covers volume anomalies, not just failures. A 40% drop in daily record volume triggers review even if the pipeline technically ran.
- Dependency on third-party data providers is documented. Provider SLA commitments are reviewed against the freshness requirements of each AI system.
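The freshness and volume checks above reduce to two small comparisons. This is a minimal sketch, assuming the pipeline records a last-load timestamp and a daily row count, and that the volume baseline is a plain mean of recent days; the function names and numbers are hypothetical.

```python
from datetime import datetime, timedelta
from statistics import mean
from typing import Sequence


def is_stale(last_loaded_at: datetime, now: datetime, sla: timedelta) -> bool:
    """True when the newest load is older than the freshness SLA allows."""
    return now - last_loaded_at > sla


def volume_drop(history: Sequence[int], today_count: int) -> float:
    """Fractional drop of today's volume against the mean of recent days."""
    baseline = mean(history)
    if baseline <= 0:
        raise ValueError("baseline volume must be positive")
    return 1 - today_count / baseline


def needs_review(history: Sequence[int], today_count: int, threshold: float = 0.40) -> bool:
    """Flag runs that 'succeeded' but delivered far fewer records than usual."""
    return volume_drop(history, today_count) >= threshold


now = datetime(2025, 3, 1, 12, 0)
stale = is_stale(datetime(2025, 3, 1, 9, 30), now, sla=timedelta(hours=2))

history = [1000, 1040, 960, 1000]   # hypothetical daily record counts
flagged = needs_review(history, today_count=580)
normal = needs_review(history, today_count=900)
```

A production monitor would likely use a seasonality-aware baseline instead of a flat mean, but the shape of the check is the same: compare against expectation, not against zero.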
Dimension 7 — Vector and unstructured-data readiness
A RAG system with a poorly maintained vector store produces confident-sounding hallucinations. Retrieval quality is as consequential as generation quality.
- The chunking strategy is documented: chunk size, overlap, and rationale relative to the retrieval task.
- Embeddings are versioned and linked to the model version that produced them. A re-embedding process is defined for when the embedding model changes.
- Metadata on each vector record (source, creation date, author, access classification) is complete. Retrieval without this metadata cannot enforce access controls or provide attributable citations.
- Deduplication logic is applied to the corpus, so that near-identical chunks do not crowd out distinct sources in the top retrieval results.
- Corpus coverage is measured: the percentage of expected production queries answerable from the current corpus is benchmarked.
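Chunk size and overlap are easier to document when they are explicit parameters. The sketch below shows fixed-size chunking with overlap over a list of tokens, assuming whitespace-level tokens for simplicity; `chunk_tokens` is a hypothetical helper, not the API of any chunking library.

```python
from typing import List


def chunk_tokens(tokens: List[str], size: int, overlap: int) -> List[List[str]]:
    """Split tokens into windows of `size` that share `overlap` tokens with the previous one."""
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("require size > 0 and 0 <= overlap < size")
    step = size - overlap
    chunks: List[List[str]] = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + size])
        if start + size >= len(tokens):
            break  # the last window already reaches the end of the text
    return chunks


tokens = [f"w{i}" for i in range(10)]
chunks = chunk_tokens(tokens, size=4, overlap=1)
```

With `size=4` and `overlap=1`, each chunk repeats the final token of the previous one, so a sentence that straddles a boundary is still partly visible from both sides. Recording these two numbers, plus the reason for choosing them, is what the first check in this dimension asks for.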
How to score your dataset (1–5 per dimension, with examples)
Score each dimension on a 1–5 scale:
| Score | Meaning |
|---|---|
| 5 | All checks passed, documented, monitored continuously |
| 4 | All checks passed, documented, not yet automated |
| 3 | Most checks passed; minor gaps under active remediation |
| 2 | Material gaps; remediation plan not yet committed |
| 1 | Dimension not assessed or majority of checks failed |
Add the seven dimension scores to get a total between 7 and 35:
- 28–35: Production-ready for low-risk workloads.
- 21–27: Pilot-ready with documented risk acceptance.
- 14–20: Pre-pilot remediation required.
- 7–13: Foundation work needed before any AI project starts.
Dimension 5 (governance) acts as a gate rather than a weight: a score below 4 on Dimension 5 blocks high-risk system deployment regardless of the overall total.
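The score bands and the Dimension 5 rule can be combined into one small function. This is a sketch under the assumption that the overall score is the plain sum of the seven dimension scores; `assess` and the example scores are hypothetical.

```python
from typing import Dict

# (minimum total, band label), checked from highest to lowest.
BANDS = [
    (28, "Production-ready for low-risk workloads"),
    (21, "Pilot-ready with documented risk acceptance"),
    (14, "Pre-pilot remediation required"),
    (7, "Foundation work needed before any AI project starts"),
]


def assess(scores: Dict[int, int]) -> Dict[str, object]:
    """Sum seven 1-5 dimension scores, pick a band, and apply the Dimension 5 gate."""
    if sorted(scores) != [1, 2, 3, 4, 5, 6, 7]:
        raise ValueError("expected scores for dimensions 1 through 7")
    if any(not 1 <= s <= 5 for s in scores.values()):
        raise ValueError("each dimension score must be between 1 and 5")
    total = sum(scores.values())
    band = next(label for floor, label in BANDS if total >= floor)
    return {
        "total": total,
        "band": band,
        "high_risk_blocked": scores[5] < 4,  # governance gate
    }


# Hypothetical team: strong pipelines, weak governance metadata.
result = assess({1: 4, 2: 3, 3: 4, 4: 4, 5: 3, 6: 5, 7: 5})
```

The example totals 28, which reaches the top band, yet `high_risk_blocked` is still true because Dimension 5 scored 3. That is the intended behavior: a strong total does not offset a governance gap.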
The AI readiness checklist applies this scoring across all five readiness pillars — strategy, talent, infrastructure, data, and governance — and produces a prioritized remediation map.
Frequently Asked Questions
What's the difference between AI readiness and data readiness?
AI readiness covers five pillars: strategy, talent, infrastructure, data, and governance. Data readiness is one of them, specifically whether the data foundation can support the AI workloads the organization intends to run. A team can have strong strategy, skilled people, and solid governance and still have data gaps that block a specific project. Run both assessments together, not sequentially.
Do we need a data catalog before we can call ourselves "AI-ready"?
Not necessarily, but the functions a catalog provides (field-level classification, lineage tracking, ownership assignment, discovery) are all prerequisites for serious AI deployment. Whether you implement those functions through a dedicated catalog product, a data warehouse acting as the system of record, or structured documentation depends on scale and complexity. What matters is that the functions exist and are maintained, not the tooling choice. Organizations without any catalog typically score 1–2 on Dimensions 2 and 5.
How does data readiness change for fine-tuning vs. RAG vs. agentic systems?
The seven dimensions apply across all three paradigms, but their weight shifts. For fine-tuning, Dimensions 1 and 4 (quality and labeling) dominate — model performance is determined directly by training data quality and label consistency. For RAG, Dimensions 7 and 3 (vector readiness and access controls) take priority — retrieval quality and access enforcement at the vector layer are the most common failure modes. For agentic systems, Dimension 6 (pipeline reliability) is critical because agents executing multi-step workflows are highly sensitive to freshness failures and schema changes propagating through tool calls.
Is GDPR compliance enough, or does AI add new data-readiness requirements?
GDPR is necessary but not sufficient. The EU AI Act adds Article 10 requirements for training data governance — provenance, bias assessment, relevance documentation — that have no direct GDPR equivalent. For automated decision-making, Article 22 GDPR and Article 14 AI Act apply simultaneously with overlapping but not identical obligations. AI systems can also create re-identification risks from data previously considered anonymized. Dimension 5 of this checklist captures the obligations at the intersection of GDPR and the AI Act.
What's the minimum data-readiness score to greenlight a first AI project?
For low-risk, internal-use workloads a score of 21–27 is sufficient, provided risks on failed checks are explicitly accepted by an accountable owner. For customer-facing deployments or any system near the AI Act's high-risk categories, the bar is 28–35 overall with a minimum of 4 on Dimension 5. Greenlighting a high-risk system with a Dimension 5 score below 4 is not a data-readiness decision — it is a compliance exposure decision that requires legal sign-off.