Why the old maturity models stopped working
The AI/MLOps maturity models that circulated in 2020–2022 were correct for the workloads of their time. They assumed a training-serving split, model registries, feature stores, monitoring for drift, and pipelines that looked like CI/CD for tabular classifiers. For that world, they mostly still hold up at the level of architectural diagrams.
For the world enterprise ML teams actually inhabit in 2026 – a mix of classical tabular ML, LLM-backed workflows, agentic systems, edge inference, and increasing regulatory scrutiny – those models leave too much unsaid. The classical-ML maturity criteria still matter; the new categories of LLM evaluation, agent observability, FinOps cost attribution, regulatory traceability, and human-in-the-loop governance have to be added explicitly rather than treated as future work.
The maturity model that follows is our current read on what separates the teams shipping AI reliably from the teams that keep getting surprised. It is structured as five levels, with a per-axis scoring framework underneath so organisations can identify the specific mismatch between their dimensions rather than score themselves against a single composite number that hides the failure mode.
Level 1 – Ad hoc
Notebooks running in production, models shipped by hand, experiments tracked in a shared drive. One engineer knows how to retrain, and when they go on leave the model stays stale. No evaluation beyond the developer's own spot checks. No versioning of anything. Often no awareness that this is a problem until an incident exposes it.
Operational signs you are at Level 1: nobody can reproduce the current production model from scratch. "The model works fine" is a belief rather than a measurement. Model updates take a week of manual work per iteration. Deployment is a single engineer running scripts from their laptop. Documentation, when it exists, is out of date by the time the next change ships.
In our experience, Level 1 is more common in 2026 than published industry surveys suggest. Visibility is the structural problem: organisations at Level 1 lack the observability to know they are at Level 1, and the audit trail that would demonstrate it to a regulator does not exist.
Level 2 – Reproducible
Models can be rebuilt from versioned code plus versioned data. Experiment tracking is consistent across the team using an established tool. A model registry exists and is the source of truth for "what is in production". Deployments are still mostly manual but predictable, with a documented runbook. Monitoring is limited to infrastructure metrics (latency, error rate, throughput) with little ML-specific observability beyond the platform layer.
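A minimal sketch of what "versioned code plus versioned data" can look like as a registry record. The `ModelRecord` structure, its field names, the commit string, and the in-memory dataset bytes are all hypothetical and not tied to any particular registry tool; the point is only that the record pins every input needed for a rebuild.

```python
import hashlib
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class ModelRecord:
    """Hypothetical registry entry pinning everything needed to rebuild a model."""
    name: str
    code_commit: str         # version-control commit of the training code
    data_sha256: str         # content hash of the exact training data snapshot
    hyperparameters: tuple   # (key, value) pairs, kept as a tuple so the record is immutable


def hash_bytes(payload: bytes) -> str:
    """SHA-256 hex digest of a byte string."""
    return hashlib.sha256(payload).hexdigest()


def record_fingerprint(record: ModelRecord) -> str:
    """Stable identifier: identical inputs always produce the same fingerprint."""
    canonical = json.dumps(asdict(record), sort_keys=True)
    return hash_bytes(canonical.encode("utf-8"))


# Illustrative data snapshots; a real pipeline would hash files or table exports.
snapshot_v1 = b"user_id,label\n1,0\n2,1\n"
snapshot_v2 = b"user_id,label\n1,0\n2,1\n3,1\n"
params = (("learning_rate", 0.1), ("max_depth", 6))

rec_a = ModelRecord("churn-classifier", "abc1234", hash_bytes(snapshot_v1), params)
rec_b = ModelRecord("churn-classifier", "abc1234", hash_bytes(snapshot_v1), params)
rec_c = ModelRecord("churn-classifier", "abc1234", hash_bytes(snapshot_v2), params)

same_build = record_fingerprint(rec_a) == record_fingerprint(rec_b)
data_changed = record_fingerprint(rec_a) != record_fingerprint(rec_c)
```

If two records share a fingerprint, they describe the same build inputs; any change to the data snapshot, code commit, or hyperparameters produces a different one.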
Operational signs you are at Level 2: any ML engineer on the team can answer "what is running in production and how was it built?" in under an hour. Historical experiments can be re-run on demand. Deploys happen on a weekly rather than ad-hoc cadence. New team members can be onboarded in days rather than weeks because the reproducibility discipline produces working documentation as a side effect.
Level 3 – Continuous integration
ML pipelines are CI-tested with the same rigour as application code. Data quality checks run automatically on ingestion. Training and evaluation are automated on a documented cadence rather than triggered manually. A feature store (for tabular workloads) or a clearly-governed embedding and retrieval stack (for LLM workloads) is in place and treated as a first-class platform asset. Canary and shadow deployments are standard rather than exceptional. Basic model monitoring – drift, performance against gold panels, prediction distribution – is wired into the platform observability layer.
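As one illustration of the drift monitoring mentioned above, the sketch below computes a population stability index (PSI) between a reference sample and a live sample of a single numeric feature. PSI is one common choice among several drift statistics, not something this model prescribes. The bin edges, the sample values, and the 0.2 alert threshold (a widely quoted rule of thumb) are assumptions for illustration.

```python
import math
from bisect import bisect_right
from typing import Sequence


def bin_fractions(values: Sequence[float], edges: Sequence[float]) -> list:
    """Fraction of values falling in each bin defined by sorted interior edges."""
    counts = [0] * (len(edges) + 1)
    for v in values:
        counts[bisect_right(edges, v)] += 1
    total = len(values)
    return [c / total for c in counts]


def psi(reference: Sequence[float], live: Sequence[float],
        edges: Sequence[float], eps: float = 1e-6) -> float:
    """Population stability index between a reference and a live sample."""
    ref = bin_fractions(reference, edges)
    cur = bin_fractions(live, edges)
    score = 0.0
    for r, c in zip(ref, cur):
        r, c = max(r, eps), max(c, eps)  # avoid log(0) on empty bins
        score += (c - r) * math.log(c / r)
    return score


# Assumed alert threshold; tune per feature rather than treating it as a constant.
DRIFT_THRESHOLD = 0.2

reference_sample = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8]
shifted_sample = [0.6, 0.7, 0.7, 0.8, 0.8, 0.9, 0.9, 0.9]
edges = [0.25, 0.5, 0.75]

same_score = psi(reference_sample, reference_sample, edges)
shifted_score = psi(reference_sample, shifted_sample, edges)
drift_detected = shifted_score > DRIFT_THRESHOLD
```

An identical distribution scores zero, while the shifted sample, which empties the two lowest bins, scores well above the threshold.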
For LLM-backed systems at Level 3, evaluation suites exist and run on every release with quality gates. System prompts are versioned in source control alongside the application code that consumes them. Tool catalogues are governed with allowlisting and per-tool ownership. Per-request cost and latency telemetry exists and feeds the same observability stack as the rest of the platform, so engineers can see "this query cost three cents and took 800ms" rather than working from aggregate vendor invoices.
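The per-request telemetry described above can be sketched as a small record computed from token counts and timings. The model name, the per-1K-token prices, and the field names are hypothetical; real prices and token accounting depend on the provider.

```python
from dataclasses import dataclass

# Hypothetical per-1K-token prices in USD; substitute the provider's actual rates.
PRICE_PER_1K = {"example-model": {"input": 0.003, "output": 0.015}}


@dataclass(frozen=True)
class RequestTelemetry:
    feature: str          # product feature that issued the request
    model: str
    input_tokens: int
    output_tokens: int
    latency_ms: float
    cost_usd: float


def record_request(feature: str, model: str, input_tokens: int,
                   output_tokens: int, started_s: float,
                   finished_s: float) -> RequestTelemetry:
    """Build one telemetry record from token counts and wall-clock timestamps."""
    price = PRICE_PER_1K[model]
    cost = (input_tokens / 1000) * price["input"] \
        + (output_tokens / 1000) * price["output"]
    latency_ms = (finished_s - started_s) * 1000.0
    return RequestTelemetry(feature, model, input_tokens, output_tokens,
                            latency_ms, cost)


# Example: 5,000 input tokens and 1,000 output tokens, completed in 0.8 seconds.
sample = record_request("summarise", "example-model", 5000, 1000, 0.0, 0.8)
```

With these assumed prices the sample request works out to roughly three cents and 800 ms, which is the granularity the paragraph above asks for. In practice the record would be emitted to the same metrics or tracing backend as the rest of the platform.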
For agent-based systems at Level 3, per-step trajectory metrics (steps taken, tools called, retries, termination correctness) are tracked alongside aggregate quality. The agent is debuggable in production rather than opaque.
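A minimal sketch of the trajectory metrics listed above, assuming each agent run is logged as a sequence of steps. The `Step` and `Trajectory` structures and the metric names are hypothetical; real agent frameworks expose this information in their own trace formats.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Step:
    tool: Optional[str]    # None for a reasoning step that calls no tool
    retried: bool = False  # True if this step was a retry of a failed call


@dataclass
class Trajectory:
    steps: List[Step] = field(default_factory=list)
    terminated_correctly: bool = False


def summarise(trajectories: List[Trajectory]) -> dict:
    """Aggregate per-step metrics across a batch of agent runs."""
    n = len(trajectories)
    total_steps = sum(len(t.steps) for t in trajectories)
    tool_calls = sum(1 for t in trajectories for s in t.steps if s.tool is not None)
    retries = sum(1 for t in trajectories for s in t.steps if s.retried)
    correct = sum(1 for t in trajectories if t.terminated_correctly)
    return {
        "mean_steps": total_steps / n,
        "mean_tool_calls": tool_calls / n,
        "retry_rate": retries / total_steps if total_steps else 0.0,
        "termination_correct_rate": correct / n,
    }


runs = [
    Trajectory([Step("search"), Step("search", retried=True), Step(None)], True),
    Trajectory([Step("calculator")], False),
]
summary = summarise(runs)
```

Tracking these alongside aggregate quality is what makes an individual bad run inspectable: a rising retry rate or step count points at a specific tool or prompt rather than at "the agent".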
Level 4 – Continuous deployment
Automated retraining triggered by quality signals or schedule, with evaluation gates that block deploys on regression. Traffic shifting managed by policy (canary ramp-up on success metrics, automatic rollback on anomaly thresholds). Feature pipelines under producer-consumer contracts. Training-serving skew actively monitored as a first-class metric. GPU resources scheduled through a proper queueing system rather than first-come-first-served, with fair-share, priority, and pre-emption policies that match the organisation's workload mix.
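The gate-then-ramp policy described above can be sketched as two small decision functions. The thresholds, the ramp percentages, and the function names are assumptions chosen for illustration; a production policy would usually consider several metrics and a minimum observation window per step.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class GatePolicy:
    max_quality_drop: float = 0.01                  # tolerated drop vs. baseline (assumed)
    max_error_rate: float = 0.02                    # canary anomaly threshold (assumed)
    ramp_steps: Tuple[int, ...] = (1, 5, 25, 100)   # traffic percentages (assumed)


def evaluation_gate(baseline_score: float, candidate_score: float,
                    policy: GatePolicy) -> bool:
    """Return True if the candidate may proceed to canary."""
    return candidate_score >= baseline_score - policy.max_quality_drop


def next_canary_action(current_pct: int, observed_error_rate: float,
                       policy: GatePolicy) -> Tuple[str, int]:
    """Decide the next traffic step: roll back, ramp up, or hold at full traffic."""
    if observed_error_rate > policy.max_error_rate:
        return ("rollback", 0)
    later = [p for p in policy.ramp_steps if p > current_pct]
    if not later:
        return ("hold", current_pct)
    return ("ramp", later[0])


policy = GatePolicy()
promote = evaluation_gate(0.90, 0.895, policy)   # within tolerance
blocked = evaluation_gate(0.90, 0.85, policy)    # regression, deploy blocked
```

Keeping the thresholds in one policy object makes them reviewable and versionable in the same way as the rest of the pipeline configuration.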
For LLM systems at Level 4, evaluation suites include production-sampled inputs rolled back into the regression set weekly. Model swaps (moving between providers, or replacing a frontier model with a fine-tuned small model) are a planned operational procedure with a documented playbook, not a multi-week project. Shadow-mode evaluation on live traffic catches regressions before users do. Cost attribution per feature is available to product managers, not only to platform engineers, so AI cost decisions can be made at the product level where the revenue signal is.
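Per-feature cost attribution can be as simple as a roll-up over request-level records, assuming each record already carries a feature tag and a cost. The record shape and feature names below are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable


def cost_by_feature(records: Iterable[dict]) -> Dict[str, dict]:
    """Roll request-level cost records up to one row per product feature."""
    totals = defaultdict(lambda: {"requests": 0, "cost_usd": 0.0})
    for r in records:
        row = totals[r["feature"]]
        row["requests"] += 1
        row["cost_usd"] += r["cost_usd"]
    return {
        feature: {**row, "cost_per_request_usd": row["cost_usd"] / row["requests"]}
        for feature, row in totals.items()
    }


# Illustrative request log; in practice this would come from the telemetry store.
request_log = [
    {"feature": "summarise", "cost_usd": 0.03},
    {"feature": "summarise", "cost_usd": 0.05},
    {"feature": "search", "cost_usd": 0.01},
]
report = cost_by_feature(request_log)
```

The hard part is usually not the aggregation but making sure every request is tagged with the feature that issued it at call time.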
For agent systems at Level 4, the trajectory metrics from Level 3 are tied into the deployment gate – degradation in trajectory quality (longer tool chains, increased retry rate, unusual termination patterns) blocks promotion the same way classical-model accuracy regression does.
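A sketch of a trajectory-based promotion gate, assuming baseline and candidate summaries are available as dictionaries. The metric keys and tolerance values are hypothetical and would need calibrating against normal run-to-run variance.

```python
from typing import List, Tuple


def trajectory_gate(baseline: dict, candidate: dict,
                    max_relative_increase: float = 0.20,
                    max_termination_drop: float = 0.02) -> Tuple[bool, List[str]]:
    """Compare candidate trajectory metrics to baseline; return (passed, reasons)."""
    reasons = []
    # Longer tool chains and more retries are treated as degradation.
    for key in ("mean_steps", "retry_rate"):
        limit = baseline[key] * (1.0 + max_relative_increase)
        if candidate[key] > limit:
            reasons.append(
                f"{key} rose from {baseline[key]:.3f} to {candidate[key]:.3f}"
            )
    # Termination correctness may not fall by more than the absolute tolerance.
    floor = baseline["termination_correct_rate"] - max_termination_drop
    if candidate["termination_correct_rate"] < floor:
        reasons.append("termination correctness dropped below tolerance")
    return (not reasons, reasons)


baseline = {"mean_steps": 4.0, "retry_rate": 0.05, "termination_correct_rate": 0.97}
healthy = {"mean_steps": 4.2, "retry_rate": 0.05, "termination_correct_rate": 0.97}
degraded = {"mean_steps": 6.5, "retry_rate": 0.12, "termination_correct_rate": 0.96}

healthy_passed, healthy_reasons = trajectory_gate(baseline, healthy)
degraded_passed, degraded_reasons = trajectory_gate(baseline, degraded)
```

Returning the reasons alongside the pass/fail result gives the engineer who hits a blocked promotion something concrete to investigate.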
Level 5 – Closed-loop and governed
What Level 4 does mechanically, Level 5 does with the governance to survive regulatory scrutiny and the organisational design to keep velocity high through that scrutiny. ML and AI systems have named owners with on-call rotations. Risk assessments are refreshed per material change, not per quarter. Documentation is generated from the pipeline as a side effect of operations, rather than maintained by hand. Retraining decisions are auditable. The organisation can answer a regulator's "show me how you control this system" without a multi-week fire drill.
On the technical side, Level 5 adds: automated red-teaming on a schedule rather than only pre-launch; automated fairness and bias monitoring across protected dimensions with documented thresholds; a clear, documented path for human-in-the-loop override on any high-stakes decision; and a cross-team catalogue where features, embeddings, prompts, models, and agent definitions are governed with lineage, ownership, and change review.
On the regulatory side, Level 5 produces the audit-ready evidence pipeline that EU AI Act Articles 9–15 requirements, NIST AI Risk Management Framework adoption, ISO/IEC 5259 data quality compliance, and APAC personal-data-protection regimes increasingly require. Retrofitting these evidence artefacts onto Level 3–4 systems takes months; building them in at Level 5 is the operational baseline.
Level 5 is rare in 2026. Financial-services and healthcare teams operating under binding supervision are closest. Most other enterprise organisations sit between Level 3 and Level 4, with significant per-axis variation in which sub-areas have advanced and which lag.
The five-axis self-assessment that actually helps
Instead of scoring the organisation against the five levels as a single composite number, score across five axes independently. Most organisations are not uniform – they might be Level 4 on deployment and Level 2 on governance, or Level 3 on classical ML and Level 1 on LLM operations. The mismatch is usually where the next major incident will come from, and the per-axis view surfaces it where the composite score hides it.
- Reproducibility – can you rebuild any production model from versioned code and data, on demand, within an hour? Score the axis on whether the answer is "yes for every model" or "yes for some" or "no".
- Evaluation – do you have a regression suite that fails closed on quality regression, with calibrated LLM-as-judge metrics, a documented failure-mode catalogue, and a production-sample loop that keeps the evaluation distribution calibrated against actual traffic?
- Deployment – are promotions gated on evaluation, with canary ramp-up and automatic rollback fully automated, and a documented model-swap procedure that does not require a multi-week project?
- Observability – do you have ML-specific monitoring (drift, quality, cost per call, latency p95 and p99, per-step agent trajectory metrics) alongside the standard infrastructure stack? Can a product manager see cost-per-AI-feature without engineering help?
- Governance – is there a named owner per system with an on-call rotation, an audit trail across model and prompt changes, and a risk-assessment process that runs per material change rather than per quarter?
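The per-axis view can be sketched in a few lines. The axis names follow the list above; the scoring function, its output keys, and the example scores are assumptions for illustration.

```python
AXES = ("reproducibility", "evaluation", "deployment", "observability", "governance")


def assess(scores: dict) -> dict:
    """Summarise per-axis levels (1-5) and surface the largest mismatch."""
    missing = [a for a in AXES if a not in scores]
    if missing:
        raise ValueError(f"missing axes: {missing}")
    weakest = min(AXES, key=lambda a: scores[a])
    strongest = max(AXES, key=lambda a: scores[a])
    return {
        "weakest_axis": weakest,
        "strongest_axis": strongest,
        "gap": scores[strongest] - scores[weakest],
        # Shown only for contrast: the composite hides the gap.
        "composite": sum(scores[a] for a in AXES) / len(AXES),
    }


example = assess({
    "reproducibility": 3,
    "evaluation": 3,
    "deployment": 4,
    "observability": 3,
    "governance": 2,
})
```

In this example the composite is a comfortable 3.0, while the per-axis view shows a two-level gap between deployment and governance, which is the signal the composite score hides.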
The highest-leverage upgrade per tier
For most organisations, one specific investment dominates the path to the next maturity level. Sequencing matters: skipping ahead to a higher level on one axis without consolidating the foundation produces brittle systems that fail in ways the team is not yet equipped to diagnose.
- Level 1 → 2. Adopt a single experiment-tracking tool and stop accepting notebooks as a deploy artefact. The cultural change is what costs; the technical change is small.
- Level 2 → 3. Build the evaluation regression suite and wire it into CI. Nothing else moves as many downstream metrics – deployment frequency, regression-detection rate, and incident MTTR all improve on the same investment.
- Level 3 → 4. Replace first-come-first-served GPU allocation with a proper queueing system, and set up gated deploys with automated rollback. Without these two, scaling AI traffic produces operational instability that no amount of model engineering compensates for.
- Level 4 → 5. Add the governance layer – named owners, change review, documented risk assessments, audit-ready evidence pipeline. It is the least exciting investment and the most audit-durable, and the only path through regulator scrutiny on a manageable timeline.
The 2026 reality check
Most enterprise organisations we advise in 2026 are at Level 3 on classical ML and Level 1–2 on LLM-backed systems, because the LLM operations stack is newer and the team muscle is still forming. That mismatch is normal and is also the single biggest risk-carrier right now, because LLM workloads are where the reputational and regulatory exposure is highest. A Level-2 LLM operation running on top of a Level-3 platform produces exactly the kind of silent regressions that surface as front-page news incidents.
Closing that specific gap – bringing the LLM operations capability up to the same maturity that classical ML already has – is the single highest-ROI infrastructure investment in 2026 for most mid-sized enterprise AI organisations. The investment items have been outlined across the rest of this site: evaluation suites, system prompting discipline, data contracts, FinOps for AI workloads, GPU scheduling, and the regulatory documentation pipeline. This maturity model is the sequencing guide; each individual area has its own deeper playbook.
The organisations that get to Level 4 across both classical ML and LLM operations in 2026 will operate AI products with materially lower incident rates, faster iteration cycles, and an audit-ready posture for the regulatory environment that is tightening through 2026–2027. The organisations that stay at Level 2 on the LLM side will keep absorbing operational and reputational costs that their Level-4 peers do not.
Operational metrics that distinguish levels in practice
A useful shortcut for self-assessment: track six operational metrics that consistently distinguish higher-maturity from lower-maturity AI operations.
- Deployment frequency. Level 2 ships weekly; Level 3 ships several times per week; Level 4 ships on demand with automated gates. The cadence trend is the most visible single signal of platform maturity.
- Lead time for a model change. From "we want to retrain" to "the change is in production". Level 2 measures this in weeks; Level 4 measures it in hours.
- Change failure rate. Percentage of deployments that produce a measurable quality regression in production. Level 3 sits in the 15–25% range; Level 4 drops below 10% because evaluation gates catch regressions earlier.
- Mean time to recovery. From incident detection to remediation. Level 2 sits in the multi-hour range; Level 4 with automated rollback drops to sub-hour for common regression patterns.
- Cost per AI feature. The ability to attribute spend per product feature rather than aggregate per platform. Higher-maturity organisations expose this metric to product managers; lower-maturity organisations cannot answer the question at all.
- Audit-ready evidence response time. From "regulator asks how this system makes decisions" to "we can show the documentation". Level 2 takes weeks; Level 5 takes hours because the documentation is generated from the pipeline as a side effect.
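The first four metrics above can be derived from a deployment log. The sketch below assumes a hypothetical record shape with request, deploy, detection, and recovery timestamps; the field names and sample data are illustrative only.

```python
from datetime import datetime, timedelta
from statistics import mean
from typing import List


def _hours(delta: timedelta) -> float:
    return delta.total_seconds() / 3600.0


def operational_metrics(deployments: List[dict], window_days: int) -> dict:
    """Compute frequency, lead time, change failure rate, and MTTR for a window."""
    n = len(deployments)
    lead_times = [_hours(d["deployed_at"] - d["requested_at"]) for d in deployments]
    failures = [d for d in deployments if d["regression"]]
    recoveries = [
        _hours(d["recovered_at"] - d["detected_at"])
        for d in failures
        if d.get("recovered_at") is not None
    ]
    return {
        "deploys_per_week": n / (window_days / 7),
        "mean_lead_time_hours": mean(lead_times),
        "change_failure_rate": len(failures) / n,
        "mttr_hours": mean(recoveries) if recoveries else None,
    }


t0 = datetime(2026, 1, 5, 9, 0)
log = [
    {"requested_at": t0, "deployed_at": t0 + timedelta(hours=2), "regression": False},
    {"requested_at": t0 + timedelta(days=3),
     "deployed_at": t0 + timedelta(days=3, hours=4),
     "regression": True,
     "detected_at": t0 + timedelta(days=3, hours=5),
     "recovered_at": t0 + timedelta(days=3, hours=5, minutes=30)},
    {"requested_at": t0 + timedelta(days=7),
     "deployed_at": t0 + timedelta(days=7, hours=6), "regression": False},
    {"requested_at": t0 + timedelta(days=10),
     "deployed_at": t0 + timedelta(days=10, hours=4), "regression": False},
]
metrics = operational_metrics(log, window_days=14)
```

Tracking the trend of these numbers over successive windows is more informative than any single reading, since the levels above are distinguished by order of magnitude rather than by exact values.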
Frequently asked questions
Common questions raised by platform and engineering leaders running this self-assessment:
- How do I score classical ML separately from LLM operations? Run the five-axis assessment twice – once for the classical-ML side, once for the LLM side. Most organisations find a 1–2 level gap between the two, and the gap is the priority work.
- Is Level 5 realistic for non-regulated enterprises? The technical components of Level 5 are achievable; the governance overhead may not be justified for organisations without regulatory exposure. Level 4 with selected Level 5 components (named owners, audit-ready documentation) is the operational sweet spot for most mid-sized enterprises.
- How long does it take to move one level? Roughly 6–9 months from Level 2 to 3, 6–12 months from Level 3 to 4, and 12–18+ months from Level 4 to 5. The transitions are not linear; cultural and organisational work, more than technical work, dominates the timeline.
- Which axis should I prioritise first? Whichever axis is the most likely source of the next production incident. For most organisations in 2026, that is the LLM-operations evaluation axis – which is why "build the evaluation regression suite" is the most common Level 2 → 3 upgrade.
- How does this interact with our DevOps maturity? Strongly. Higher DevOps maturity (CI/CD discipline, observability infrastructure, incident-response cadence) directly lifts MLOps maturity by providing the foundation. Organisations with weak DevOps cannot durably reach high MLOps levels; the foundation work has to happen first.


