Data Contracts Are the New API Contracts – and Most Enterprise Data Teams Are Behind

Schema-on-read silently became the industry's largest source of data incidents. A practical guide to putting contracts between producers and consumers.

24 March 2026 · 11 min read · By the DataX Power team

Why schema-on-read stopped working

Schema-on-read was the right compromise for 2015. It let data teams ingest faster than producers could standardise, collapsed the cost of exploratory analytics, and was the foundation of the data-lake era. By 2020 the downsides were visible. By 2026 they dominate the incident log of every mature data organisation: silent schema drift, subtle field-level semantic changes, and the uncomfortable fact that the most expensive dashboards in the company routinely run on data that nobody has contractually agreed to keep stable.

The shift underway is a return to contracts – but applied at the right layer. Data contracts make the producer-consumer relationship explicit at the interface between operational systems and the data platform. They are not a database schema, not an OpenAPI spec, not a dbt test. They are an agreement that says: these fields, with these types and these semantics, with these update cadences and these null guarantees, will be present until we negotiate a change.

What a data contract actually specifies

A useful data contract has six elements. Teams that specify fewer find themselves re-litigating the contract in every incident; teams that specify more find themselves unable to maintain it.

Schema – field names, types, nullability, and enums. The table-stakes layer most teams start and stop at.
Semantics – what each field means in business terms, how it is computed, what it excludes. The field "revenue" means nothing until you specify whether it is gross, net, or contract-signed.
SLA – update frequency, latency, and freshness guarantees. "Daily" is not specific enough; "available in the warehouse by 07:00 UTC with at most 4-hour lag" is.
Quality thresholds – acceptable null rates, cardinality bounds, distribution checks. A field can be schema-correct and still be quality-broken.
Versioning and deprecation policy – how breaking changes are announced, how long the old schema coexists, who signs off.
Owners – a named producer owner, a named consumer lead, and a support route. Contracts without named humans dissolve inside six months.
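As a rough illustration, the six elements can be captured in a small machine-readable structure. This is a minimal sketch, not a reference to any published spec: the class names, field names, the `orders` dataset, and the example values are all hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class FieldSpec:
    name: str
    dtype: type
    nullable: bool
    meaning: str                          # semantics, in business terms
    allowed: Optional[frozenset] = None   # enum values, if any
    max_null_rate: float = 0.0            # quality threshold


@dataclass(frozen=True)
class DataContract:
    dataset: str
    version: str
    fields: tuple                  # schema, semantics, per-field quality
    available_by_utc: str          # SLA: delivery deadline, "HH:MM"
    max_lag_hours: int             # SLA: freshness bound
    deprecation_notice_days: int   # versioning and deprecation policy
    producer_owner: str            # owners
    consumer_lead: str
    support_route: str

    def missing_elements(self) -> list[str]:
        """Return the names of contract elements that were left blank."""
        problems = []
        if not self.fields:
            problems.append("schema")
        if any(not f.meaning.strip() for f in self.fields):
            problems.append("semantics")
        if any(not 0.0 <= f.max_null_rate <= 1.0 for f in self.fields):
            problems.append("quality")
        if not self.available_by_utc or self.max_lag_hours <= 0:
            problems.append("sla")
        if self.deprecation_notice_days <= 0:
            problems.append("versioning")
        if not (self.producer_owner and self.consumer_lead and self.support_route):
            problems.append("owners")
        return problems


# Hypothetical contract with two gaps: one field has no stated meaning,
# and no consumer lead has been named.
orders = DataContract(
    dataset="orders",
    version="1.0.0",
    fields=(
        FieldSpec("order_id", str, nullable=False, meaning="Unique order identifier"),
        FieldSpec("revenue", float, nullable=True,
                  meaning="Net revenue, excluding tax and refunds",
                  max_null_rate=0.01),
        FieldSpec("status", str, nullable=False, meaning="",
                  allowed=frozenset({"open", "paid", "refunded"})),
    ),
    available_by_utc="07:00",
    max_lag_hours=4,
    deprecation_notice_days=90,
    producer_owner="checkout-team",
    consumer_lead="",
    support_route="#orders-data",
)

print(orders.missing_elements())  # ['semantics', 'owners']
```

A completeness check like `missing_elements` is one way to stop a contract from shipping with only the schema layer filled in.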

Where to put the contract

The common mistake is treating data contracts as documentation. A contract that is not enforced is a wiki page. The enforcement point matters more than the contract format.

The pragmatic pattern we see working in 2026: keep the contract in a lightweight, machine-readable form (YAML or JSON, either registered in a schema registry such as Confluent's or written against an open spec such as the Data Contract Specification or the Open Data Contract Standard) and enforce it at two boundaries. First, at ingestion: data that does not conform is rejected with a clear error, not silently coerced. Second, at publication: the producer's contract is checked against test data in CI before changes ship. Pipelines built this way fail loud and fast, which is exactly what contracts are supposed to do.
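The ingestion boundary can be sketched in a few lines. The example below is an assumption-laden illustration, not any particular tool's API: `ORDERS_SCHEMA`, `ContractViolation`, and `validate_record` are hypothetical names. The point it shows is the behaviour described above: a non-conforming record raises a readable error and is never coerced into shape.

```python
class ContractViolation(ValueError):
    """Raised when a record does not conform to the contract."""


# Hypothetical contract fragment: field name -> (expected type, nullable)
ORDERS_SCHEMA = {
    "order_id": (str, False),
    "quantity": (int, False),
    "coupon_code": (str, True),
}


def validate_record(record: dict, schema: dict) -> dict:
    """Return the record unchanged if it conforms; raise otherwise."""
    errors = []
    for name, (expected, nullable) in schema.items():
        if name not in record:
            errors.append(f"{name}: missing")
            continue
        value = record[name]
        if value is None:
            if not nullable:
                errors.append(f"{name}: null not allowed")
            continue
        # Exact type match: "2" is not turned into 2, and bool is not
        # accepted where int is expected.
        if type(value) is not expected:
            errors.append(
                f"{name}: expected {expected.__name__}, got {type(value).__name__}"
            )
    unexpected = sorted(set(record) - set(schema))
    if unexpected:
        errors.append("unexpected fields: " + ", ".join(unexpected))
    if errors:
        raise ContractViolation("; ".join(errors))
    return record


validate_record({"order_id": "A-1", "quantity": 2, "coupon_code": None}, ORDERS_SCHEMA)

try:
    validate_record({"order_id": "A-2", "quantity": "2"}, ORDERS_SCHEMA)
except ContractViolation as exc:
    print(exc)  # quantity: expected int, got str; coupon_code: missing
```

Collecting every violation before raising, instead of stopping at the first, gives the producer one complete error message per rejected record.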

Contracts and ML feature pipelines

The highest-value contracts for most organisations are the ones covering features consumed by ML models. A feature pipeline that silently loses a field, changes a null encoding, or drifts in distribution is one of the most common causes of production-model regression and one of the hardest to diagnose, because models keep scoring, the metrics move slowly, and nobody gets paged.

The contract discipline for ML features is stricter than for analytics. Distribution tests must be part of the contract, not an optional afterthought. Training-serving skew – the feature computed one way for training and another for serving – is bug-of-the-year material and only catchable if the contract specifies the exact computation in one authoritative place. Modern feature stores (Feast, Tecton, Databricks Feature Store) bake some of this in; teams without a feature store need to build the equivalent discipline inside dbt or their pipeline framework.
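One way to make a distribution test part of the contract is to compare the serving sample against a training reference with a drift score. The sketch below uses the Population Stability Index as a stand-in; the article does not prescribe a metric, and the bin edges, the sample data, and the 0.2 limit (a common rule of thumb) are assumptions chosen for illustration.

```python
import math
from bisect import bisect_right


def bin_shares(values, edges, floor=1e-6):
    """Share of values in each bin; `edges` must be sorted ascending."""
    if not values:
        raise ValueError("cannot compute shares of an empty sample")
    counts = [0] * (len(edges) + 1)
    for v in values:
        counts[bisect_right(edges, v)] += 1
    # Floor empty bins so the logarithm below stays defined
    return [max(c / len(values), floor) for c in counts]


def psi(reference, current, edges):
    """Population Stability Index of `current` against `reference`."""
    ref = bin_shares(reference, edges)
    cur = bin_shares(current, edges)
    return sum((c - r) * math.log(c / r) for r, c in zip(ref, cur))


# Hypothetical contract values for one feature
PSI_LIMIT = 0.2
EDGES = [0.25, 0.5, 0.75]

training = [i / 100 for i in range(100)]               # reference sample
serving_ok = list(training)                            # same distribution
serving_shifted = [0.5 + i / 200 for i in range(100)]  # mass moved upwards

for name, sample in [("ok", serving_ok), ("shifted", serving_shifted)]:
    score = psi(training, sample, EDGES)
    print(name, "PASS" if score <= PSI_LIMIT else "FAIL")
```

Keeping the bin edges and the limit in the contract, not in the monitoring code, means training and serving are checked against the same definition.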

Producer incentives are the hard part

The architectural side of data contracts is straightforward. The organisational side is where most implementations fail. Contracts put costs on producers – agreeing to a schema, handling deprecation windows, maintaining quality checks – and benefits on consumers. Without a counter-incentive, producers resist.

The patterns that consistently work in enterprise environments: treat the contract as part of the data product's SLA, not as a favour to downstream teams; give the contract's quality score a visible place on the producer team's dashboard; and hold a monthly review where breaking changes require a formal deprecation plan before merge. The teams that institutionalise this stop having the annual "why did the dashboards break" incident; the teams that skip it keep having it.

An adoption path that does not boil the ocean

Most organisations do not have the political or engineering capacity to put contracts on every dataset at once. A pragmatic 90-day adoption looks like this.

Pick three datasets. Not one (too small to change the culture), not everything (too big to land). Criteria: high business value, multiple consumers, and a known history of incidents.
Write contracts with the producer and the top two consumers in the room. Not in a doc-review cycle – in a 90-minute working session where disagreements surface immediately.
Enforce at ingestion and at publication. Make the contract checks part of the pipeline deploy; fail the build on drift.
Publish a dashboard: contract compliance per dataset, incident count before and after, consumer satisfaction. Make the case visible.
Expand to the next tier once the first three datasets have run clean for a quarter. Culture change needs proof before scale.
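For the "fail the build on drift" step, a publication-time check can be as small as a schema diff that runs in CI. The following is a sketch under assumptions: the schema representation (field name mapped to a type name and a nullability flag), the `breaking_changes` and `ci_exit_code` helpers, and the rule set for what counts as breaking are all hypothetical and would need to match a team's own deprecation policy.

```python
def breaking_changes(old: dict, new: dict) -> list[str]:
    """List changes in `new` that would break consumers of `old`.

    Both arguments map field name -> (type name, nullable).
    Added fields are treated as non-breaking.
    """
    problems = []
    for name, (old_type, old_nullable) in old.items():
        if name not in new:
            problems.append(f"removed field: {name}")
            continue
        new_type, new_nullable = new[name]
        if new_type != old_type:
            problems.append(f"type change: {name} {old_type} -> {new_type}")
        if new_nullable and not old_nullable:
            problems.append(f"now nullable: {name}")
    return problems


def ci_exit_code(old: dict, new: dict) -> int:
    """Return 1 (fail the build) if any breaking change is found."""
    problems = breaking_changes(old, new)
    for line in problems:
        print("BREAKING:", line)
    return 1 if problems else 0


# Hypothetical published contract and proposed change
PUBLISHED = {
    "order_id": ("string", False),
    "revenue": ("float", False),
    "channel": ("string", True),
}
PROPOSED = {
    "order_id": ("string", False),
    "revenue": ("float", True),   # null guarantee weakened
    "region": ("string", True),   # new field; "channel" was dropped
}

print(ci_exit_code(PUBLISHED, PROPOSED))
# BREAKING: now nullable: revenue
# BREAKING: removed field: channel
# 1
```

In a real pipeline the return value would be passed to `sys.exit`, so a proposed change with a non-empty list cannot merge without an explicit deprecation plan.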

Where this is going

The direction of travel through 2026 is towards a shared contract vocabulary across tools. The Data Contract Specification and Open Data Contract Standard have converged meaningfully over the past year; most modern catalogues and transformation tools (dbt, Atlan, DataHub, Unity Catalog) now consume some variant of these as first-class metadata. That convergence will matter: it turns "data contracts" from a custom per-org artefact into a portable part of the data product definition, the same way OpenAPI turned API specs from a docs problem into an ecosystem.

Organisations that get there first will have a quieter on-call rotation, more predictable ML pipelines, and a dramatically shorter cycle time between "the business logic changed" and "the data platform caught up." The ones that stay on schema-on-read by default will keep paying the cost they have been paying for a decade, only now measured against peers who have stopped.
