The benchmarks were never as clean as we thought
In 2021 Curtis Northcutt and collaborators at MIT released "Pervasive Label Errors in Test Sets Destabilize Machine Learning Benchmarks" – a study that systematically audited the test sets of ten of the most-cited datasets in machine learning, including ImageNet, CIFAR-10/100, MNIST, QuickDraw, AudioSet, IMDB, and Amazon Reviews. They estimated an average label-error rate of 3.4% across these test sets, with ImageNet sitting at roughly 5.8%.
The headline finding was simpler to state: in many cases, the model that was most accurate on the noisy ground truth was not the model that was most accurate on the cleaned ground truth. A handful of label errors at the boundary between two classes can flip the leaderboard. If even the canonical benchmarks have hidden a 1-in-20 label-error rate for years, the realistic prior on a hand-labelled enterprise dataset is rarely better.
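To make the flipping mechanism concrete, here is a toy simulation – entirely synthetic numbers, not drawn from the paper – in which a model whose errors track the label noise outranks a genuinely better model on the noisy benchmark:

```python
# Toy illustration: two models on a 10,000-example test set, 5% bad labels.
import numpy as np

rng = np.random.default_rng(0)
n, noise_rate = 10_000, 0.05

true_labels = rng.integers(0, 2, size=n)           # clean ground truth
flip = rng.random(n) < noise_rate                  # 5% of labels corrupted
noisy_labels = np.where(flip, 1 - true_labels, true_labels)

# Model A: 91% accurate against the clean labels, errors uncorrelated with noise.
pred_a = np.where(rng.random(n) < 0.91, true_labels, 1 - true_labels)

# Model B: 90% base accuracy on clean labels, but its errors track the
# corrupted examples (e.g. it was tuned against the same noisy distribution).
pred_b = np.where(rng.random(n) < 0.90, true_labels, 1 - true_labels)
pred_b[flip] = noisy_labels[flip]                  # B "agrees" with the bad labels

for name, pred in [("A", pred_a), ("B", pred_b)]:
    print(name,
          "noisy acc:", round((pred == noisy_labels).mean(), 3),
          "clean acc:", round((pred == true_labels).mean(), 3))
# On the noisy benchmark B outranks A; on the cleaned labels the order flips.
```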
Andrew Ng's data-centric reframe
The same year, Andrew Ng began arguing publicly for a "data-centric" rather than "model-centric" view of ML engineering. His central claim, repeated across DeepLearning.AI's "The Batch" newsletter and his Stanford talks, is that for many real-world problems, holding the data fixed and iterating on the model yields less improvement than holding the model fixed and iterating on the data. He pointed to industrial defect-detection projects where a 1% absolute accuracy improvement came from cleaning labels rather than from any architectural change.
Stanford's AI Index, published annually by HAI, has since tracked the same trend at industry scale: the highest-performing production systems are the ones whose teams invest disproportionately in data quality, evaluation pipelines, and labelling protocols, not just in larger models. The practical implication for any team budgeting an AI program is that label quality is not a cost line under "data ops" – it is a performance line under "model accuracy".
Where label noise actually comes from
In the engagements we run, the dominant sources of label noise are structural, not annotator carelessness:
- Ambiguous schemas. When two annotators trained on the same guideline disagree 12% of the time on a borderline class, that is a guideline problem, not a labour problem.
- Concept drift. The rules from week one of a project will not survive contact with month six of production traffic. Without a re-labelling cadence, the dataset silently drifts out of alignment with reality.
- Class imbalance. The rarest classes are the ones where mislabels hurt most, and the ones least audited under standard proportional sampling (see the allocation sketch after this list).
- Tooling friction. UIs that make it easy to slip on a hotkey, or that hide adjudication history, manufacture errors that look like annotator failure but are actually interface failure.
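On the class-imbalance point, a short sketch of why audit budgets need explicit per-class allocation – the class counts, budget, and square-root weighting here are all invented for illustration:

```python
# Proportional audit sampling starves rare classes; a sqrt-weighted
# allocation is one common compromise that shifts budget toward the tail.
import numpy as np

class_counts = {"ok": 90_000, "scratch": 8_000, "crack": 1_500, "burn": 500}
audit_budget = 1_000
total = sum(class_counts.values())

# Naive proportional audit: the rarest class gets almost nothing.
proportional = {c: round(audit_budget * n / total) for c, n in class_counts.items()}

# Square-root allocation: boosts the tail without ignoring the head.
weights = {c: np.sqrt(n) for c, n in class_counts.items()}
w_total = sum(weights.values())
sqrt_alloc = {c: round(audit_budget * w / w_total) for c, w in weights.items()}

print("proportional:", proportional)   # "burn" gets ~5 audited examples
print("sqrt-weighted:", sqrt_alloc)    # "burn" gets ~50 audited examples
```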
The metrics that catch label noise early
The MIT team's confident-learning approach, open-sourced as the Cleanlab library, has become a common pattern: train a baseline model, use its out-of-sample predicted probabilities to estimate which labels are most likely wrong, and route those samples back for re-review. It surfaces the same kinds of issues that a careful inter-annotator agreement programme would, much earlier in the pipeline.
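A minimal sketch of that loop, assuming cleanlab 2.x and scikit-learn; the synthetic dataset and injected noise stand in for real features and labels:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict
from cleanlab.filter import find_label_issues

# Stand-in data: synthetic features plus 5% injected label noise,
# so the audit has something to find.
X, y = make_classification(n_samples=2000, n_classes=3,
                           n_informative=6, random_state=0)
rng = np.random.default_rng(0)
flip = rng.random(len(y)) < 0.05
y = np.where(flip, (y + 1) % 3, y)

# Out-of-sample predicted probabilities: each example is scored by a model
# that never saw it during training, which is what confident learning expects.
pred_probs = cross_val_predict(
    LogisticRegression(max_iter=1000), X, y,
    cv=5, method="predict_proba",
)

# Indices of the labels most likely to be wrong, worst first.
suspect_idx = find_label_issues(
    labels=y,
    pred_probs=pred_probs,
    return_indices_ranked_by="self_confidence",
)

# Route the top of that list back into the re-review queue.
for i in suspect_idx[:10]:
    print(f"example {i}: given label {y[i]}, model favours {pred_probs[i].argmax()}")
```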
Inter-annotator agreement metrics – Cohen's and Fleiss' kappa for categorical labels, Krippendorff's alpha for ordinal or interval data, F1 against a gold panel for span and bounding-box tasks – remain the durable instruments. The number that matters in operations is rarely agreement on the easy 80% of examples; it is agreement on the hard 20% where the guidelines are weakest.
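For the metrics themselves, a sketch using scikit-learn's cohen_kappa_score and the krippendorff package – the ratings are invented, and the "hard items" index set is a placeholder for whatever your guideline flags as borderline:

```python
import numpy as np
import krippendorff
from sklearn.metrics import cohen_kappa_score

# Two annotators, same 12 items, categorical labels (invented data).
rater_1 = ["spam", "spam", "ham", "ham", "spam", "ham",
           "ham", "spam", "ham", "ham", "spam", "ham"]
rater_2 = ["spam", "ham",  "ham", "ham", "spam", "ham",
           "spam", "spam", "ham", "ham", "spam", "ham"]
print("Cohen's kappa:", cohen_kappa_score(rater_1, rater_2))

# Three annotators, ordinal severity scores 1-5, np.nan where a rater skipped.
reliability = np.array([
    [1,      2, 3, 3, 2, 1, 4, 1, 2, np.nan],
    [1,      2, 3, 3, 2, 2, 4, 1, 2, 5],
    [np.nan, 3, 3, 3, 2, 2, 4, 1, 1, 5],
])
print("Krippendorff's alpha:",
      krippendorff.alpha(reliability_data=reliability,
                         level_of_measurement="ordinal"))

# The operational number: agreement restricted to the items flagged as hard.
hard = np.array([1, 6, 7])  # placeholder indices for borderline items
print("kappa on hard items:",
      cohen_kappa_score(np.array(rater_1)[hard], np.array(rater_2)[hard]))
```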
What "good enough" looks like in production
Across regulated and safety-critical domains – healthcare, autonomous driving, financial fraud – we typically target 99%+ field-level accuracy on a stratified gold set, with a documented two-pass workflow and clinician or domain-expert sign-off on the decision boundary. For internal-tooling and content-moderation work, the bar is usually 95-97% and IAA-driven, with active learning routing only the uncertain examples to humans.
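The uncertainty routing mentioned above can be as simple as a margin rule; a sketch, where the threshold is a placeholder to be tuned per task:

```python
# Send only low-confidence predictions to human review; auto-accept the rest.
import numpy as np

def route_for_review(pred_probs: np.ndarray,
                     margin_threshold: float = 0.2) -> np.ndarray:
    """Return indices whose top-two class-probability margin is below threshold."""
    sorted_probs = np.sort(pred_probs, axis=1)
    margin = sorted_probs[:, -1] - sorted_probs[:, -2]   # top-1 minus top-2
    return np.where(margin < margin_threshold)[0]

pred_probs = np.array([
    [0.97, 0.02, 0.01],   # confident  -> auto-accept
    [0.45, 0.40, 0.15],   # ambiguous  -> human queue
    [0.34, 0.33, 0.33],   # ambiguous  -> human queue
])
print(route_for_review(pred_probs))   # -> [1 2]
```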
The pattern across both is the same: the dataset is treated as a versioned engineering artefact, not a one-shot deliverable. Each release ships with a labelling guideline, a sampled QA report, IAA scores by class, and a delta against the previous version. That discipline turns "data" into something product, ML, and regulators can all reason about.
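What such a release can carry, sketched as a plain Python record – the field names are illustrative, not a standard schema:

```python
from dataclasses import dataclass

@dataclass
class DatasetRelease:
    version: str                       # e.g. "2024.06.1"
    guideline_ref: str                 # pinned guideline document or commit
    qa_sample_accuracy: float          # accuracy on the sampled QA audit
    iaa_by_class: dict[str, float]     # e.g. Cohen's kappa per class
    delta_vs_previous: dict[str, int]  # labels added / relabelled / removed

release = DatasetRelease(
    version="2024.06.1",
    guideline_ref="guidelines@9f3c2a1",   # invented commit reference
    qa_sample_accuracy=0.992,
    iaa_by_class={"crack": 0.81, "scratch": 0.74, "burn": 0.88},
    delta_vs_previous={"added": 4_210, "relabelled": 312, "removed": 57},
)
```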
Where DataX Power fits
Our annotation practice is built around exactly this loop: domain-trained annotators, IAA-driven QA, confident-learning-style audits before release, and clinician or specialist sign-off where regulation demands it. If you are sizing an annotation budget for the year ahead, the cost we usually see teams underestimate is not the labelling – it is the cost of shipping a dataset whose error profile they cannot defend in front of a model-risk committee or a regulator.