The benchmarks were never as clean as we thought
In 2021 Curtis Northcutt and collaborators at MIT released "Pervasive Label Errors in Test Sets Destabilize Machine Learning Benchmarks" – a study that systematically audited the test sets of ten of the most-cited datasets in machine learning, including ImageNet, CIFAR-10/100, MNIST, QuickDraw, AudioSet, IMDB, and Amazon Reviews. They estimated an average label-error rate of 3.4% across these test sets, with ImageNet sitting at roughly 5.8%.
The headline finding was simpler to state: in many cases, the model that was most accurate on the noisy ground truth was not the model that was most accurate on the cleaned ground truth. A handful of label errors at the boundary between two classes can flip the leaderboard. If even the canonical benchmarks have hidden a 1-in-20 label-error rate for years, the realistic prior on a hand-labelled enterprise dataset is rarely better.
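To make the flipping mechanism concrete, here is a toy simulation – entirely synthetic numbers, not drawn from the paper – in which a model whose errors track the label noise outranks a genuinely better model on the noisy benchmark:

```python
# Toy illustration: two models on a 10,000-example test set, 5% bad labels.
import numpy as np

rng = np.random.default_rng(0)
n, noise_rate = 10_000, 0.05

true_labels = rng.integers(0, 2, size=n)           # clean ground truth
flip = rng.random(n) < noise_rate                  # 5% of labels corrupted
noisy_labels = np.where(flip, 1 - true_labels, true_labels)

# Model A: 91% accurate against the clean labels, errors uncorrelated with noise.
pred_a = np.where(rng.random(n) < 0.91, true_labels, 1 - true_labels)

# Model B: 90% base accuracy on clean labels, but its errors track the
# corrupted examples (e.g. it was tuned against the same noisy distribution).
pred_b = np.where(rng.random(n) < 0.90, true_labels, 1 - true_labels)
pred_b[flip] = noisy_labels[flip]                  # B "agrees" with the bad labels

for name, pred in [("A", pred_a), ("B", pred_b)]:
    print(name,
          "noisy acc:", round((pred == noisy_labels).mean(), 3),
          "clean acc:", round((pred == true_labels).mean(), 3))
# On the noisy benchmark B outranks A; on the cleaned labels the order flips.
```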
Andrew Ng's data-centric reframe
The same year, Andrew Ng began arguing publicly for a "data-centric" rather than "model-centric" view of ML engineering. His central claim, repeated across DeepLearning.AI's "The Batch" newsletter and his Stanford talks, is that for many real-world problems, holding the data fixed and iterating on the model yields less improvement than holding the model fixed and iterating on the data. He pointed to industrial defect-detection projects where a 1% absolute accuracy improvement came from cleaning labels rather than from any architectural change.
Stanford's AI Index, published annually by HAI, has since tracked the same trend at industry scale: the highest-performing production systems are the ones whose teams invest disproportionately in data quality, evaluation pipelines, and labelling protocols, not just in larger models. The practical implication for any team budgeting an AI program is that label quality is not a cost line under "data ops" – it is a performance line under "model accuracy".
Where label noise actually comes from
In the engagements we run, the dominant sources of label noise are structural, not annotator carelessness:
- Ambiguous schemas. When two annotators trained on the same guideline disagree 12% of the time on a borderline class, that is a guideline problem, not a labour problem.
- Concept drift. The rules from week one of a project will not survive contact with month six of production traffic. Without a re-labelling cadence, the dataset silently drifts out of alignment with reality.
- Class imbalance. The rarest classes are the ones where mislabels hurt most, and the ones least audited under standard proportional sampling (see the allocation sketch after this list).
- Tooling friction. UIs that make it easy to slip on a hotkey, or that hide adjudication history, manufacture errors that look like annotator failure but are actually interface failure.
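On the class-imbalance point, a short sketch of why audit budgets need explicit per-class allocation – the class counts, budget, and square-root weighting here are all invented for illustration:

```python
# Proportional audit sampling starves rare classes; a sqrt-weighted
# allocation is one common compromise that shifts budget toward the tail.
import numpy as np

class_counts = {"ok": 90_000, "scratch": 8_000, "crack": 1_500, "burn": 500}
audit_budget = 1_000
total = sum(class_counts.values())

# Naive proportional audit: the rarest class gets almost nothing.
proportional = {c: round(audit_budget * n / total) for c, n in class_counts.items()}

# Square-root allocation: boosts the tail without ignoring the head.
weights = {c: np.sqrt(n) for c, n in class_counts.items()}
w_total = sum(weights.values())
sqrt_alloc = {c: round(audit_budget * w / w_total) for c, w in weights.items()}

print("proportional:", proportional)   # "burn" gets ~5 audited examples
print("sqrt-weighted:", sqrt_alloc)    # "burn" gets ~50 audited examples
```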
The metrics that catch label noise early
The MIT team's confident-learning approach, open-sourced as the Cleanlab library, has become a common pattern: train a baseline model, use its out-of-sample predicted probabilities to estimate which labels are most likely wrong, and route those samples back for re-review. It surfaces the same kinds of issues that a careful inter-annotator agreement programme would, much earlier in the pipeline.
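A minimal sketch of that loop, assuming cleanlab 2.x and scikit-learn; the synthetic dataset and injected noise stand in for real features and labels:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict
from cleanlab.filter import find_label_issues

# Stand-in data: synthetic features plus 5% injected label noise,
# so the audit has something to find.
X, y = make_classification(n_samples=2000, n_classes=3,
                           n_informative=6, random_state=0)
rng = np.random.default_rng(0)
flip = rng.random(len(y)) < 0.05
y = np.where(flip, (y + 1) % 3, y)

# Out-of-sample predicted probabilities: each example is scored by a model
# that never saw it during training, which is what confident learning expects.
pred_probs = cross_val_predict(
    LogisticRegression(max_iter=1000), X, y,
    cv=5, method="predict_proba",
)

# Indices of the labels most likely to be wrong, worst first.
suspect_idx = find_label_issues(
    labels=y,
    pred_probs=pred_probs,
    return_indices_ranked_by="self_confidence",
)

# Route the top of that list back into the re-review queue.
for i in suspect_idx[:10]:
    print(f"example {i}: given label {y[i]}, model favours {pred_probs[i].argmax()}")
```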
Inter-annotator agreement metrics – Cohen's and Fleiss' kappa for categorical labels, Krippendorff's alpha for ordinal or interval data, F1 against a gold panel for span and bounding-box tasks – remain the durable instruments. The number that matters in operations is rarely agreement on the easy 80% of examples; it is agreement on the hard 20% where the guidelines are weakest.
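For the metrics themselves, a sketch using scikit-learn's cohen_kappa_score and the krippendorff package – the ratings are invented, and the "hard items" index set is a placeholder for whatever your guideline flags as borderline:

```python
import numpy as np
import krippendorff
from sklearn.metrics import cohen_kappa_score

# Two annotators, same 12 items, categorical labels (invented data).
rater_1 = ["spam", "spam", "ham", "ham", "spam", "ham",
           "ham", "spam", "ham", "ham", "spam", "ham"]
rater_2 = ["spam", "ham",  "ham", "ham", "spam", "ham",
           "spam", "spam", "ham", "ham", "spam", "ham"]
print("Cohen's kappa:", cohen_kappa_score(rater_1, rater_2))

# Three annotators, ordinal severity scores 1-5, np.nan where a rater skipped.
reliability = np.array([
    [1,      2, 3, 3, 2, 1, 4, 1, 2, np.nan],
    [1,      2, 3, 3, 2, 2, 4, 1, 2, 5],
    [np.nan, 3, 3, 3, 2, 2, 4, 1, 1, 5],
])
print("Krippendorff's alpha:",
      krippendorff.alpha(reliability_data=reliability,
                         level_of_measurement="ordinal"))

# The operational number: agreement restricted to the items flagged as hard.
hard = np.array([1, 6, 7])  # placeholder indices for borderline items
print("kappa on hard items:",
      cohen_kappa_score(np.array(rater_1)[hard], np.array(rater_2)[hard]))
```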
What "good enough" looks like in production
Across regulated and safety-critical domains – healthcare, autonomous driving, financial fraud – we typically target 99%+ field-level accuracy on a stratified gold set, with a documented two-pass workflow and clinician or domain-expert sign-off on the decision boundary. For internal-tooling and content-moderation work, the bar is usually 95-97% and IAA-driven, with active learning routing only the uncertain examples to humans.
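The uncertainty routing mentioned above can be as simple as a margin rule; a sketch, where the threshold is a placeholder to be tuned per task:

```python
# Send only low-confidence predictions to human review; auto-accept the rest.
import numpy as np

def route_for_review(pred_probs: np.ndarray,
                     margin_threshold: float = 0.2) -> np.ndarray:
    """Return indices whose top-two class-probability margin is below threshold."""
    sorted_probs = np.sort(pred_probs, axis=1)
    margin = sorted_probs[:, -1] - sorted_probs[:, -2]   # top-1 minus top-2
    return np.where(margin < margin_threshold)[0]

pred_probs = np.array([
    [0.97, 0.02, 0.01],   # confident  -> auto-accept
    [0.45, 0.40, 0.15],   # ambiguous  -> human queue
    [0.34, 0.33, 0.33],   # ambiguous  -> human queue
])
print(route_for_review(pred_probs))   # -> [1 2]
```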
The pattern across both is the same: the dataset is treated as a versioned engineering artefact, not a one-shot deliverable. Each release ships with a labelling guideline, a sampled QA report, IAA scores by class, and a delta against the previous version. That discipline turns "data" into something product, ML, and regulators can all reason about.
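What such a release can carry, sketched as a plain Python record – the field names are illustrative, not a standard schema:

```python
from dataclasses import dataclass

@dataclass
class DatasetRelease:
    version: str                       # e.g. "2024.06.1"
    guideline_ref: str                 # pinned guideline document or commit
    qa_sample_accuracy: float          # accuracy on the sampled QA audit
    iaa_by_class: dict[str, float]     # e.g. Cohen's kappa per class
    delta_vs_previous: dict[str, int]  # labels added / relabelled / removed

release = DatasetRelease(
    version="2024.06.1",
    guideline_ref="guidelines@9f3c2a1",   # invented commit reference
    qa_sample_accuracy=0.992,
    iaa_by_class={"crack": 0.81, "scratch": 0.74, "burn": 0.88},
    delta_vs_previous={"added": 4_210, "relabelled": 312, "removed": 57},
)
```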
Where DataX Power fits
Our annotation practice is built around exactly this loop: domain-trained annotators, IAA-driven QA, confident-learning-style audits before release, and clinician or specialist sign-off where regulation demands it. If you are sizing an annotation budget for the year ahead, the cost we usually see teams underestimate is not the labelling – it is the cost of shipping a dataset whose error profile they cannot defend in front of a model-risk committee or a regulator.