Inter-Annotator Agreement: The Metric That Should Govern Your Labelling Budget

IAA is the lever that distinguishes "labels we can train on" from "labels we are guessing about". Here is how to pick the right statistic, what counts as good, and how to operationalise it.

9 min read · By the DataX Power team

Why agreement is the metric, not accuracy

Accuracy presumes a known ground truth. In most enterprise annotation work, ground truth is exactly what you are trying to manufacture. Inter-annotator agreement (IAA) is the closest honest substitute: a measure of whether two or more independent reviewers, given the same example and the same guideline, would assign it the same label.

The reason it matters is structural. A model trained and evaluated on a dataset where reviewers disagree 20% of the time inherits that disagreement as a ceiling – no architectural change can lift measured performance above the noise floor of the labels themselves. The most cost-effective AI investment in many organisations is not a new model; it is dragging IAA on the ambiguous classes from 0.65 to 0.85.

Pick the right statistic for the task

IAA is a family of statistics, not a single metric. Picking the wrong one looks fine on a dashboard and fails in audit:

  • Cohen's kappa – two annotators, categorical labels. Corrects for chance agreement. Standard reading (Landis and Koch, 1977): below 0.40 slight-to-fair, 0.41-0.60 moderate, 0.61-0.80 substantial, 0.81-1.00 almost perfect. A computation sketch for kappa and alpha follows this list.
  • Fleiss' kappa – more than two annotators, fixed set of categories. Used widely in radiology and pathology where panels of three or more clinicians label the same study.
  • Krippendorff's alpha – the most flexible of the family. Handles missing data, ordinal and interval scales, any number of annotators. Klaus Krippendorff's "Content Analysis: An Introduction to Its Methodology" remains the canonical reference. Hugging Face and DeepLearning.AI both recommend alpha as the default for mixed annotation programmes.
  • F1 / IoU against a gold set – for span, segmentation, and bounding-box tasks where chance agreement is hard to define. Pair with a stratified gold panel reviewed by senior annotators.
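
As a rough sketch of how these numbers are produced in practice, the snippet below computes Cohen's kappa with scikit-learn and Krippendorff's alpha with the open-source krippendorff package; the toy labels, annotator count, and missing judgement are invented for illustration, not drawn from any real batch.

    # Sketch: Cohen's kappa for a two-annotator batch and Krippendorff's
    # alpha for a three-annotator batch with one missing judgement.
    # Requires scikit-learn and the third-party `krippendorff` package.
    import numpy as np
    from sklearn.metrics import cohen_kappa_score
    import krippendorff

    # Two annotators, categorical labels -> Cohen's kappa.
    annotator_a = ["spam", "spam", "ham", "ham", "spam", "ham"]
    annotator_b = ["spam", "ham",  "ham", "ham", "spam", "ham"]
    print("Cohen's kappa:", cohen_kappa_score(annotator_a, annotator_b))

    # Three annotators, one skipped item -> Krippendorff's alpha.
    # reliability_data is shaped (annotators, items); np.nan marks missing.
    reliability_data = np.array([
        [0, 1, 1,      0, 2, 1],
        [0, 1, 1,      0, 2, 0],
        [0, 1, np.nan, 0, 2, 1],
    ])
    print("Krippendorff's alpha:",
          krippendorff.alpha(reliability_data=reliability_data,
                             level_of_measurement="nominal"))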

What "good" actually means

A 0.85 alpha on a balanced classification task can hide a 0.50 alpha on the rarest and most consequential class. The number to publish in your QA report is not a single headline IAA – it is IAA per class, plus disagreement clusters that tell you which guideline rules need attention.
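
One common way to get per-class numbers – a sketch, not the only option – is to binarise each class one-vs-rest and compute a chance-corrected agreement for it separately; the class names and labels below are made up for illustration.

    # Sketch: per-class agreement via one-vs-rest Cohen's kappa.
    from sklearn.metrics import cohen_kappa_score

    annotator_a = ["benign", "benign", "urgent", "benign", "urgent", "review"]
    annotator_b = ["benign", "review", "urgent", "benign", "review", "review"]

    classes = sorted(set(annotator_a) | set(annotator_b))
    for cls in classes:
        # Collapse the task to "this class vs everything else" per class.
        a_bin = [int(lbl == cls) for lbl in annotator_a]
        b_bin = [int(lbl == cls) for lbl in annotator_b]
        print(f"{cls:>8}: one-vs-rest kappa = {cohen_kappa_score(a_bin, b_bin):.2f}")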

In regulated domains the bar is higher and the metric expands. The BraTS brain-tumor segmentation challenge, run annually since 2012, requires multiple expert raters per case and reports Dice scores against an aggregated reference, with explicit treatment of inter-rater variability. Most clinical AI submissions to the FDA and EMA now include some form of inter-rater agreement evidence as part of the data-quality narrative.
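
For orientation, the Dice score between two binary masks is 2|A ∩ B| / (|A| + |B|); a minimal sketch on made-up miniature masks rather than real imaging data:

    # Sketch: Dice overlap between two raters' binary segmentation masks.
    import numpy as np

    def dice(mask_a: np.ndarray, mask_b: np.ndarray) -> float:
        """2 * |A ∩ B| / (|A| + |B|); an empty-empty pair scores 1.0."""
        a, b = mask_a.astype(bool), mask_b.astype(bool)
        denom = a.sum() + b.sum()
        return 1.0 if denom == 0 else 2.0 * np.logical_and(a, b).sum() / denom

    rater_1 = np.array([[0, 1, 1, 0],
                        [0, 1, 1, 0]])
    rater_2 = np.array([[0, 1, 0, 0],
                        [0, 1, 1, 1]])
    print(f"Dice between raters: {dice(rater_1, rater_2):.2f}")  # 0.75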

Operationalising IAA without breaking the budget

Measuring agreement on every example is expensive. The pattern that works in production:

  • Run a 10-15% IAA sample on every batch, stratified by class so rare categories are not under-sampled (a sampling sketch follows this list).
  • Treat disagreement as a guideline signal first, an annotator signal second. If two trained reviewers disagree on the same example three times in a week, the example is the bug.
  • Maintain a versioned gold panel of 200-1,000 adjudicated examples. Run new annotators against it at onboarding and at regular intervals thereafter; report drift over time.
  • Surface IAA in the model evaluation pipeline. If the model and the panel disagree, note whether the panel itself disagreed – that prevents you from treating noisy ground truth as a model regression.
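
A stratified draw need not be elaborate. The sketch below samples a fixed fraction per class with a per-class floor so rare categories always receive a second annotation; the 12% fraction and floor of five are illustrative values within the 10-15% pattern above, not prescriptions.

    # Sketch: draw a stratified IAA sample from a labelled batch so that
    # rare classes are not under-sampled. Fraction and floor are illustrative.
    import random
    from collections import defaultdict

    def stratified_iaa_sample(batch, fraction=0.12, floor=5, seed=0):
        """batch: list of (example_id, label). Returns ids to double-label."""
        rng = random.Random(seed)
        by_class = defaultdict(list)
        for example_id, label in batch:
            by_class[label].append(example_id)
        sample = []
        for label, ids in by_class.items():
            # At least `floor` items per class, capped at the class size.
            k = min(len(ids), max(floor, round(len(ids) * fraction)))
            sample.extend(rng.sample(ids, k))
        return sample

    # Toy batch: 990 "routine" items, 10 "critical" items.
    batch = [(f"ex-{i}", "routine") for i in range(990)] + \
            [(f"ex-{i}", "critical") for i in range(990, 1000)]
    print(len(stratified_iaa_sample(batch)), "items sent for second annotation")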

The hidden cost of skipping it

The most expensive failure mode we see is not low IAA. It is no IAA. Teams ship a dataset with no per-class agreement number, train a model on it, find that the model under-performs in production, and re-run the entire labelling cycle to fix it. In domains under regulator scrutiny, the cost compounds: model-risk reviewers, EMA inspectors, and SOC 2 auditors all increasingly ask for evidence that the labels themselves were measured.

The NIST AI Risk Management Framework treats data quality and traceability as first-class controls. ISO/IEC 5259 (the data-quality standard for AI) goes further and explicitly enumerates inter-annotator agreement as a measurable property of a dataset. Programmes set up to comply with either are programmes that will not have to retrofit IAA into a labelling pipeline at the worst possible moment.

Where DataX Power fits

Every annotation engagement we run ships with per-class IAA, a versioned gold panel, and a disagreement-cluster report alongside the labels. For clients heading into FDA, MDR, or SOC 2 review, that report becomes part of the submission. If your team is currently running labelling without an IAA programme, that is the first thing we usually fix.
