Inter-Annotator Agreement in 2026: Cohen's Kappa, Krippendorff's Alpha & When to Use Each

IAA is the lever that distinguishes "labels we can train on" from "labels we are guessing about". This guide details how to pick the right statistic, what counts as a defensible score, how to operationalise the measurement without breaking the budget, and how to read a per-class agreement report the way a model-risk reviewer would.

Why agreement is the metric, not accuracy

Accuracy presumes a known ground truth. In most enterprise annotation work, ground truth is exactly what the team is trying to manufacture. Inter-annotator agreement (IAA) is the closest honest substitute: a measure of whether two or more independent reviewers, given the same example and the same guideline, assign it the same label.

The reason it matters is structural. A model trained and evaluated on a dataset where reviewers disagree 20% of the time has a performance ceiling of roughly 80% baked in – no architectural change can lift measured accuracy above what the label noise allows. The most cost-effective AI investment in many organisations is not a new model; it is dragging IAA on the ambiguous classes from 0.65 to 0.85. The IAA route is roughly a guideline revision and a gold-panel rebuild. The new-model route is six months of architecture work that may or may not produce a comparable lift.

IAA also acts as the operational early-warning signal during long-running annotation programmes. When agreement on a specific class drops between batch N and batch N+1, the team can investigate before the noisy labels propagate into the model. Without the IAA report, the same drop is invisible until model evaluation 4–8 weeks later, by which point thousands of mislabelled examples are already in the training set.

Pick the right statistic for the task

IAA is a family of statistics, not a single metric. Picking wrong looks fine on a dashboard and fails in audit. The four metrics that cover almost every enterprise annotation pattern:

  • Cohen's kappa – two annotators, categorical labels. Corrects for chance agreement. Standard reading from Landis and Koch (1977): below 0 poor, 0.00–0.20 slight, 0.21–0.40 fair, 0.41–0.60 moderate, 0.61–0.80 substantial, 0.81–1.00 almost perfect. The default for any small pairwise programme.
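  • Fleiss' kappa – three or more annotators, fixed set of categories, the same number of ratings on every item. Used widely in radiology and pathology where panels of three or more clinicians label the same study. Keeps the chance-correction property for panels; strictly, it generalises Scott's pi rather than Cohen's kappa, because chance agreement is computed from the pooled label distribution rather than from each annotator's own marginals.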
  • Krippendorff's alpha – the most flexible of the family. Handles missing data, ordinal and interval scales, any number of annotators. Klaus Krippendorff's "Content Analysis: An Introduction to Its Methodology" remains the canonical reference. The default for mixed annotation programmes where the schema spans both categorical and ordinal label types.
  • F1 or IoU against a gold panel – for span tagging, segmentation, and bounding-box tasks where chance agreement is hard to define. Pair with a stratified gold panel of 200–1,000 adjudicated examples reviewed by senior annotators. The metric that production CV and document-extraction teams actually report.
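
For the panel case in the list above, the calculation is short enough to write out. The following is a minimal sketch of Fleiss' kappa in plain Python, not a reference implementation; the function name and the items-by-categories input layout are assumptions made for illustration.

```python
from typing import Sequence


def fleiss_kappa(counts: Sequence[Sequence[int]]) -> float:
    """Fleiss' kappa from an items x categories table of rating counts.

    counts[i][j] is the number of annotators who gave item i category j.
    Every item must carry the same number of ratings (at least two).
    """
    n_items = len(counts)
    n_raters = sum(counts[0])
    if n_raters < 2 or any(sum(row) != n_raters for row in counts):
        raise ValueError("each item needs the same number of ratings (>= 2)")

    # Observed agreement: mean over items of the share of agreeing rater pairs.
    p_obs = sum(
        (sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in counts
    ) / n_items

    # Chance agreement from the pooled category proportions.
    n_categories = len(counts[0])
    total = n_items * n_raters
    p_cat = [sum(row[j] for row in counts) / total for j in range(n_categories)]
    p_exp = sum(p * p for p in p_cat)

    if p_exp == 1.0:
        raise ValueError("kappa is undefined when only one category is used")
    return (p_obs - p_exp) / (1.0 - p_exp)


# Illustrative data: four studies, three raters each, two categories.
panel = [[3, 0], [0, 3], [2, 1], [1, 2]]
print(f"Fleiss' kappa = {fleiss_kappa(panel):.3f}")
```

When items carry different numbers of ratings, or some ratings are missing, this formula no longer applies and Krippendorff's alpha is the better fit.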

A worked numerical example

The arithmetic clarifies why chance-corrected kappa is materially different from raw agreement. Imagine two annotators each labelling 1,000 examples on a binary "fraud / not fraud" classification.

They agree on 900 of the 1,000 examples. Raw agreement is 90% – a number that sounds healthy on a slide. Cohen's kappa, however, requires correcting for the agreement that would happen by chance given the marginal class frequencies. If each annotator labels 95% of the examples "not fraud", the expected chance agreement is 0.95² + 0.05² = 0.905. The kappa is therefore (0.90 − 0.905) / (1 − 0.905) ≈ −0.05 – slightly worse than chance.
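
The same arithmetic can be checked in a few lines of Python. This is a sketch under one assumption: each annotator labels exactly 950 of the 1,000 examples "not fraud". With 900 agreements, that fixes the 2x2 table used below (the two annotators never agree on a fraud case). The function name is illustrative.

```python
def cohen_kappa_from_table(table):
    """Cohen's kappa from a square confusion table.

    table[i][j] counts the examples annotator A labelled i and annotator B labelled j.
    Returns (observed agreement, chance agreement, kappa).
    """
    total = sum(sum(row) for row in table)
    k = len(table)
    p_obs = sum(table[i][i] for i in range(k)) / total
    row_marg = [sum(table[i]) / total for i in range(k)]
    col_marg = [sum(table[i][j] for i in range(k)) / total for j in range(k)]
    p_exp = sum(r * c for r, c in zip(row_marg, col_marg))
    return p_obs, p_exp, (p_obs - p_exp) / (1 - p_exp)


# Rows: annotator A, columns: annotator B; class order is [not fraud, fraud].
table = [
    [900, 50],
    [50, 0],
]
p_obs, p_exp, kappa = cohen_kappa_from_table(table)
print(f"raw agreement = {p_obs:.3f}, chance agreement = {p_exp:.3f}, kappa = {kappa:.3f}")
# prints: raw agreement = 0.900, chance agreement = 0.905, kappa = -0.053
```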

The slide showing 90% raw agreement and the slide showing kappa ≈ −0.05 describe the same dataset. The first hides the failure; the second reveals it. The vendor that reports raw agreement on heavily imbalanced classification work is either inexperienced or hoping the reviewer will not catch it. In either case, the right question is "what is the kappa, broken down by class?"

What "good" actually means

A headline alpha of 0.85 on a classification task can hide a 0.50 alpha on the rarest and most consequential class. The number worth publishing in any QA report is not a single headline IAA – it is IAA per class, plus a disagreement-cluster report identifying the specific cases or class boundaries where reviewers disagreed most often.
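
One common way to get a per-class number is to collapse the labels to "this class" versus "everything else" and compute kappa on each binary view. The sketch below does that for Cohen's kappa; the class names and counts are invented for illustration, and the one-vs-rest collapse is one reasonable choice rather than the only one.

```python
def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two equally long lists of categorical labels."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_a)
    p_obs = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    categories = set(labels_a) | set(labels_b)
    p_exp = sum((labels_a.count(c) / n) * (labels_b.count(c) / n) for c in categories)
    if p_exp == 1.0:
        return float("nan")  # undefined: both annotators used a single label
    return (p_obs - p_exp) / (1.0 - p_exp)


def per_class_kappa(labels_a, labels_b):
    """One-vs-rest kappa per class: collapse labels to 'is c' vs 'is not c'."""
    result = {}
    for c in sorted(set(labels_a) | set(labels_b)):
        result[c] = cohen_kappa([x == c for x in labels_a], [x == c for x in labels_b])
    return result


# Hypothetical batch of 100 double-annotated documents: (annotator A, annotator B).
pairs = (
    [("invoice", "invoice")] * 48
    + [("receipt", "receipt")] * 45
    + [("receipt", "dispute")] * 3
    + [("dispute", "dispute")] * 1
    + [("dispute", "invoice")] * 3
)
labels_a = [a for a, _ in pairs]
labels_b = [b for _, b in pairs]

headline = cohen_kappa(labels_a, labels_b)
by_class = per_class_kappa(labels_a, labels_b)
print(f"headline kappa = {headline:.2f}")
for name, value in by_class.items():
    print(f"  {name}: {value:.2f}")
```

On this toy batch the headline kappa is about 0.89 while the rare "dispute" class sits near 0.22, which is the kind of gap a single headline figure conceals.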

For general enterprise classification work, target κ > 0.80 on the headline metric and κ > 0.75 on the hardest individual class. For regulated and safety-critical domains (medical imaging, autonomous-driving perception, financial fraud detection), the bar is higher and the metric expands. The BraTS brain-tumour segmentation challenge, run annually since 2012, requires multiple expert raters per case and reports Dice scores against an aggregated reference, with explicit treatment of inter-rater variability. Most clinical AI submissions to the FDA and EMA now include some form of inter-rater agreement evidence as part of the data-quality narrative.

For span tagging and bounding-box tasks, F1 against a gold panel is the right metric, and the bar is task-dependent. Document-extraction work with structured KV pairs typically targets F1 ≥ 0.90 with per-field reporting. Object detection on automotive perception datasets targets mAP at IoU ≥ 0.5 with per-class precision and recall, plus a documented bound on the residual error rate by class.
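
For the bounding-box case, a minimal sketch of F1 against a gold panel at an IoU threshold follows. The greedy one-to-one matching is a simplifying assumption (production evaluators often use optimal or confidence-ordered matching), and the function names and box format are illustrative.

```python
def box_iou(box_a, box_b):
    """IoU of two axis-aligned boxes given as (x_min, y_min, x_max, y_max)."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0


def f1_against_gold(annotated, gold, iou_threshold=0.5):
    """Greedy one-to-one matching of annotated boxes to gold boxes, then F1."""
    if not annotated and not gold:
        return 1.0  # nothing to find and nothing drawn
    unmatched_gold = list(gold)
    true_pos = 0
    for box in annotated:
        best = max(unmatched_gold, key=lambda g: box_iou(box, g), default=None)
        if best is not None and box_iou(box, best) >= iou_threshold:
            unmatched_gold.remove(best)  # each gold box can be matched once
            true_pos += 1
    precision = true_pos / len(annotated) if annotated else 0.0
    recall = true_pos / len(gold) if gold else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Illustrative image: one box close to gold, one box far from any gold box.
gold_boxes = [(0, 0, 10, 10), (20, 20, 30, 30)]
annotated_boxes = [(1, 1, 11, 11), (50, 50, 60, 60)]
print(f"F1 at IoU >= 0.5: {f1_against_gold(annotated_boxes, gold_boxes):.2f}")
```

Running the same function per class or per field gives the per-class reporting described above.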

Common mis-uses of IAA

IAA is a precise instrument that produces misleading numbers when used incorrectly. The mis-uses we see most often in production programmes:

  • Reporting only the headline metric. A single dataset-level kappa or alpha number is roughly the same level of information as a single dataset-level accuracy number. Per-class reporting is what surfaces the actual quality problems and is the artefact a regulator or model-risk reviewer will ask for.
  • Measuring on the wrong sample. IAA computed on a random 5% of every batch will under-represent the rarest classes and inflate the headline number. Stratified sampling against the gold panel is the structural fix – over-sample the rare and hard classes proportional to their importance, not their frequency.
  • Comparing kappa across heterogeneous classes. Kappa values are not directly comparable across class definitions. A kappa of 0.78 on a binary class and a kappa of 0.78 on a 14-way class describe materially different reliability levels, because the chance-correction baseline is different in each case.
  • Ignoring the marginal distribution. Highly imbalanced datasets produce kappa values that are sensitive to small changes in the marginal frequencies. The kappa from the previous batch and the kappa from the current batch can diverge by 0.1 with no actual change in annotator behaviour if the underlying class mix shifted. The defensible response is to track kappa, marginal frequencies, and per-class accuracy against the gold panel together, not kappa alone.
  • Treating low IAA as an annotator failure. The reflex when IAA on a class collapses is to retrain the annotator team. In practice, the modal cause is a guideline ambiguity or a schema mismatch – the annotators are not the bug, the class definition is. The defensible playbook investigates the guideline first, the calibration second, and the individual-annotator performance third.
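
The marginal-distribution point in the list above can be made concrete with a small model. The sketch assumes a binary task with a latent true class, and two annotators who each label a true positive correctly 90% of the time and a true negative correctly 90% of the time, erring independently of each other. These error rates and the independence assumption are purely illustrative.

```python
def expected_kappa(prevalence, sensitivity=0.9, specificity=0.9):
    """Expected Cohen's kappa for two annotators with identical, independent error rates."""
    p, se, sp = prevalence, sensitivity, specificity
    # Probability that both annotators say "positive" / both say "negative".
    both_pos = p * se**2 + (1 - p) * (1 - sp) ** 2
    both_neg = p * (1 - se) ** 2 + (1 - p) * sp**2
    p_obs = both_pos + both_neg
    # Each annotator's marginal rate of saying "positive".
    pos_rate = p * se + (1 - p) * (1 - sp)
    p_exp = pos_rate**2 + (1 - pos_rate) ** 2
    return (p_obs - p_exp) / (1 - p_exp)


for prevalence in (0.50, 0.20, 0.05):
    print(f"prevalence {prevalence:.2f} -> kappa {expected_kappa(prevalence):.2f}")
```

Under these assumptions raw agreement stays at 82% for every class mix, yet kappa falls from 0.64 at 50% prevalence to roughly 0.25 at 5%. Annotator behaviour is identical in each case; only the marginals moved.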

Operationalising IAA without breaking the budget

Measuring agreement on every example is operationally infeasible at production scale. The pattern that produces a defensible IAA programme without unsustainable cost:

  • Run a 10–15% IAA sample on every batch, stratified by class. Rare classes are sampled at a higher rate than their natural frequency so the per-class kappa is statistically meaningful.
  • Treat disagreement as a guideline signal first, an annotator signal second. If two calibrated annotators disagree on the same example three times in a week, the example is the bug – the guideline needs a worked example for that case.
  • Maintain a versioned gold panel of 200–1,000 adjudicated examples. Run new annotators against it at onboarding and re-certify them every 4–6 weeks thereafter; report drift over time as a separate metric from per-batch IAA.
  • Surface IAA in the model evaluation pipeline. When the model and the panel disagree on an evaluation example, note whether the panel itself disagreed on that example – disagreement on the panel is a noisy ground truth signal, not a model regression.
  • Rotate IAA reviewers across the team. The same two annotators repeatedly co-labelling the same samples will calibrate to each other rather than to the guideline, producing artificially high kappa values that do not transfer to new reviewers.
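
The stratified sample in the first point above can be sketched as follows. The per-class floor of 30 items is an assumption chosen for illustration, not a figure from this guide, and the function and parameter names are hypothetical; the provisional labels would typically come from the first-pass annotation.

```python
import random
from collections import defaultdict


def stratified_iaa_sample(items, label_of, base_rate=0.10, min_per_class=30, seed=0):
    """Pick items for double annotation, with a floor per class.

    items: list of item ids; label_of: dict mapping id -> provisional label.
    Each class contributes max(min_per_class, base_rate * class size) items,
    capped at the class size, so rare classes are over-sampled.
    """
    rng = random.Random(seed)  # fixed seed keeps the sample reproducible for audit
    by_class = defaultdict(list)
    for item in items:
        by_class[label_of[item]].append(item)

    sample = []
    for label in sorted(by_class):
        members = by_class[label]
        target = max(min_per_class, round(base_rate * len(members)))
        sample.extend(rng.sample(members, min(target, len(members))))
    return sample


# Illustrative batch: 950 common-class items and 50 rare-class items.
item_ids = list(range(1000))
provisional = {i: ("common" if i < 950 else "rare") for i in item_ids}
chosen = stratified_iaa_sample(item_ids, provisional)
rare_chosen = sum(1 for i in chosen if provisional[i] == "rare")
print(f"sampled {len(chosen)} items, {rare_chosen} from the rare class")
```

Here the rare class makes up 5% of the batch but 30 of the 125 sampled items, which is what makes a per-class kappa on it worth reading. Because the sample is deliberately unrepresentative, any headline figure computed on it should be reported alongside the stratification.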

Reading an IAA report the way a reviewer would

The IAA report is the artefact that travels with the dataset into model-risk review, regulator submission, or downstream consumer onboarding. The shape that holds up under scrutiny has five components.

  • Headline IAA per metric (Cohen's kappa, Krippendorff's alpha, or F1 against gold), with confidence intervals and the sample size on which it was computed.
  • Per-class breakdown of the IAA metric, with the marginal frequency of each class. A kappa of 0.85 on a 90%-frequency class is materially different from a kappa of 0.85 on a 2%-frequency class.
  • Trend over time – the per-batch IAA history for the last 6–12 months on long-running engagements. Reviewers look at the trend as much as the headline; a stable IAA programme is a more defensible artefact than one that swings batch-to-batch with no documented cause.
  • Disagreement-cluster summary – the specific class boundaries or example types where reviewers disagreed most often this batch, with example cases and the adjudication outcome. The highest-leverage artefact for the next guideline revision.
  • Methodology note – the sample stratification, the reviewer rotation policy, the gold-panel refresh date, and the IAA metric choice rationale. The artefact that lets a reviewer two organisations downstream verify the number without re-running the audit.
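
The confidence interval asked for in the first component can be estimated with a percentile bootstrap over the double-annotated items. The sketch below is one simple approach among several, with invented data and illustrative function names; 2,000 resamples and a 95% level are conventional defaults, not requirements.

```python
import random
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two equally long lists of categorical labels."""
    n = len(labels_a)
    p_obs = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    p_exp = sum(count_a[c] * count_b[c] for c in count_a) / (n * n)
    if p_exp == 1.0:
        return float("nan")  # undefined: both annotators used a single label
    return (p_obs - p_exp) / (1.0 - p_exp)


def bootstrap_kappa_ci(labels_a, labels_b, n_resamples=2000, level=0.95, seed=0):
    """Percentile bootstrap interval for kappa, resampling items with replacement."""
    rng = random.Random(seed)
    n = len(labels_a)
    stats = []
    for _ in range(n_resamples):
        idx = [rng.randrange(n) for _ in range(n)]
        k = cohen_kappa([labels_a[i] for i in idx], [labels_b[i] for i in idx])
        if k == k:  # drop NaN from degenerate resamples
            stats.append(k)
    if not stats:
        raise ValueError("kappa was undefined on every resample")
    stats.sort()
    low = stats[int((1 - level) / 2 * len(stats))]
    high = stats[max(0, int((1 + level) / 2 * len(stats)) - 1)]
    return low, high


# Hypothetical batch of 100 double-annotated documents: (annotator A, annotator B).
pairs = (
    [("invoice", "invoice")] * 48
    + [("receipt", "receipt")] * 45
    + [("receipt", "dispute")] * 3
    + [("dispute", "dispute")] * 1
    + [("dispute", "invoice")] * 3
)
labels_a = [a for a, _ in pairs]
labels_b = [b for _, b in pairs]
point = cohen_kappa(labels_a, labels_b)
low, high = bootstrap_kappa_ci(labels_a, labels_b)
print(f"kappa = {point:.2f}, 95% CI [{low:.2f}, {high:.2f}], n = {len(pairs)}")
```

Per-class intervals on rare classes will be wide at this sample size, which is itself useful information for the reviewer.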

The hidden cost of skipping IAA entirely

The most expensive failure mode we see is not low IAA. It is no IAA. Teams ship a dataset with no per-class agreement number, train a model on it, find that the model underperforms in production, and re-run the entire labelling cycle to fix it. The all-in cost (rework labour, ML engineer time, evaluation expansion, delayed deployment) routinely exceeds the original annotation budget by 5–10x.

In domains under regulator scrutiny, the cost compounds. Model-risk reviewers, EMA and FDA inspectors, supervisors applying the MAS and HKMA model-risk frameworks in Singapore and Hong Kong, and SOC 2 / ISO 27001 auditors all increasingly ask for evidence that the labels themselves were measured – not just the accuracy of the model trained on top of them. A dataset without an IAA report is a dataset that cannot be defended in front of these reviewers without retrofitting the measurement, which is materially harder and more expensive than building it in from day one.

The NIST AI Risk Management Framework treats data quality and traceability as first-class controls. ISO/IEC 5259, the emerging international standard on data quality for analytics and machine learning, explicitly enumerates inter-annotator agreement as a measurable property of a dataset. Programmes set up to comply with either are programmes that will not have to retrofit IAA into a labelling pipeline at the worst possible moment.

Frequently asked questions about IAA

Common questions raised by ML and data-ops teams when standing up an IAA programme:

  • How big does the IAA sample need to be? 10–15% of each batch, stratified by class, is the operational default. Smaller samples (5%) work for stable long-running engagements with audit pass rates consistently above 95%; larger samples (20–30%) are appropriate for the first few batches of any new schema.
  • Which IAA metric should we default to? Cohen's kappa for two-annotator categorical work, Fleiss' kappa for three-or-more annotators, Krippendorff's alpha for mixed categorical/ordinal/missing-data programmes, F1 or IoU against a gold panel for span and bounding-box tasks.
  • How often should we refresh the gold panel? Every 8–12 weeks for long-running engagements. Without a refresh, the team implicitly memorises the panel and the score loses its calibration value. The refresh is also the right time to add newly adjudicated hard cases that have surfaced in production batches.
  • Should annotators see IAA scores against their own work? Yes, with care. Per-annotator IAA reports are the right tool for calibration coaching; per-annotator IAA rankings against peers tend to produce a competitive dynamic that depresses overall quality. Coach against the gold panel, not against the peer ranking.
  • Can we use model agreement as a substitute for human IAA? Partially. Confident-learning style audits using a trained model to flag likely mislabels are a useful triage tool, but the model has its own biases. The defensible pattern is to use model-flagging as a sampling signal that feeds into the human IAA programme, not as a substitute for it.
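
The model-flagging pattern in the last answer can be sketched as a simplified rule in the spirit of confident learning. It is not the full method: it uses only per-class thresholds and assumes the probabilities are out-of-sample (for example, from cross-validation). Function and variable names are hypothetical. Flagged items go into the human IAA sample; the model does not relabel them.

```python
def flag_likely_mislabels(labels, probs):
    """Return indices whose given label conflicts with a confident model prediction.

    labels: list of int class indices assigned by annotators.
    probs: list of per-class probability lists from a model (assumed out-of-sample).
    """
    n_classes = len(probs[0])
    # Per-class threshold: mean predicted probability of class c over items labelled c.
    thresholds = []
    for c in range(n_classes):
        scores = [p[c] for y, p in zip(labels, probs) if y == c]
        thresholds.append(sum(scores) / len(scores) if scores else float("inf"))

    flagged = []
    for i, (y, p) in enumerate(zip(labels, probs)):
        # Classes the model is at least as confident about as it is on average.
        confident = [c for c in range(n_classes) if p[c] >= thresholds[c]]
        if confident:
            guess = max(confident, key=lambda c: p[c])
            if guess != y:
                flagged.append(i)  # route to double annotation, do not auto-relabel
    return flagged


# Illustrative data: item 2 is labelled class 0 but the model is confident in class 1.
given_labels = [0, 0, 0, 1, 1, 1]
model_probs = [[0.9, 0.1], [0.8, 0.2], [0.1, 0.9], [0.2, 0.8], [0.1, 0.9], [0.3, 0.7]]
print(flag_likely_mislabels(given_labels, model_probs))  # prints: [2]
```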