Why agreement is the metric, not accuracy
Accuracy presumes a known ground truth. In most enterprise annotation work, ground truth is exactly what you are trying to manufacture. Inter-annotator agreement (IAA) is the closest honest substitute: a measure of whether two or more independent reviewers, given the same example and the same guideline, would assign it the same label.
The reason it matters is structural. A model trained on a dataset where reviewers disagree 20% of the time has an error floor of roughly 20% baked in – no architectural change can push it past the noise in its own labels. The most cost-effective AI investment in many organisations is not a new model; it is dragging IAA on the ambiguous classes from 0.65 to 0.85.
Pick the right statistic for the task
IAA is a family of statistics, not a single metric. Picking the wrong one looks fine on a dashboard and fails in audit (a computation sketch follows the list):
- Cohen's kappa – two annotators, categorical labels. Corrects for chance agreement. On the Landis and Koch (1977) scale: 0.40 and below is fair at best, 0.41-0.60 moderate, 0.61-0.80 substantial, and above 0.80 almost perfect.
- Fleiss' kappa – more than two annotators, fixed set of categories. Used widely in radiology and pathology where panels of three or more clinicians label the same study.
- Krippendorff's alpha – the most flexible of the family. Handles missing data, ordinal and interval scales, any number of annotators. Klaus Krippendorff's "Content Analysis: An Introduction to Its Methodology" remains the canonical reference. Hugging Face and DeepLearning.AI both recommend alpha as the default for mixed annotation programmes.
- F1 / IoU against a gold set – for span, segmentation, and bounding-box tasks where chance agreement is hard to define. Pair with a stratified gold panel reviewed by senior annotators.
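To make the choices concrete, here is a minimal sketch of computing all four on toy labels. It assumes the scikit-learn, statsmodels, krippendorff, and NumPy packages are installed; the annotators, label values, and bounding boxes are illustrative, not part of any real dataset.

```python
# pip install scikit-learn statsmodels krippendorff numpy
import numpy as np
from sklearn.metrics import cohen_kappa_score
from statsmodels.stats.inter_rater import aggregate_raters, fleiss_kappa
import krippendorff

# Toy categorical labels: 8 examples, 3 annotators (values are illustrative).
annotator_a = ["spam", "spam", "ham", "ham",  "spam", "ham", "spam", "ham"]
annotator_b = ["spam", "ham",  "ham", "ham",  "spam", "ham", "spam", "spam"]
annotator_c = ["spam", "spam", "ham", "spam", "spam", "ham", "ham",  "ham"]

# Cohen's kappa: exactly two annotators, categorical labels.
print("Cohen's kappa (A vs B):", cohen_kappa_score(annotator_a, annotator_b))

# Fleiss' kappa: three or more annotators over a fixed category set.
# aggregate_raters expects integer category codes per (example, annotator).
codes = np.array([[{"ham": 0, "spam": 1}[label] for label in row]
                  for row in zip(annotator_a, annotator_b, annotator_c)])
table, _ = aggregate_raters(codes)
print("Fleiss' kappa (A, B, C):", fleiss_kappa(table))

# Krippendorff's alpha: any number of annotators, tolerates missing labels (np.nan).
# Rows are annotators, columns are examples; same labels as above, with one value
# dropped from annotator C purely to show the missing-data handling.
reliability = np.array([
    [1, 1, 0, 0,      1, 0, 1, 0],  # annotator A (1 = spam, 0 = ham)
    [1, 0, 0, 0,      1, 0, 1, 1],  # annotator B
    [1, 1, 0, np.nan, 1, 0, 0, 0],  # annotator C, one missing label
])
print("Krippendorff's alpha:",
      krippendorff.alpha(reliability_data=reliability,
                         level_of_measurement="nominal"))

# IoU for a bounding-box task, where chance agreement is hard to define.
def iou(box_a, box_b):
    """Intersection-over-union of two (x1, y1, x2, y2) boxes."""
    x1, y1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    x2, y2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)

print("IoU (annotator vs gold box):", iou((10, 10, 50, 50), (12, 8, 48, 52)))
```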
What "good" actually means
A 0.85 alpha on a balanced classification task can hide a 0.50 alpha on the rarest and most consequential class. The number to publish in your QA report is not a single headline IAA – it is IAA per class, plus disagreement clusters that tell you which guideline rules need attention.
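One straightforward way to produce that per-class view is to binarise each class one-vs-rest and compute a chance-corrected statistic for it alone. The sketch below does this with Cohen's kappa and an illustrative three-class label distribution in which the rare class sits far below the headline number.

```python
from collections import Counter

from sklearn.metrics import cohen_kappa_score

# Illustrative two-annotator labels: 50 examples, three classes, one rare.
annotator_a = ["routine"] * 40 + ["review"] * 6 + ["urgent"] * 4
annotator_b = ["routine"] * 40 + ["review"] * 6 + ["urgent"] + ["review"] * 3

def per_class_kappa(labels_a, labels_b):
    """Cohen's kappa per class, computed one-vs-rest for each class."""
    classes = sorted(set(labels_a) | set(labels_b))
    return {
        c: cohen_kappa_score(
            [label == c for label in labels_a],
            [label == c for label in labels_b],
        )
        for c in classes
    }

print("Headline kappa:", cohen_kappa_score(annotator_a, annotator_b))
print("Per-class kappa:", per_class_kappa(annotator_a, annotator_b))
print("Class frequencies:", Counter(annotator_a))
```

On this toy distribution the headline kappa lands around 0.82 while the rare "urgent" class sits near 0.38: exactly the gap the per-class report exists to expose.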
In regulated domains the bar is higher and the metric expands. The BraTS brain-tumor segmentation challenge, run annually since 2012, requires multiple expert raters per case and reports Dice scores against an aggregated reference, with explicit treatment of inter-rater variability. Most clinical AI submissions to the FDA and EMA now include some form of inter-rater agreement evidence as part of the data-quality narrative.
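For segmentation, the agreement measure is typically an overlap score rather than a chance-corrected kappa. A minimal sketch of the Dice coefficient between two raters' binary masks, with toy arrays standing in for real imaging slices:

```python
import numpy as np

def dice(mask_a: np.ndarray, mask_b: np.ndarray) -> float:
    """Dice coefficient between two binary masks: 2|A∩B| / (|A| + |B|)."""
    a, b = mask_a.astype(bool), mask_b.astype(bool)
    denom = a.sum() + b.sum()
    if denom == 0:
        return 1.0  # both raters marked nothing; conventions vary for this case
    return 2.0 * np.logical_and(a, b).sum() / denom

# Two raters' masks on a toy 4x4 slice (values are illustrative).
rater_1 = np.array([[0, 1, 1, 0],
                    [0, 1, 1, 0],
                    [0, 0, 0, 0],
                    [0, 0, 0, 0]])
rater_2 = np.array([[0, 1, 1, 0],
                    [0, 1, 0, 0],
                    [0, 0, 0, 0],
                    [0, 0, 0, 0]])
print("Inter-rater Dice:", dice(rater_1, rater_2))  # 2*3 / (4+3) ≈ 0.857
```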
Operationalising IAA without breaking the budget
Measuring agreement on every example is expensive. The pattern that works in production:
- Run a 10-15% IAA sample on every batch, stratified by class so rare categories are not under-sampled (a sampling sketch follows this list).
- Treat disagreement as a guideline signal first, an annotator signal second. If two trained reviewers disagree on the same example three times in a week, the example is the bug.
- Maintain a versioned gold panel of 200-1,000 adjudicated examples. Run new annotators against it at onboarding and at regular intervals thereafter; report drift over time.
- Surface IAA in the model evaluation pipeline. If the model and the panel disagree, note whether the panel itself disagreed – that prevents you from treating noisy ground truth as a model regression.
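A minimal sketch of the stratified sampling step, assuming a simple batch schema with "id" and "label" fields; the 15% rate and the five-item-per-class floor are illustrative defaults, not fixed rules.

```python
import random
from collections import defaultdict

def stratified_iaa_sample(batch, sample_rate=0.15, min_per_class=5, seed=0):
    """Pick items for double annotation, stratified by the first-pass label.

    batch: list of dicts with at least an 'id' and a 'label' key (illustrative schema).
    Every class contributes at least min_per_class items (or the whole class if it
    is smaller), so rare categories are not under-sampled.
    """
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for item in batch:
        by_class[item["label"]].append(item)

    sample = []
    for label, items in by_class.items():
        k = max(min_per_class, round(len(items) * sample_rate))
        sample.extend(rng.sample(items, min(k, len(items))))
    return sample

# Usage: a batch dominated by one class, with a rare class of 8 items.
batch = ([{"id": i, "label": "routine"} for i in range(500)]
         + [{"id": 500 + i, "label": "urgent"} for i in range(8)])
picked = stratified_iaa_sample(batch)
print({label: sum(1 for x in picked if x["label"] == label)
       for label in ("routine", "urgent")})
# routine contributes 75 items at the 15% rate; urgent still gets 5 despite being rare.
```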
The hidden cost of skipping it
The most expensive failure mode we see is not low IAA. It is no IAA. Teams ship a dataset with no per-class agreement number, train a model on it, find that the model under-performs in production, and re-run the entire labelling cycle to fix it. In domains under regulator scrutiny, the cost compounds: model-risk reviewers, EMA inspectors, and SOC 2 auditors all increasingly ask for evidence that the labels themselves were measured.
The NIST AI Risk Management Framework treats data quality and traceability as first-class controls. ISO/IEC 5259 (the data-quality standard for AI) goes further and explicitly enumerates inter-annotator agreement as a measurable property of a dataset. Programmes set up to comply with either are programmes that will not have to retrofit IAA into a labelling pipeline at the worst possible moment.
Where DataX Power fits
Every annotation engagement we run ships with per-class IAA, a versioned gold panel, and a disagreement-cluster report alongside the labels. For clients heading into FDA, MDR, or SOC 2 review, that report becomes part of the submission. If your team is currently running labelling without an IAA programme, that is the first thing we usually fix.