How to Outsource Data Annotation: A Step-by-Step Guide for 2026

Outsourcing data annotation can accelerate your AI project – or derail it, if done poorly. This guide covers how to scope work, evaluate vendors, run pilots, structure contracts, and manage long-running annotation partnerships at production scale.

When to outsource data annotation – and when not to

Outsourcing is the right call when annotation volume exceeds what your internal team can deliver within the model-development sprint cadence, when the work requires domain expertise your team lacks (medical, legal, multilingual APAC content), or when labelling capacity needs to scale up and down with seasonality. Most enterprise AI teams hit at least one of these conditions within the first two production sprints.

Outsourcing is the wrong call when the annotation task is itself the research question – early-stage exploratory labelling where the schema changes every week is better kept in-house until the schema stabilises. It is also a poor fit when the data is so sensitive that a tightly controlled in-house pod costs less than the audit overhead of trusting an external vendor. For everything in between, a well-run annotation partner saves the AI team weeks of context-switching and produces a more consistent dataset than ad-hoc internal labelling.

The framework that follows is the eight-step process we see successful AI teams follow when standing up a data annotation engagement in Vietnam, or any equivalent offshore programme.

Step 1: define your annotation requirements before talking to vendors

Before reaching out to vendors, document exactly what you need. Vendors who receive a clear brief respond faster, quote more accurately, and produce better results. The half-hour spent writing a tight brief saves a week of clarification cycles downstream.

A complete brief covers six dimensions:

  • Data type and volume: image, video, text, audio, document, 3D point cloud, or mixed. Total asset count, growth rate, and seasonality.
  • Task definition with annotated examples: bounding boxes, polygons, semantic segmentation, NER, sentiment, intent, key/value extraction. Three to five fully-labelled examples of "ideal output" prevent more interpretation drift than any amount of written guideline can.
  • Accuracy target: minimum inter-annotator agreement (Cohen's kappa, Krippendorff's alpha) per class, plus the F1 or IoU bar against your gold panel. Different classes typically need different bars.
  • Delivery format: JSON, JSONL, CSV, COCO, Pascal VOC, YOLO, BIO tags, or your own schema. Schema stability is more important than format – a stable schema in any format beats a moving target in your preferred one.
  • Volume and timeline: total assets, weekly cadence, and acceptable variance. Distinguish between a one-shot dataset and a continuous pipeline – they have different cost structures.
  • Domain expertise required: general, medical, legal, financial, automotive, or industry-specific. State the language and regional coverage you need explicitly; a bare "English" leaves the variety and register open, and a vendor in Hanoi or Manila may not assume the one you mean.
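
One way to keep the brief honest is to write it as a structured record and check it for gaps before sending it out. The sketch below is illustrative only: the `AnnotationBrief` class, its field names, and the sample values are assumptions made for this example, not a standard vendor format.

```python
from dataclasses import dataclass, field


@dataclass
class AnnotationBrief:
    """Hypothetical record mirroring the six dimensions of the brief."""
    data_types: list[str]                  # e.g. image, video, text
    total_assets: int
    task: str                              # e.g. bounding boxes, NER
    example_paths: list[str]               # fully-labelled "ideal output" examples
    min_agreement_by_class: dict[str, float]
    delivery_format: str                   # e.g. COCO, JSONL
    weekly_assets: int
    continuous: bool                       # continuous pipeline vs one-shot dataset
    domain: str
    languages: list[str] = field(default_factory=list)

    def gaps(self) -> list[str]:
        """List the points a vendor would otherwise have to ask about."""
        problems = []
        if not 3 <= len(self.example_paths) <= 5:
            problems.append("provide three to five fully-labelled examples")
        if not self.min_agreement_by_class:
            problems.append("set an agreement floor for each class")
        if self.total_assets <= 0 or self.weekly_assets <= 0:
            problems.append("state total volume and weekly cadence")
        if not self.languages:
            problems.append("state language and regional coverage explicitly")
        return problems


# Sample values are invented for illustration.
brief = AnnotationBrief(
    data_types=["image"],
    total_assets=50_000,
    task="bounding boxes",
    example_paths=["ex1.json", "ex2.json", "ex3.json"],
    min_agreement_by_class={"pedestrian": 0.85, "cyclist": 0.80},
    delivery_format="COCO",
    weekly_assets=5_000,
    continuous=False,
    domain="automotive",
)
print(brief.gaps())
```

Here the only gap reported is the missing language coverage, which is the kind of omission that otherwise surfaces as a clarification cycle after the quote.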

Step 2: shortlist three to five vendors

Request written proposals from at least three vendors. The shortlist matters more than the headline rate – two vendors who can actually deliver on your accuracy bar are more useful than five who quote aggressively and discover the work is harder than they thought.

Evaluate on six dimensions: experience with your specific data type and domain, QA process transparency, data security and compliance posture, communication responsiveness, references from comparable projects, and willingness to run a paid pilot. Beware of vendors who promise the lowest price with no explanation of how they sustain quality at that cost – the gap usually gets paid back later in rework.

For APAC-facing teams, the shortlist often spans Vietnam (cost-quality balance), India (English-language scale), and the Philippines (voice and conversational work), with Eastern Europe as an out-of-region option for clinical and legal nuance. For most of these teams, Vietnam sits in the sweet spot for image, video, document, and Southeast Asian language work.

Step 3: run a paid pilot before committing to volume

Never commit to a large annotation programme without a pilot. A pilot of 200–500 items gives you real accuracy data, reveals workflow gaps, tests communication, and confirms the vendor understands your task. Pay for the pilot – unpaid pilots often receive a smaller team or junior reviewers, and do not reflect production quality.

  • Define a clear acceptance criterion before the pilot starts (for example, 95%+ accuracy on a stratified gold subset, or Cohen's kappa above 0.80 on the hardest class).
  • Annotate 10–20% of the pilot samples yourself as a gold standard for objective comparison.
  • Measure inter-annotator agreement per class, not just headline accuracy. The class where agreement collapses usually tells you where the guideline needs more work, not necessarily where the vendor is weak.
  • Review edge cases together – the conversation around the hard examples reveals whether annotators understood the task deeply or are pattern-matching surface features.
  • Watch the operational signals as much as the accuracy. Response times, willingness to revise guidelines, and how the vendor handles disagreement during adjudication all predict the long-running engagement better than the headline number.
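
The per-class agreement check in the list above can be sketched in a few lines. This is a minimal illustration, assuming the pilot is scored by comparing vendor labels against your own gold labels with one-vs-rest Cohen's kappa. The function names, the sample labels, and the 0.80 floor applied to every class are assumptions for the example.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two label lists covering the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two equal-length, non-empty label lists")
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    # Chance agreement: product of the two marginal frequencies per label.
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both lists contain one identical label throughout
    return (observed - expected) / (1 - expected)


def kappa_per_class(labels_a, labels_b):
    """One-vs-rest kappa for every class seen in either list."""
    classes = sorted(set(labels_a) | set(labels_b))
    return {
        c: cohens_kappa([x == c for x in labels_a], [y == c for y in labels_b])
        for c in classes
    }


# Invented pilot sample: your gold labels vs the vendor's labels.
gold = ["cat", "cat", "dog", "dog", "bird", "bird", "cat", "dog"]
vendor = ["cat", "cat", "dog", "cat", "bird", "bird", "cat", "dog"]

per_class = kappa_per_class(gold, vendor)
hardest = min(per_class, key=per_class.get)
passed = per_class[hardest] >= 0.80  # acceptance floor fixed before the pilot

print(per_class, hardest, passed)
```

In this sample a single dog/cat confusion is enough to pull the hardest class below the floor, which is why the per-class view is more informative than the headline accuracy of seven out of eight.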

Step 4: structure the contract properly

Annotation contracts are operational documents, not procurement-only paperwork. The right terms protect both sides and prevent the slow, expensive conversations that usually follow when something goes wrong at scale.

  • Data confidentiality and signed NDA – before any sample data leaves your environment, not after.
  • IP ownership: source data, annotations, and any derivative artefacts (labelling guidelines, gold panels) belong to you. The vendor retains no rights and reuses no data on other engagements.
  • Accuracy SLA and delivery cadence – specify the metric (per-class field-level accuracy or kappa), the floor, and the measurement protocol. Vague SLAs are unenforceable.
  • Rework policy – who pays when a batch falls short. The fair structure is that the vendor reworks any batch that misses the SLA at their own cost, and you pay only for batches that pass.
  • Data deletion at project end – timeline, evidence (deletion certificate), and any retention required for audit. Default should be deletion within 30 days of project close.
  • Pricing model – per-asset, per-minute, per-hour, or fixed project. Avoid blended rates that mask the actual cost driver; the cleanest engagements price each task type explicitly.
  • Security and compliance: ISO 27001 alignment, on-premise or VPC-only deployment if you handle PII, medical, or financial data. NIST AI RMF and ISO/IEC 5259 alignment are useful signals for regulated work.
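
The SLA and rework clauses above become easier to enforce once they are written as a rule that either side can run. The following is a rough sketch under stated assumptions: per-class accuracy floors, per-asset billing, and batch records shaped as plain dictionaries. The field names, floors, and figures are invented for illustration.

```python
def batch_passes(accuracy_by_class, floor_by_class):
    """A batch passes only if every class with a floor meets that floor."""
    return all(
        accuracy_by_class.get(cls, 0.0) >= floor
        for cls, floor in floor_by_class.items()
    )


def settle(batches, floor_by_class):
    """Return (billable asset count, ids of batches sent back for rework)."""
    billable, rework_ids = 0, []
    for batch in batches:
        if batch_passes(batch["accuracy"], floor_by_class):
            billable += batch["assets"]
        else:
            rework_ids.append(batch["id"])
    return billable, rework_ids


# Hypothetical floors and batch results.
floors = {"invoice_total": 0.98, "vendor_name": 0.95}
batches = [
    {"id": "b1", "assets": 1000,
     "accuracy": {"invoice_total": 0.99, "vendor_name": 0.96}},
    {"id": "b2", "assets": 1200,
     "accuracy": {"invoice_total": 0.97, "vendor_name": 0.97}},
]

billable, rework_ids = settle(batches, floors)
print(billable, rework_ids)
```

Batch b2 misses the floor on one class only, yet the whole batch goes back for rework and is not billed until it passes. A class with no reported accuracy is treated as a miss, which keeps an incomplete QA report from passing by default.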

Step 5: build a feedback loop

The best annotation partnerships improve over time. The dataset in week 26 of a programme is meaningfully better than the dataset in week one, not because annotators got faster, but because the schema, gold panel, and disagreement-cluster reports converged on the cases that matter.

Share model performance feedback with the vendor every sprint. When your model struggles on specific data types or edge cases, that signals annotation gaps the vendor can address. Active-learning routing – sending uncertain predictions back for re-labelling – is the structural pattern that makes this loop work at scale.
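
Active-learning routing can be as simple as ranking items by how unsure the model is and sending the top of that list back to the vendor. The sketch below uses prediction entropy as the uncertainty measure, which is one common choice rather than the only one. The function names, item ids, probabilities, and budget are assumptions for illustration.

```python
import math


def entropy(probs):
    """Shannon entropy (in nats) of one predicted class distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def route_for_relabelling(predictions, budget):
    """Return the ids of the `budget` most uncertain items, most uncertain first.

    `predictions` maps an item id to its list of class probabilities.
    """
    ranked = sorted(
        predictions, key=lambda item_id: entropy(predictions[item_id]), reverse=True
    )
    return ranked[:budget]


# Invented model outputs for three items over three classes.
predictions = {
    "img_001": [0.98, 0.01, 0.01],   # confident
    "img_002": [0.40, 0.35, 0.25],   # spread across all classes
    "img_003": [0.55, 0.44, 0.01],   # torn between two classes
}

to_relabel = route_for_relabelling(predictions, budget=2)
print(to_relabel)
```

The budget parameter matters in practice: it caps how many items go back per sprint, so the re-labelling queue stays within the volume the contract already covers.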

Run a 45-minute calibration call monthly between your ML team and the annotation team. Walk the disagreement cluster report, decide which guideline rules need clarifying, and update the gold panel for the next batch. This single discipline separates programmes that plateau from programmes that compound.

Step 6: monitor quality without micro-managing

The right level of oversight is structural: a versioned gold panel, per-batch QA reports with inter-annotator agreement by class, and a published disagreement-cluster log. With those three artefacts the buyer can see the quality trend without inspecting every example.
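
A disagreement-cluster log does not need heavy tooling. As a minimal sketch, assume a "cluster" means a pair of labels that two annotators assign to the same item, counted without regard to which annotator chose which label. That definition, the function name, and the sample labels are assumptions for this example.

```python
from collections import Counter


def disagreement_clusters(labels_a, labels_b, top=5):
    """Count the most frequent unordered label pairs where two annotators disagree."""
    pairs = Counter(
        tuple(sorted((a, b)))
        for a, b in zip(labels_a, labels_b)
        if a != b
    )
    return pairs.most_common(top)


# Invented labels from two annotators on the same six items.
annotator_a = ["car", "van", "car", "truck", "van", "car"]
annotator_b = ["car", "car", "van", "truck", "truck", "car"]

clusters = disagreement_clusters(annotator_a, annotator_b)
print(clusters)
```

In this sample the car/van pair accounts for two of the three disagreements, so that is the guideline rule to clarify first at the next calibration call.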

The wrong level of oversight is per-asset review by the buyer. If a buyer ends up re-checking every asset in every batch personally, the vendor is not doing the QA job they were hired for, and it is usually faster to switch vendors than to keep the current one on life support.

Step 7: plan for scale, plan for end-of-life

A successful annotation programme grows. Volume goes up, modalities are added, languages expand. Plan for this from day one: ask the vendor how they have scaled previous engagements from a pilot pod (5–10 annotators) to a production programme (50–200+), and what advance notice they need to add a new modality or language.

Plan for end-of-life with equal discipline. At some point every annotation programme either winds down (the model is mature and labelling demand drops) or migrates in-house (the team is large enough to internalise). Make sure the exit path is in the contract: data deletion, knowledge transfer of guidelines and gold panels, and a final hand-off audit. The vendor that helps you exit cleanly is the one you call back next time.

Step 8: red flags that consistently predict bad engagements

These patterns preceded the worst annotation engagements we have seen as third-party observers. Treat any one of them as a strong negative signal:

  • No pilot process. Reputable vendors welcome pilots; vendors who avoid them either cannot deliver consistent quality or have something to hide.
  • Vague QA descriptions. "We have quality checks" without specific stages, named reviewer roles, and a published kappa methodology is a warning sign.
  • Unwilling to share sample work or client references. Confidentiality is real, but a competent vendor can always produce sanitised samples or a reference call.
  • No NDA or security documentation. Vendors handling enterprise data should have a stock NDA, a DPA, and an ISO 27001 alignment statement on the shelf.
  • Quotes given without seeing your data. Annotation complexity varies too much for a blind quote to be reliable; expect revisions after a sample dataset is shared.
  • No dedicated project manager. If the vendor expects you to manage annotators directly, the labour-cost arbitrage of outsourcing is largely gone.
  • Pricing well below the market floor. Per-asset rates that are 60% lower than the market median almost always reflect a thinner QA tier, junior reviewers, or an SLA that does not commit to rework.

Frequently asked questions

A reference for the questions enterprise AI teams ask most often when scoping a data annotation outsourcing engagement:

  • How fast can a vendor turn around a pilot? Mature vendors deliver a 200–500 item pilot in 5–10 business days from NDA signature, including a labelling guideline draft, the labelled batch, and an inter-annotator agreement report.
  • Can the vendor work inside our VPC or on-premise environment? Established offshore annotation pods, particularly for medical, financial, and regulated work, will deploy inside the buyer's VPC or on-premise with no data egress. This is increasingly the default for regulated industries.
  • How do we own the IP and data we generate together? The default contract assigns full IP ownership of data, labels, gold panels, and labelling guidelines to the buyer. The vendor retains no rights and may only reference the engagement at a high level if the buyer authorises it in writing.
  • What is the right ratio of in-house to outsourced annotation? Most production programmes settle at 10–20% in-house (gold panel curation, edge-case adjudication, model-performance feedback) and 80–90% outsourced (production labelling, first-pass review). The in-house slice is what compounds quality over time.
  • How do we transition from one vendor to another if needed? With a clean contract: deletion certificates from the outgoing vendor, knowledge transfer of guidelines and gold panels, and a paid 200–500 item pilot with the incoming vendor scored against the same gold panel before any volume moves.