The performance gap is real and well-measured
Despite the steady pace of multilingual progress in the open ML research literature, frontier models still under-perform on most APAC languages outside the highest-resourced trio of Mandarin, Japanese, and Korean. The pattern is consistent across benchmarks: a model fluent in English routinely degrades by 10–30 points on Vietnamese, Thai, Bahasa Indonesia, Tagalog, Khmer, Lao, Burmese, and Tetum tasks – not as a deficit of intelligence, but as a deficit of in-language training and evaluation data.
The pattern repeats in evaluation work built explicitly for APAC. AI4Bharat, the IIT Madras-led research initiative, has published successive Indic-language benchmarks showing that the models that perform best on English degrade significantly on Tamil, Bengali, Telugu, and Malayalam tasks – often by more than the difference in available training data alone would suggest. VinAI's PhoBERT and PhoGPT lines have made the same case empirically for Vietnamese: a Vietnamese-trained model on Vietnamese tasks routinely beats a much larger English-centric model that has only seen translated Vietnamese.
The implication for production AI teams targeting APAC markets is straightforward. Translated training data underperforms in-language training data in production by a margin that compounds across model size, compute, and evaluation effort. The cheapest path to defensible APAC-language model quality is in-language annotation, not translated annotation paired with a larger model.
Why translation pipelines do not solve it
A common reflex among teams new to APAC-language AI is to source labels in English and translate them into the target language. The approach is tempting because it scales – but it does not survive contact with the production distribution. Three failure modes recur across every APAC-language programme that tries the translation shortcut.
- Cultural specificity. Sentiment, formality, honorifics, and politeness markers in Thai, Korean, Japanese, and Vietnamese carry meaning that has no direct English mapping. A toxicity or intent classifier trained on translated English labels systematically misses in-context insults and miscategorises polite refusals, because the linguistic phenomenon the model needs to learn was never preserved through the translation.
- Script and orthography. Khmer text routinely lacks word boundaries and uses sub-script consonant clusters; Vietnamese diacritics carry phonemic and semantic weight (tone marks are not optional); traditional and simplified Chinese share most characters but encode different conventions and regional vocabulary; Thai has no word spacing in running text. Pipelines that round-trip through English silently strip these distinctions and train the model on the noise that remains.
- Domain vocabulary. Legal Vietnamese, medical Thai, regulatory Bahasa Indonesia, and financial Mandarin all use technical vocabularies that do not appear in mainstream training corpora at any meaningful frequency. Translation produces text that is surface-plausible and substantively wrong, and the production model trained on it fails on exactly the high-value enterprise tasks that justified the AI investment in the first place.
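The orthography failure mode is easy to reproduce. The sketch below is a minimal illustration, not a description of any particular pipeline: `strip_diacritics` is a hypothetical helper standing in for whatever lossy folding step a translation round-trip or ASCII-only preprocessor applies. It uses Python's standard `unicodedata` module to decompose each character and drop the combining marks.

```python
import unicodedata


def strip_diacritics(text: str) -> str:
    """Drop combining marks, as a lossy ASCII-folding step would (hypothetical helper)."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")


# Six different Vietnamese words that differ only in their tone mark.
words = ["ma", "má", "mà", "mả", "mã", "mạ"]
folded = {strip_diacritics(w) for w in words}

print(len(set(words)), "distinct words before folding")
print(len(folded), "distinct form after folding:", folded)
```

All six words collapse to the same string once the tone marks are gone, so a model trained on the folded text can no longer tell them apart.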
What "good" annotation looks like for these languages
The pattern that consistently produces production-grade datasets for low-resource APAC languages has four observable properties. Programmes that ship all four routinely outperform programmes that ship only the first one or two.
- In-language, in-region annotators. Native speakers, recruited and trained inside the relevant market, with senior reviewers who can adjudicate culturally specific edge cases. Remote-only labelling pipelines staffed thousands of kilometres away from the linguistic and cultural context routinely under-perform.
- Localised guidelines. The labelling guideline is written in the target language first, with English as the secondary reference. Edge cases are encoded with in-language examples, not with translated approximations. The guideline is a living document, updated as the team encounters new edge cases the original specification did not anticipate.
- Explicit cultural taxonomy. Politeness levels, honorific usage, code-switching patterns, regional dialect markers, and other culturally-specific phenomena appear as labels in the schema rather than living in annotator heads. When the schema does not capture them, every annotator interprets them slightly differently and the model learns the disagreement as a feature.
- Native-language QA infrastructure. The inter-annotator agreement panel is in-language. The disagreement-cluster reports are in-language. The senior reviewer adjudicating hard cases is fluent in the specific dialect being annotated, not just the broad language family. The audit team that scores the work is staffed by native speakers, not by reviewers reading translated transcripts of the labels.
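One way to make the "explicit cultural taxonomy" property concrete is to put the cultural dimensions into the record type itself. The sketch below is illustrative only: the field names, the enum values, and the `th-central` dialect tag are assumptions made for the example, not a recommended taxonomy, and a real schema would be designed per language with a linguist.

```python
from dataclasses import dataclass, fields
from enum import Enum


class Register(Enum):
    FORMAL = "formal"
    COLLOQUIAL = "colloquial"


class Politeness(Enum):
    PLAIN = "plain"
    POLITE = "polite"
    HONORIFIC = "honorific"


@dataclass(frozen=True)
class CulturalLabels:
    register: Register
    politeness: Politeness
    dialect: str          # hypothetical tag set, e.g. "th-central"
    code_switched: bool


@dataclass(frozen=True)
class AnnotatedExample:
    text: str
    language: str
    task_label: str
    cultural: CulturalLabels
    annotator_id: str


def cultural_disagreements(a: AnnotatedExample, b: AnnotatedExample) -> list[str]:
    """Names of the cultural fields on which two annotations of one text differ."""
    if a.text != b.text:
        raise ValueError("annotations refer to different texts")
    return [
        f.name
        for f in fields(CulturalLabels)
        if getattr(a.cultural, f.name) != getattr(b.cultural, f.name)
    ]


# Illustrative label values: two annotators agree on everything except politeness.
text = "ขอบคุณครับ"
first = AnnotatedExample(
    text, "th", "gratitude",
    CulturalLabels(Register.FORMAL, Politeness.POLITE, "th-central", False), "ann-01",
)
second = AnnotatedExample(
    text, "th", "gratitude",
    CulturalLabels(Register.FORMAL, Politeness.PLAIN, "th-central", False), "ann-02",
)
print(cultural_disagreements(first, second))
```

Because the cultural dimensions are explicit fields, disagreement on them becomes something the QA process can count and route to a reviewer rather than something that stays hidden inside a single task label.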
Language-by-language: what each major APAC language requires
Each major APAC language has structural features that determine what a defensible annotation programme looks like for it. A brief tour of the principal languages, with the annotation-relevant properties of each.
- Vietnamese: tonal language with no morphological inflection but rich diacritical marks. Tone marks are phonemic – they change meaning, not just pronunciation. Native annotators handle tone-mark recovery, regional accent transcription (Northern vs Southern dialects), and the high rate of French and Chinese loanwords that surface in modern Vietnamese text.
- Thai: no word spacing in the written language, complex consonant clusters, five lexical tones written with four tone marks (mai ek, mai tho, mai tri, mai chattawa), and substantial diglossia between formal and colloquial registers. Annotation tooling has to support Thai-specific tokenisation, and the schema has to account for the formal/colloquial register distinction where it matters for the downstream model.
- Bahasa Indonesia and Bahasa Melayu (Malay): shared root vocabulary with meaningful regional and stylistic variation. Code-switching with English, Mandarin, and regional minority languages is operationally routine in conversational and social-media data. Indonesian text additionally exhibits substantial slang innovation that domain-specific guidelines have to address.
- Tagalog and Filipino: strong English code-switching ("Taglish") that is the operational norm in the Philippines' urban conversational data. The schema for Tagalog NLP work has to capture the language-switch points and the per-segment language identification, rather than treating the text as monolingual.
- Khmer (Cambodia): no word spacing, complex sub-script consonant clusters, and a script that requires Unicode-aware tooling to render and process correctly. The annotator base is smaller than for Vietnamese or Thai, and senior-reviewer adjudication on hard cases is a binding constraint on programme throughput.
- Lao: tonal language with a script related to Thai but with distinct conventions. The native-speaker annotator pool is structurally small; defensible programmes typically source annotators from inside Lao PDR and operate with extended onboarding cycles compared to higher-resource languages.
- Burmese (Myanmar): complex script, multiple romanisation conventions, and a small native-speaker annotator base outside Myanmar. Most production programmes that need Burmese annotation operate under longer ramp times and tighter QA cycles than for the higher-resource APAC languages.
- Tetum (Timor-Leste): low-resource language with limited published reference corpora. Programmes that need Tetum annotation typically combine native-speaker annotation with bilingual reviewer adjudication using Indonesian or Portuguese as the secondary reference language.
- Mandarin (Simplified, Traditional): not a low-resource language globally, but with distinct conventions across mainland China (Simplified), Singapore and Malaysia (Simplified with regional vocabulary), Hong Kong (Traditional + Cantonese influence), and Taiwan (Traditional with mainland-Taiwan vocabulary differences). The schema has to encode which variant is being annotated.
- Korean: not low-resource globally, but its distinct linguistic properties (agglutinative morphology, a system of speech levels traditionally counted as seven, contextually dropped subjects) make it harder to annotate than the resource level suggests. Honorific tagging is its own annotation dimension that surfaces in nearly every Korean enterprise NLP programme.
- Japanese: like Korean, not strictly low-resource but linguistically demanding (three scripts mixed in normal text, complex politeness register system, contextual omission). Annotation programmes consistently need senior native-speaker reviewers for the politeness and register dimensions.
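For code-switched text such as Taglish, the requirement to capture switch points and per-segment language identification can be sketched as a span-level schema. Everything here is an assumption made for illustration: the `Segment` type, the use of character offsets, and the `tl` / `en` language tags. The example also shows only a switch at a sentence boundary; real data switches mid-sentence as well.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Segment:
    start: int  # inclusive character offset
    end: int    # exclusive character offset
    lang: str   # language tag for this span, e.g. "tl" or "en"


def validate_segments(text: str, segments: list[Segment]) -> None:
    """Raise ValueError unless the segments tile the text exactly, in order."""
    cursor = 0
    for seg in segments:
        if seg.start != cursor or seg.end <= seg.start:
            raise ValueError(f"gap, overlap, or empty span at offset {seg.start}")
        cursor = seg.end
    if cursor != len(text):
        raise ValueError("segments do not cover the whole text")


def switch_points(segments: list[Segment]) -> list[int]:
    """Character offsets at which the language changes."""
    return [
        cur.start
        for prev, cur in zip(segments, segments[1:])
        if prev.lang != cur.lang
    ]


tagalog = "Kumain ka na ba? "
english = "Let's go."
text = tagalog + english
segments = [
    Segment(0, len(tagalog), "tl"),
    Segment(len(tagalog), len(text), "en"),
]

validate_segments(text, segments)
print(switch_points(segments))
```

Requiring the segments to tile the text with no gaps means every character carries a language tag, which is what stops the dataset from silently treating a mixed utterance as monolingual.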
Staffing pattern: how a defensible multilingual pod is built
The operational pattern that consistently produces production-grade APAC-language datasets has five layers, each addressing a failure mode the layer above does not catch.
- Annotator tier: native speakers recruited inside the target market, with at least one year of prior annotation or BPO experience, and a documented language-and-dialect profile (Hanoi vs Saigon Vietnamese, Bangkok vs Chiang Mai Thai, Jakarta vs Surabaya Indonesian).
- Calibration tier: each annotator is calibrated against a language-specific gold panel before shipping production labels. The calibration target is typically κ > 0.80 against the panel on the headline metric, plus per-dialect spot-checks where dialect matters.
- Reviewer tier: senior native-speaker reviewers handle adjudication on disagreement cases. Reviewers are typically dialect-specific (a Northern Vietnamese reviewer does not adjudicate Southern Vietnamese disagreement, and vice versa) and rotate to prevent reviewer fatigue on the same content.
- Bilingual quality-lead tier: a quality lead who is bilingual in the target language and English bridges the client team and the annotation team. The quality lead translates client feedback into in-language operational changes, and translates in-language quality reports back to the client team without losing the linguistic specifics.
- Linguistic-specialist tier: for the hardest schema decisions (politeness handling in Korean, formal/colloquial register in Thai, dialect markers in Vietnamese), a domain linguist is engaged either continuously or on a consulting basis to design the schema and resolve appeals from the reviewer tier.
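The calibration gate in the second tier can be expressed in a few lines. The sketch below assumes Cohen's kappa as the agreement statistic and a single categorical label per item; the text above specifies only κ > 0.80 against the gold panel, so the choice of statistic, the function names, and the toy labels are assumptions.

```python
from collections import Counter

KAPPA_THRESHOLD = 0.80  # calibration target quoted in the text


def cohens_kappa(labels_a: list[str], labels_b: list[str]) -> float:
    """Cohen's kappa for two equal-length sequences of categorical labels."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label sequences must be non-empty and equal in length")
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both sides used one identical label throughout
    return (observed - expected) / (1 - expected)


def passes_calibration(annotator: list[str], gold: list[str]) -> bool:
    """True when the annotator clears the kappa threshold against the gold panel."""
    return cohens_kappa(annotator, gold) > KAPPA_THRESHOLD


# Toy gold panel of 20 items; candidate A makes one error, candidate B makes four.
gold = ["pos"] * 10 + ["neg"] * 10
candidate_a = ["pos"] * 10 + ["neg"] * 9 + ["pos"]
candidate_b = ["pos"] * 8 + ["neg"] * 2 + ["neg"] * 8 + ["pos"] * 2

print(passes_calibration(candidate_a, gold))
print(passes_calibration(candidate_b, gold))
```

On this toy panel candidate A scores about 0.90 and clears the gate, while candidate B scores about 0.60 and would return to training. Where dialect matters, the same check would be run separately on each dialect's slice of the panel.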
LLM-era considerations for low-resource APAC languages
The rise of large multilingual language models has changed what APAC-language annotation is for, without removing the need for it. Three patterns are increasingly dominant in 2026.
First, evaluation-set annotation. The annotated benchmarks that test how well a multilingual LLM performs on Vietnamese, Thai, Khmer, or Tagalog matter materially more in 2026 than they did before the LLM era. Every model on the leaderboard is scored against them, and high-quality APAC-language evaluation panels are some of the most valuable annotation assets a regional AI team can build.
Second, preference data for fine-tuning. APAC-deployed LLMs that need to match the linguistic and cultural conventions of regional users require RLHF-style preference data in the target language. This is high-skill annotation work – the annotators are ranking model outputs against each other on subtle quality dimensions – and the work cannot be sourced cleanly from English-only preference panels.
Third, structured-output and tool-use annotation in target languages. Production LLM systems for regional banking, healthcare, e-commerce, and government use cases need to produce structured outputs (JSON, function calls, form fields) populated with culturally and orthographically correct content. Annotating the correct structured outputs for each input is its own discipline, and is consistently in-language work where the annotator base has to understand both the schema and the linguistic conventions of the target market.
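A small part of structured-output annotation can be automated: comparing a model's output against the annotator's gold record field by field. The sketch below is a minimal illustration under stated assumptions, namely flat JSON objects, hypothetical field names (`city`, `amount_vnd`), and Unicode NFC normalisation before comparison so that two encodings of the same visible string are not scored as an error, while a value that has lost its diacritics is.

```python
import json
import unicodedata

_MISSING = object()  # sentinel so an absent key differs from an explicit null


def _canonical(value):
    """NFC-normalise strings; leave other JSON values untouched."""
    if isinstance(value, str):
        return unicodedata.normalize("NFC", value)
    return value


def field_mismatches(gold_json: str, output_json: str) -> list[str]:
    """Keys whose values differ between gold and output (flat objects only)."""
    gold = json.loads(gold_json)
    output = json.loads(output_json)
    keys = sorted(set(gold) | set(output))
    return [
        key
        for key in keys
        if _canonical(gold.get(key, _MISSING)) != _canonical(output.get(key, _MISSING))
    ]


gold = json.dumps({"city": "Đà Nẵng", "amount_vnd": 1500000}, ensure_ascii=False)
# Same visible text, but stored as base letters plus combining marks.
decomposed = json.dumps(
    {"city": unicodedata.normalize("NFD", "Đà Nẵng"), "amount_vnd": 1500000},
    ensure_ascii=False,
)
# Diacritics lost entirely.
stripped = json.dumps({"city": "Da Nang", "amount_vnd": 1500000}, ensure_ascii=False)

print(field_mismatches(gold, decomposed))
print(field_mismatches(gold, stripped))
```

The first comparison reports no mismatches and the second flags `city`. A check like this only catches orthographic loss; whether the content is culturally appropriate still needs an in-language annotator.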
Common pitfalls in APAC-language annotation programmes
Recurring patterns that consistently produce noisy or unusable datasets:
- Using diaspora annotators as a substitute for in-region annotators. A Vietnamese annotator in Sydney or California will produce reasonable work on standard text but systematically miss the contemporary slang, code-switching patterns, and regional vocabulary that production data contains. The diaspora premium is rarely worth the quality cost.
- Treating the language as a monolith. Vietnamese has materially different Northern and Southern dialects. Thai has Central, Northern, Northeastern (Isan), and Southern variants. Chinese has Simplified/Traditional plus regional Mandarin, Cantonese, and other dialect variation. A defensible programme reports per-dialect quality metrics, not just per-language.
- Single-headline IAA reporting on multilingual datasets. A global IAA of 0.84 across a five-language dataset can hide a 0.65 IAA on the Tagalog subset and a 0.92 IAA on the English subset. Per-language reporting is the artefact a model-risk reviewer or auditor will ask for.
- Skipping the cultural taxonomy. Politeness, honorifics, formality registers, code-switching markers, and dialect indicators all belong in the schema for the languages where they matter. When they are not labelled, the model cannot learn them, and the production failures cluster on exactly the cases where the cultural feature was load-bearing.
- Under-investing in senior reviewer capacity. The senior reviewer is the bottleneck for adjudicating hard cases in low-resource languages. Programmes that staff annotators 20:1 to senior reviewers consistently produce noisy adjudication; the defensible ratio is closer to 8:1 to 10:1 in low-resource APAC programmes.
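The single-headline IAA pitfall is straightforward to demonstrate. The sketch below assumes double-annotated items stored as `(language, label_a, label_b)` tuples and uses Cohen's kappa as the agreement statistic; the data is a toy set constructed for the example, not a measured result.

```python
from collections import Counter, defaultdict


def cohens_kappa(pairs: list[tuple[str, str]]) -> float:
    """Cohen's kappa over (label_a, label_b) pairs from two annotators."""
    n = len(pairs)
    observed = sum(a == b for a, b in pairs) / n
    freq_a = Counter(a for a, _ in pairs)
    freq_b = Counter(b for _, b in pairs)
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0
    return (observed - expected) / (1 - expected)


def kappa_by_language(records: list[tuple[str, str, str]]) -> dict[str, float]:
    """Agreement computed separately for each language in the dataset."""
    grouped: dict[str, list[tuple[str, str]]] = defaultdict(list)
    for lang, a, b in records:
        grouped[lang].append((a, b))
    return {lang: cohens_kappa(pairs) for lang, pairs in grouped.items()}


# Toy data: 20 English items with full agreement, 10 Tagalog items with 3 disagreements.
records = (
    [("en", "pos", "pos")] * 10
    + [("en", "neg", "neg")] * 10
    + [("tl", "pos", "pos")] * 4
    + [("tl", "pos", "neg")] * 1
    + [("tl", "neg", "neg")] * 3
    + [("tl", "neg", "pos")] * 2
)

pooled = cohens_kappa([(a, b) for _, a, b in records])
per_language = kappa_by_language(records)

print(f"pooled: {pooled:.2f}")
for lang, kappa in sorted(per_language.items()):
    print(f"{lang}: {kappa:.2f}")
```

On this toy set the pooled kappa is 0.80, while the per-language breakdown is 1.00 for English and 0.40 for Tagalog: the headline figure looks acceptable only because the larger English subset dominates it.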
The economics of in-language APAC annotation
In-language APAC annotation is more expensive per unit than English annotation. The downstream economics still favour it. Modelled across the model lifecycle, the highest-quality production systems in non-English markets are consistently the ones with the highest in-language data investment – the extra spend at the labelling stage is recovered through much smaller compute, evaluation, and re-labelling bills downstream.
For enterprise teams across Vietnam, Thailand, Singapore, Malaysia, the Philippines, and the Indonesian archipelago, the realistic decision is not "in-language or translated". It is "in-language now, or in-language in 18 months after the translated dataset under-performs in market and has to be rebuilt". The former is cheaper, faster, and produces a defensible artefact rather than a sunk cost.
The cost gap is smaller than buyers typically expect once it is modelled correctly. In-language APAC labelling typically prices at 40–80% of US onshore rates – substantially higher than the bulk English offshore rate, but still meaningfully below the all-in cost of an English-translated dataset that fails in production and has to be redone.
Frequently asked questions
Common questions raised by enterprise AI teams evaluating an APAC-language annotation programme:
- Can we start with translated data and switch to in-language later? Yes, but the switch cost is high. Translated datasets typically need to be reannotated from scratch when the production model is found to underperform, because the underlying language phenomena were never captured. Most teams that start with translated data and then switch find that it would have been cheaper to start in-language.
- How do we evaluate a vendor on a specific APAC language? Run a paid pilot of 500–2,000 examples in the target language, with the gold panel and the acceptance criteria specified up front. The per-class IAA, the per-dialect breakdown, and the audit pass rate are the comparable artefacts across vendors. Vendors who report only headline accuracy on a single language are not running a defensible per-language QA programme.
- Should the annotation team be 100% native speakers or can a portion be near-native? Production-grade annotation requires native speakers on the annotator and senior-reviewer tiers. Near-native speakers (advanced second-language) can contribute on bilingual coordination and on the quality-lead tier, but should not be on the primary labelling chain for low-resource APAC languages.
- What about Mandarin – is it really a low-resource language requiring this discipline? Mandarin is not low-resource globally, but the regional variation across mainland China, Singapore, Malaysia, Hong Kong, and Taiwan is meaningful. The schema choice (Simplified vs Traditional, mainland vs regional vocabulary) is consequential, and per-variant reporting is the right operational pattern.
- How do regulatory and data-residency requirements interact with APAC-language annotation? Each market has specific personal-data protection rules (PDPA Singapore, PDPA Thailand, Vietnam Decree 13, Indonesia's PDP Law, PDPO Hong Kong). In-region annotation pods operate routinely under these regulations; offshore pipelines processing the same data face higher cross-border-transfer compliance overhead.


