The cloud-only default is unravelling
For most of the generative-AI era, "production AI" has meant API calls to a handful of frontier providers. That default made sense when capable small models did not exist and NPU-class edge silicon was scarce. Both conditions have changed sharply in the last twelve months, and the infrastructure conversation inside enterprises is shifting with them.
The pressure is not ideological. It is a mix of economics, latency, privacy, and regulation. LLM API spend has become the fastest-growing cloud line item in many organisations – the FinOps Foundation's 2025 report lists AI cost management as the top practitioner priority for the second year running. Latency-sensitive workloads (voice, real-time moderation, industrial vision) pay a round-trip tax on every call. EU AI Act provisions, APAC data-residency requirements, and sector-specific regimes in health and finance are pushing more processing back inside the perimeter. And the gap between "small model on a local NPU" and "frontier API" has narrowed enough that for many tasks the trade-off has reversed.
The small-model wave that changed the calculus
Four release waves between late 2024 and early 2026 redefined what "small" means. Microsoft's Phi-4 (December 2024) matches or beats GPT-3.5-class performance at 14B parameters on reasoning benchmarks. Meta's Llama 3.3 70B (December 2024) and Llama 4 variants in 2025 bring frontier-adjacent quality into a weight class that runs comfortably on a single modern GPU. Google's Gemma 3 (2025) is explicitly optimised for edge and multilingual use, with 1B, 4B, 12B, and 27B tiers. Alibaba's Qwen 2.5 (2024) and Qwen 3 (2025) series have become the default multilingual small-model choice across much of APAC.
Two things are worth noting about this wave. First, the quality floor has moved – a 7B open-weights model in 2026 outperforms a 70B model from 2023 on most practical tasks. Second – and this is the point enterprise teams tend to miss – fine-tuning those small models on task-specific data routinely closes the gap with frontier models for narrow workloads. A well-tuned 4-8B model running on local hardware can match a frontier API on customer support triage, classification, extraction, and domain-specific generation – at a fraction of the per-query cost and with none of the cross-border data flow.
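To make the fine-tuning point concrete, here is a minimal sketch of a LoRA fine-tune using the Hugging Face transformers and peft libraries. The base model, dataset file, and hyperparameters are illustrative placeholders, not recommendations – substitute your own task data and tune from there.

```python
# Minimal LoRA fine-tune of a small open-weights model on task data.
# Model name, dataset path, and hyperparameters are placeholders.
from datasets import load_dataset
from peft import LoraConfig, get_peft_model
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)

base = "Qwen/Qwen2.5-7B-Instruct"            # any 4-8B open-weights model
tokenizer = AutoTokenizer.from_pretrained(base)
model = AutoModelForCausalLM.from_pretrained(base, device_map="auto")

# Low-rank adapters on the attention projections; the base weights stay
# frozen, so only a fraction of a percent of parameters are trained.
model = get_peft_model(model, LoraConfig(
    r=16, lora_alpha=32, lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"], task_type="CAUSAL_LM"))

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=1024)

train = load_dataset("json", data_files="support_triage.jsonl")["train"]
train = train.map(tokenize, batched=True, remove_columns=train.column_names)

Trainer(
    model=model,
    args=TrainingArguments(output_dir="triage-adapter", num_train_epochs=2,
                           per_device_train_batch_size=4, learning_rate=2e-4),
    train_dataset=train,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
).train()
model.save_pretrained("triage-adapter")      # adapter weights only, a few MB
```

The adapter-only output is part of what makes this operationally attractive: the artefact you version and ship is megabytes, not gigabytes.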
Edge hardware has finally caught up
The hardware side of the equation has moved just as fast. Four categories are now credible targets for on-device or on-premises inference (a memory-fit sketch follows the list):
- Modern laptop and phone NPUs. Apple's M-series Neural Engine (M3/M4 generations), Qualcomm Snapdragon X Elite / 8 Elite, and Intel Lunar Lake now deliver 40-50 TOPS on-device at single-digit watts. For sub-8B models in int4 or int8, on-device latency is under 200ms per response.
- Dedicated edge AI accelerators. Hailo-8 and Hailo-10 (13-40 TOPS), Google Coral, and Qualcomm QCS6490 bring inference into industrial cameras, drones, and kiosks at the 5-12W power budgets those form factors allow.
- Jetson-class embedded platforms. NVIDIA Jetson Orin Nano Super (released end of 2024) pushes 67 TOPS at ~25W, enough to run quantised 7-8B models at interactive speeds for on-device assistants, robotics, and vision-language workloads.
- On-prem GPU racks. A single L40S or H100-class node serves hundreds of concurrent users on a fine-tuned 8B-14B model – and, crucially, with full control over latency, data residency, and cost.
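A quick way to sanity-check which tier a given model fits is a memory-fit calculation: weights at the quantised bit-width, plus headroom for KV cache and runtime. The sketch below uses assumed overheads; real budgets come from profiling on the target device.

```python
# Back-of-envelope check: does a model fit a device's memory budget at a
# given quantisation? The KV-cache and runtime figures are assumptions.
def fits(params_b: float, bits: int, device_gb: float,
         kv_cache_gb: float = 1.0, runtime_gb: float = 0.5) -> bool:
    weights_gb = params_b * bits / 8      # e.g. 8B at int4 -> ~4 GB of weights
    needed = weights_gb + kv_cache_gb + runtime_gb
    print(f"{params_b}B @ int{bits}: ~{needed:.1f} GB needed, "
          f"{device_gb} GB available")
    return needed <= device_gb

fits(8, 4, 16)    # 8B int4 in a 16 GB laptop NPU tier: fits
fits(8, 8, 8)     # 8B int8 on an 8 GB embedded board: does not fit
fits(70, 4, 80)   # 70B int4 on a single 80 GB datacentre GPU: fits
```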
The realistic split
The picture we see emerging in customer architectures is not cloud-versus-edge. It is a deliberate two-tier split. Small fine-tuned models run at the edge or on-prem for high-volume, latency-sensitive, or privacy-constrained tasks. Frontier models via API handle the long tail – complex reasoning, multi-modal generation, tasks where frontier quality is genuinely required and volume is low enough that cost per query does not dominate.
Concretely, that looks like this in the field. A retail chain runs a Gemma 3 4B model on the point-of-sale device for menu questions, order parsing, and basic personalisation; escalates to a frontier API for complex recommendations and marketing copy. A medical imaging platform runs a quantised vision-language model on-prem for routine triage; routes ambiguous cases to a specialist API. A smart-city deployment does on-camera vision with Hailo-8, pushes only structured events (not video) to the cloud. The per-query cost in the edge tier is measured in fractions of a cent; the frontier tier handles the small slice that cannot be served locally.
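A minimal version of the routing logic behind these architectures might look like the sketch below, assuming an OpenAI-compatible local server (llama.cpp and vLLM both expose one) and a placeholder frontier endpoint. The confidence heuristic and threshold are illustrative assumptions; production routers often use a trained classifier or task-type rules instead.

```python
# Two-tier routing sketch: answer locally when confident, escalate otherwise.
# Endpoints, model names, and the threshold are illustrative assumptions.
import math
import requests

LOCAL_URL = "http://localhost:8080/v1/chat/completions"            # edge tier
FRONTIER_URL = "https://frontier.example.com/v1/chat/completions"  # placeholder
ESCALATE_BELOW = 0.7   # hypothetical threshold, tuned per task on held-out data

def confidence(completion: dict) -> float:
    # Mean token probability from the local server's logprobs
    # (OpenAI-compatible response shape). Simple, not calibrated.
    logprobs = completion["choices"][0].get("logprobs") or {}
    tokens = logprobs.get("content") or []
    if not tokens:
        return 0.0
    return sum(math.exp(t["logprob"]) for t in tokens) / len(tokens)

def answer(prompt: str) -> dict:
    messages = [{"role": "user", "content": prompt}]
    local = requests.post(LOCAL_URL, json={
        "model": "gemma-3-4b-it", "messages": messages, "logprobs": True,
    }, timeout=5).json()
    conf = confidence(local)
    if conf >= ESCALATE_BELOW:
        return {"tier": "edge", "confidence": conf, "response": local}
    # Low-confidence long tail: route to the frontier tier.
    frontier = requests.post(FRONTIER_URL, json={
        "model": "frontier-large", "messages": messages,
    }, timeout=30).json()
    return {"tier": "frontier", "confidence": conf, "response": frontier}
```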
The deployment patterns that actually work
Running models at the edge reopens a set of operational problems that a cloud-API deployment never had to solve. Getting them right is the difference between a demo and a production fleet.
- Model distribution and OTA updates. Treat the model as firmware. Sign it, version it, ship it via a staged rollout with canaries, and keep a rollback path (a signature-check sketch follows this list). Container images with integrated runtimes (TensorRT-LLM, ONNX Runtime, llama.cpp, Core ML, Qualcomm AI Engine) reduce the surface area.
- Quantisation as a first-class step. Most viable edge deployments run int4 or int8 quantised variants. Validate the quantised model against the same evaluation set as the full-precision weights – quality loss is usually small but not zero, and it varies by task (an evaluation-gate sketch follows this list).
- Observability across the fleet. Inference latency, tokens per second, memory pressure, failure modes – the telemetry you had from a cloud endpoint now has to come back from thousands of devices. Treat it as a real data pipeline, not an afterthought.
- Fallback paths. Edge devices can overheat, drop connectivity, or encounter inputs they were not trained on. A graceful fallback to the cloud tier, made explicit in the product design, is the difference between a brittle system and a resilient one (a combined telemetry-and-fallback sketch follows this list).
- Hardware-software co-design on procurement. Picking silicon before picking models (or vice versa) usually ends in a gap. For serious deployments, fix the target model weight class first, then qualify hardware against that spec.
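First, the "model as firmware" point: verify a detached signature over a downloaded artefact before loading it. The sketch uses the Python cryptography library; filenames, the key-distribution scheme, and the rollback hook are assumptions (the release pipeline would sign the SHA-256 digest with the matching private key).

```python
# Verify a signed model artifact before loading it -- "treat the model as
# firmware". Filenames and the rollback hook are illustrative.
import hashlib
from pathlib import Path
from cryptography.exceptions import InvalidSignature
from cryptography.hazmat.primitives.asymmetric.ed25519 import Ed25519PublicKey

def verify_model(artifact: Path, sig: Path, pubkey_bytes: bytes) -> bool:
    digest = hashlib.sha256(artifact.read_bytes()).digest()
    try:
        Ed25519PublicKey.from_public_bytes(pubkey_bytes).verify(
            sig.read_bytes(), digest)
        return True
    except InvalidSignature:
        return False   # refuse to load; keep serving the previous version

# After an OTA download (hypothetical names):
# if not verify_model(Path("model-v12.gguf"), Path("model-v12.sig"), DEVICE_PUBKEY):
#     rollback_to("model-v11.gguf")
```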
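Second, the evaluation gate for quantised variants: run the same eval set through both models and block the rollout if quality drops past a threshold. Exact match stands in for whatever task metric you actually use; the file format, threshold, and generate callables are assumptions.

```python
# Gate a quantised rollout on eval-set regression against full precision.
# The `generate_*` callables are stand-ins for whichever runtime serves
# each variant; eval file format and threshold are assumptions.
import json

MAX_QUALITY_DROP = 0.02   # ship only if within 2 points of full precision

def accuracy(generate, eval_path="eval_set.jsonl") -> float:
    cases = [json.loads(line) for line in open(eval_path)]
    hits = sum(generate(c["prompt"]).strip() == c["expected"].strip()
               for c in cases)
    return hits / len(cases)

def gate(generate_fp16, generate_int4) -> bool:
    base, quant = accuracy(generate_fp16), accuracy(generate_int4)
    print(f"fp16={base:.3f}  int4={quant:.3f}  delta={base - quant:+.3f}")
    return base - quant <= MAX_QUALITY_DROP
```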
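Third, fallback and telemetry, which in practice live in the same request path: every request emits a structured event, and local failures degrade to the cloud tier rather than erroring out. The transport (MQTT, batched HTTPS) and field names are illustrative.

```python
# Graceful fallback with per-request fleet telemetry. The generate callables,
# emit transport, and field names are illustrative assumptions.
import time
import uuid

DEVICE_ID = "edge-0042"   # placeholder fleet identity

def serve(prompt: str, local_generate, cloud_generate, emit) -> str:
    event = {"device_id": DEVICE_ID, "request_id": str(uuid.uuid4()),
             "tier": "edge", "error": None}
    start = time.monotonic()
    try:
        reply = local_generate(prompt, timeout=2.0)
    except Exception as exc:   # thermal throttling, OOM, runtime crash...
        event.update(tier="cloud-fallback", error=type(exc).__name__)
        reply = cloud_generate(prompt)   # degradation is visible, not silent
    event["latency_ms"] = round((time.monotonic() - start) * 1000)
    emit(event)   # ships through your fleet telemetry pipeline
    return reply
```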
The compliance and security upside
Edge inference is not just a cost story. For teams operating under the EU AI Act, APAC data-residency regimes, or sector-specific rules in health and finance, keeping inference inside the perimeter dramatically simplifies compliance. Data never crosses a border. Subject-access requests and audits become tractable. Fundamental-rights impact assessments for high-risk systems are easier to pass when the processing boundary is demonstrably local.
The security posture is better too, in the boring ways that matter. No outbound API traffic means no third-party model provider to vet, no vendor incident to inherit, no per-request logging review, and no prompt-injection payload reaching an external service with credentials. The attack surface moves onto the device, which brings its own hardening work – but it is a surface a mature IT organisation already knows how to manage.
What to prototype in the next two quarters
The 90-day pilot pattern we recommend is narrow and measurable. Pick one high-volume, latency-sensitive, or privacy-constrained task currently served by a frontier API. Fine-tune a 4-8B open-weights model on a representative sample of your traffic. Benchmark quality (against the API baseline), latency, and cost on a realistic edge target. Deploy to a small fleet with full observability and a fallback to the incumbent API.
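A scorecard for the benchmarking step can be as simple as the sketch below – the same traffic sample through the incumbent API and the edge candidate, compared on quality, latency, and unit cost. The callables, judge function, and cost figures are stand-ins for your own setup.

```python
# Pilot scorecard: quality, p95 latency, and unit cost per candidate.
# `generate`, `judge`, and the cost figures are stand-ins.
import statistics
import time

def score(name, generate, samples, judge, cost_per_query):
    latencies, quality = [], []
    for s in samples:
        t0 = time.monotonic()
        reply = generate(s["prompt"])
        latencies.append((time.monotonic() - t0) * 1000)
        quality.append(judge(reply, s["reference"]))   # 0..1 task metric
    p95 = sorted(latencies)[int(0.95 * len(latencies))]
    print(f"{name}: quality={statistics.mean(quality):.3f} "
          f"p95_latency={p95:.0f}ms cost/query=${cost_per_query:.4f}")

# score("frontier-api", call_api, samples, judge, cost_per_query=0.0120)
# score("edge-8b-int4", call_edge, samples, judge, cost_per_query=0.0004)
```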
The goal is not to replace your cloud spend in one quarter. It is to build the operational muscle – model distribution, fleet telemetry, quantisation validation, eval regression – that will matter increasingly over the next two to three years as the quality gap narrows further. The organisations that start building that muscle in 2026 will not be surprised when the default architecture changes.