Infrastructure Service

Kubernetes for AI in 2026: GPU Scheduling, Kueue, and Why Your Cluster Is Starving

Default Kubernetes scheduling is wrong for AI workloads. Kueue, Volcano, topology-aware placement, MIG partitioning, and the operational discipline that distinguishes a 40%-utilised cluster from a 75%-utilised one are not optional for serious GPU workloads in 2026. This guide is a practitioner's walk-through of the scheduler controls that actually move the number.

19 October 2025 · 14 min read

By Chris Pham


Why the default scheduler is wrong for AI

Kubernetes was designed for stateless, horizontally-scalable web services. Its default scheduler optimises for placing individual pods on nodes that have room, treats each pod as independent of the others, and evicts opportunistically when capacity pressure changes. That behaviour is appropriate for microservices. It is almost entirely wrong for AI workloads.

An AI training job is not a single pod – it is a gang of pods that must all start together, must run co-located for inter-GPU network performance, must share a tight topology to keep high-bandwidth interconnect utilised, and must checkpoint before they can be safely preempted. Scheduled with default Kubernetes policies, training jobs partially start, sit with one or two pods waiting for capacity, time out, restart from zero, and burn money on compute that produces nothing.

Inference workloads have a different but related problem. Latency-sensitive serving traffic gets scheduled next to batch jobs that are competing for the same physical GPU, and the p99 latency page that fires at 3am is usually caused by the scheduler, not the model. The team spends a week debugging the model when the actual fix is in the cluster configuration.

In 2026, running AI at scale on Kubernetes means replacing or supplementing the default scheduler with controls specifically designed for the workload. The specifics of what to install and how to configure it are the difference between a cluster sitting at 40% utilisation and one running at 75% – which is the difference between a platform where engineers queue while paid-for GPUs sit idle and one that pays for itself.

Gang scheduling: Kueue and Volcano

Gang scheduling – the guarantee that either all pods of a job get their resources simultaneously, or none do – is the single most important capability for training workloads. Without it, partial-start jobs idle capacity that other workloads could be using. With it, training capacity gets utilised efficiently and queue behaviour becomes predictable.
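The all-or-nothing property can be illustrated with a toy placement routine. This is a minimal sketch, not how Kueue or Volcano are implemented: the function name `gang_place`, the node names, and the greedy most-free-first choice are all assumptions made for illustration.

```python
from typing import Dict, List, Optional


def gang_place(free_gpus: Dict[str, int], pods: int, gpus_per_pod: int) -> Optional[List[str]]:
    """Return one node name per pod if the whole gang fits, otherwise None.

    Nothing is reserved unless every pod has a slot, which is the
    all-or-nothing property of gang scheduling.
    """
    remaining = dict(free_gpus)  # work on a copy so a failed attempt changes nothing
    placement: List[str] = []
    for _ in range(pods):
        # Nodes that can still hold one more pod of this job.
        candidates = [n for n, free in remaining.items() if free >= gpus_per_pod]
        if not candidates:
            return None  # the gang only partially fits: admit nothing
        # Hypothetical tie-break: most free GPUs first, then node name.
        node = max(candidates, key=lambda n: (remaining[n], n))
        remaining[node] -= gpus_per_pod
        placement.append(node)
    return placement


cluster = {"node-a": 8, "node-b": 4}
print(gang_place(cluster, pods=3, gpus_per_pod=4))  # 12 GPUs needed, 12 free
print(gang_place(cluster, pods=4, gpus_per_pod=4))  # 16 GPUs needed, 12 free
```

The second call returns `None` rather than placing three of the four pods, so the twelve free GPUs stay available to other jobs instead of idling under a job that cannot start.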

Two open-source solutions have converged as the production defaults in 2026:

Kueue. Kubernetes-native, lives in the kubernetes-sigs ecosystem, designed to sit on top of existing clusters as an admission-control layer. Provides queues, workload admission, fair sharing across teams, resource flavours, and topology-aware placement policies. Increasingly the default choice for organisations that want a Kubernetes-first story without operational overhead.
Volcano. CNCF-hosted, batch-scheduling-focused, with strong support for gang scheduling, job-level policies, and complex workload topologies. Deeper feature set than Kueue for pure batch workloads, somewhat heavier to operate. Appropriate when the workload mix is dominated by training with limited inference requirements.
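As a rough illustration of what submitting through Kueue looks like, the sketch below builds a plain `batch/v1` Job manifest in Python and prints it as JSON. The `kueue.x-k8s.io/queue-name` label and the `suspend` field are the standard mechanism for handing a Job to Kueue; the job name, queue name, image, and worker counts are hypothetical placeholders.

```python
import json


def training_job(name: str, queue: str, workers: int, gpus_per_worker: int, image: str) -> dict:
    """Build a batch/v1 Job manifest that is submitted through a Kueue LocalQueue."""
    gpu = {"nvidia.com/gpu": str(gpus_per_worker)}
    return {
        "apiVersion": "batch/v1",
        "kind": "Job",
        "metadata": {
            "name": name,
            # This label tells Kueue which LocalQueue the Job belongs to.
            "labels": {"kueue.x-k8s.io/queue-name": queue},
        },
        "spec": {
            "suspend": True,  # created suspended; Kueue unsuspends it on admission
            "parallelism": workers,
            "completions": workers,
            "template": {
                "spec": {
                    "restartPolicy": "Never",
                    "containers": [
                        {
                            "name": "trainer",
                            "image": image,  # hypothetical image reference
                            "resources": {"requests": gpu, "limits": gpu},
                        }
                    ],
                }
            },
        },
    }


job = training_job("llm-finetune", "team-a-queue", workers=4, gpus_per_worker=8,
                   image="registry.example.com/trainer:latest")
print(json.dumps(job, indent=2))  # kubectl apply -f accepts JSON as well as YAML
```

Because the Job stays suspended until quota for all four workers is available in the queue, no pods are created for a job that cannot yet be admitted in full.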

Topology-aware placement and networking

Large-model training is bottlenecked by inter-GPU communication. Modern training libraries assume specific topologies for high-bandwidth interconnect; the default Kubernetes scheduler has no awareness of whether two pods are on directly-connected GPUs within the same node, on same-switch nodes within a rack, or spread across an oversubscribed network fabric.

The fix is topology-aware scheduling with explicit hints. The combination of the Kubernetes Topology Manager and the GPU operator now exposes topology labels that let schedulers co-locate tightly-coupled jobs. For multi-node training, Kueue's topology-aware placement policies combined with the GPU operator's device-plugin reporting have become reliable enough to count on as production infrastructure.
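To make the idea of co-locating by label concrete, here is a toy sketch that places a job entirely inside one rack or not at all. It is not Kueue's algorithm: the label key `example.com/rack`, the node records, and the tightest-fit preference are assumptions chosen for illustration.

```python
from collections import defaultdict
from typing import Dict, List, Optional

RACK_LABEL = "example.com/rack"  # hypothetical label key, not a real operator label


def pick_single_rack(nodes: List[dict], pods: int, gpus_per_pod: int) -> Optional[List[str]]:
    """Place every pod of a job within one rack, or return None if no rack fits."""
    by_rack: Dict[str, List[dict]] = defaultdict(list)
    for node in nodes:
        by_rack[node["labels"][RACK_LABEL]].append(node)

    best: Optional[List[str]] = None
    for _rack, members in sorted(by_rack.items()):
        # One slot per pod that a node could host with its free GPUs.
        slots: List[str] = []
        for node in members:
            slots += [node["name"]] * (node["free_gpus"] // gpus_per_pod)
        # Prefer the rack with the fewest spare slots, keeping large racks free.
        if len(slots) >= pods and (best is None or len(slots) < len(best)):
            best = slots
    return None if best is None else best[:pods]


nodes = [
    {"name": "n1", "free_gpus": 8, "labels": {RACK_LABEL: "r1"}},
    {"name": "n2", "free_gpus": 8, "labels": {RACK_LABEL: "r1"}},
    {"name": "n3", "free_gpus": 8, "labels": {RACK_LABEL: "r2"}},
]
print(pick_single_rack(nodes, pods=3, gpus_per_pod=4))  # only rack r1 has three slots
print(pick_single_rack(nodes, pods=2, gpus_per_pod=4))  # rack r2 is the tighter fit
```

A topology-blind scheduler would happily spread the three-pod job across `n1`, `n2`, and `n3`, which crosses the rack boundary this sketch refuses to cross.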

For inference at scale, Multi-Instance GPU (MIG) partitioning is the under-used control that lets a single high-end datacentre GPU host many smaller inference workloads at predictable latency – provided the scheduler is MIG-aware. MIG-aware scheduling is operationally more complex than standard GPU scheduling, but the cost economics on inference workloads make the operational investment worth it for sustained-volume deployments.
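The packing arithmetic behind MIG is simple enough to sketch. The instance counts below are the ones NVIDIA documents for an A100 40GB and should be treated as an assumption for any other GPU; the `nvidia.com/mig-<profile>` resource name assumes the device plugin is running with its mixed MIG strategy. The function names are hypothetical.

```python
import math

# Maximum instances per GPU for some MIG profiles on an A100 40GB.
# Assumption: other GPU models expose different profiles and counts.
A100_40GB_INSTANCES = {"1g.5gb": 7, "2g.10gb": 3, "3g.20gb": 2, "7g.40gb": 1}


def gpus_needed(replicas: int, profile: str) -> int:
    """Physical GPUs required to host `replicas` inference pods on one MIG profile."""
    return math.ceil(replicas / A100_40GB_INSTANCES[profile])


def mig_resources(profile: str) -> dict:
    """Container resources stanza requesting a single MIG slice."""
    key = f"nvidia.com/mig-{profile}"
    return {"requests": {key: "1"}, "limits": {key: "1"}}


for profile in ("1g.5gb", "2g.10gb", "7g.40gb"):
    print(profile, gpus_needed(20, profile), mig_resources(profile)["limits"])
```

Twenty small replicas fit on three GPUs with the smallest profile but need twenty GPUs when each replica takes a whole device, which is the cost gap the paragraph above refers to.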

Inference scheduling is fundamentally different

The scheduling pattern that works for training fails for inference. The two workload classes have fundamentally different requirements that the cluster architecture has to acknowledge explicitly.

Training wants co-location, gang scheduling, tolerance for long queue times, checkpoint-and-resume on preemption, and batch-oriented throughput optimisation.
Inference wants autoscaling on request concurrency, horizontal distribution across replicas, latency guarantees with sub-second p99 budgets, resistance to preemption (which would cause user-visible incidents), and scale-to-zero economics when traffic is bursty.

Production inference patterns in 2026

Two open-source patterns have converged as the production defaults for AI inference on Kubernetes in 2026:

KServe on Knative for HTTP inference workloads. Provides scale-to-zero, scale-from-zero on first request, and autoscaling triggered by request concurrency, GPU utilisation, or custom metrics. The right default for synchronous inference behind an HTTP endpoint.
Ray Serve for more complex inference topologies. Multi-model routing, model composition, dynamic batching across requests, deployment graph orchestration. The right choice when the inference architecture is non-trivial – multiple models pipelined per request, A/B routing logic, conditional model selection.
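Concurrency-driven autoscaling with scale-to-zero reduces to a small calculation. The sketch below is a deliberately simplified model: production autoscalers smooth the signal over a time window, and the function name, parameters, and limits here are assumptions for illustration.

```python
import math


def desired_replicas(in_flight: float, target_per_replica: float,
                     min_replicas: int = 0, max_replicas: int = 10) -> int:
    """Replicas needed so each one serves at most `target_per_replica` requests at once."""
    if target_per_replica <= 0:
        raise ValueError("target_per_replica must be positive")
    wanted = math.ceil(in_flight / target_per_replica)
    # Clamp to the configured bounds; min_replicas=0 allows scale-to-zero.
    return max(min_replicas, min(max_replicas, wanted))


for load in (0, 1, 37, 100):
    print(load, desired_replicas(load, target_per_replica=4))
```

With no in-flight requests the result is zero replicas, and the first request arriving afterwards raises it to one, which is the scale-from-zero behaviour. Setting `min_replicas=1` trades idle cost for the removal of cold-start latency.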

Running training and inference on one cluster

Most enterprise teams end up running both training and inference on the same cluster once early experiments with separate clusters have doubled the operational overhead. The shared-cluster pattern works in 2026 when the configuration acknowledges the workload split explicitly.

The operational pattern: Kueue handles training admission and queueing; KServe or Ray Serve handles inference deployment; both coexist with the GPU operator for device reporting and topology labelling; preemption policies are configured so training cannot preempt inference, but low-priority training can preempt other low-priority training. The result is a cluster where inference latency is predictable, training throughput is high, and the team operates one production environment instead of two.
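The preemption rule in that pattern can be written down as a predicate. This sketch encodes one reading of it, and every name in it is hypothetical: inference is never a victim, only workloads in the low-priority tier may be evicted, and the preemptor must have a strictly higher numeric priority.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Workload:
    name: str
    kind: str          # "training" or "inference"
    priority: int      # higher number wins
    preemptible: bool  # True only for the low-priority tier (assumption)


def may_preempt(preemptor: Workload, victim: Workload) -> bool:
    """Return True if `preemptor` is allowed to evict `victim` under the sketched policy."""
    if victim.kind == "inference":
        return False  # serving traffic is never evicted
    if not victim.preemptible:
        return False  # production training keeps its progress
    return preemptor.priority > victim.priority


serving = Workload("chat-api", "inference", priority=100, preemptible=False)
prod_train = Workload("nightly-finetune", "training", priority=50, preemptible=False)
sweep = Workload("hparam-sweep", "training", priority=20, preemptible=True)
scratch = Workload("notebook-run", "training", priority=10, preemptible=True)

print(may_preempt(prod_train, serving))  # training against inference
print(may_preempt(sweep, scratch))       # low-priority against lower-priority
print(may_preempt(sweep, prod_train))    # experiment against production training
```

In a real cluster this logic lives in PriorityClass values and queue preemption settings rather than application code; the predicate is only a way to check that the intended policy is unambiguous before configuring it.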

Quotas, fair sharing, and the politics of GPU allocation

The operational problem that most surprises infrastructure teams when they start scheduling AI workloads on Kubernetes is not technical. It is political: who gets GPUs, when, and who adjudicates between competing requests. The default of first-come-first-served produces exactly the outcomes the reader expects – one team grabbing the entire cluster for a multi-day training run, others waiting, morale-eroding disputes at sprint planning, and the eventual escalation to leadership for arbitration.

Kueue's ClusterQueue and LocalQueue primitives let the platform team carve the cluster into named queues with guaranteed capacity, borrowable capacity, and priority classes that resolve the arbitration in code rather than in meetings. A useful starting topology for a mid-sized AI organisation:

One guaranteed queue per team, sized to cover the team's 50th-percentile load. This is the floor each team can always rely on.
One borrowable pool sized to absorb burst capacity above the per-team guaranteed floors. Teams can use more than their guaranteed allocation when others are not, and yield it back when contention arrives.
One shared low-priority queue for experiments that can tolerate preemption. Casual research and exploratory work lives here; preemption-on-contention frees the capacity when production work needs it.
Preemption policies that protect production: enable preemption for the low-priority queue only, so casual experiments do not kill production training runs that have hours of progress invested.
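One way to express that starting topology is as a set of Kueue ClusterQueues in a single cohort, sketched below as Python dictionaries. The field names follow the `kueue.x-k8s.io/v1beta1` API; the queue names, cohort name, flavour name, GPU counts, and the idea of modelling the burst pool as a queue with quota but no users are all assumptions for illustration. In this sketch, team queues may reclaim their floor from borrowers, while the experiments queue may only preempt its own lower-priority work.

```python
API = "kueue.x-k8s.io/v1beta1"
GPU = "nvidia.com/gpu"


def cluster_queue(name: str, floor: int, borrow: int, reclaims: bool,
                  cohort: str = "ai-platform", flavor: str = "gpu-default") -> dict:
    """ClusterQueue with a guaranteed floor (nominalQuota) and a borrowing cap."""
    return {
        "apiVersion": API,
        "kind": "ClusterQueue",
        "metadata": {"name": name},
        "spec": {
            "cohort": cohort,          # queues in one cohort can lend unused quota
            "namespaceSelector": {},   # empty selector: all namespaces may use it
            "resourceGroups": [{
                "coveredResources": [GPU],
                "flavors": [{
                    "name": flavor,    # hypothetical ResourceFlavor name
                    "resources": [{"name": GPU, "nominalQuota": floor,
                                   "borrowingLimit": borrow}],
                }],
            }],
            "preemption": {
                # Team queues take their floor back from whoever borrowed it.
                "reclaimWithinCohort": "Any" if reclaims else "Never",
                # Only the experiments queue evicts its own lower-priority jobs.
                "withinClusterQueue": "Never" if reclaims else "LowerPriority",
            },
        },
    }


def local_queue(namespace: str, name: str, cluster_queue_name: str) -> dict:
    """Namespaced entry point that a team's jobs reference by name."""
    return {"apiVersion": API, "kind": "LocalQueue",
            "metadata": {"namespace": namespace, "name": name},
            "spec": {"clusterQueue": cluster_queue_name}}


CLUSTER_GPUS = 64  # hypothetical cluster size
queues = [
    cluster_queue("team-a", floor=24, borrow=16, reclaims=True),
    cluster_queue("team-b", floor=24, borrow=16, reclaims=True),
    cluster_queue("experiments", floor=4, borrow=12, reclaims=False),
    cluster_queue("burst-pool", floor=12, borrow=0, reclaims=False),  # lends only
]
entry = local_queue("team-a", "team-a-queue", "team-a")


def nominal(q: dict) -> int:
    return q["spec"]["resourceGroups"][0]["flavors"][0]["resources"][0]["nominalQuota"]


total_floor = sum(nominal(q) for q in queues)
print(total_floor, "of", CLUSTER_GPUS, "GPUs committed as guaranteed quota")
```

The floors sum to exactly the cluster size, so every guarantee can be honoured at once; a ResourceFlavor named `gpu-default` would need to exist for these queues to become active.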

Observability is half the battle

A scheduler that cannot be measured cannot be tuned. Six categories of metrics belong on the wall of any GPU cluster in 2026, with dashboards that the platform team checks daily and the ML teams check weekly:

GPU utilisation per node, per queue, per team. The difference between "GPU allocated" and "GPU actually used" is where money hides. Model FLOPs utilisation (MFU) where the underlying training framework exposes it is the highest-signal single metric on a training cluster.
Queue wait time per queue and per priority. A spike in wait time for the production queue is an early indicator of a misconfigured fair-share policy, an under-provisioned floor, or a workload pattern that has shifted away from the original sizing assumptions.
Preemption count per queue and per job. A cluster that is preempting more than 5% of jobs is either under-provisioned (add capacity) or mis-configured (revisit the preemption policy). Either way, it is not in a healthy operational state.
Job completion time distribution. The tail of jobs that take 5–10x the median time is where the operational anomalies live. P99 completion time tells the team more than the mean.
GPU memory utilisation. The other half of the cost-attribution picture; jobs that under-utilise compute but saturate memory have different operational implications than jobs that under-utilise both.
Per-feature cost attribution. The link between cluster operational metrics and the FinOps cost story; every job carries the tags that let the team attribute cost back to the product feature that triggered the workload.
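Several of these signals are simple ratios once the raw numbers are exported. The sketch below computes three of them for one queue: the gap between allocated and used GPU time, the preemption rate against the 5% threshold mentioned above, and the ratio of P99 to median completion time. The function names and the sample figures are invented for illustration.

```python
from statistics import median
from typing import Dict, List


def percentile(values: List[float], pct: int) -> float:
    """Nearest-rank percentile; `pct` is an integer from 1 to 100."""
    ordered = sorted(values)
    rank = max(1, (pct * len(ordered) + 99) // 100)  # integer ceiling of pct*n/100
    return ordered[rank - 1]


def queue_health(allocated_gpu_hours: float, used_gpu_hours: float,
                 jobs_total: int, jobs_preempted: int,
                 completion_minutes: List[float]) -> Dict[str, object]:
    """Summarise one queue's utilisation gap, preemption rate, and completion tail."""
    preemption_rate = jobs_preempted / jobs_total
    return {
        # Share of allocated GPU time that did no work.
        "idle_allocated_fraction": 1 - used_gpu_hours / allocated_gpu_hours,
        "preemption_rate": preemption_rate,
        "preemption_alert": preemption_rate > 0.05,
        "p99_over_median": percentile(completion_minutes, 99) / median(completion_minutes),
    }


# Invented sample: 98 jobs at 30 minutes, one at 60, one straggler at 300.
sample = [30.0] * 98 + [60.0, 300.0]
report = queue_health(1000.0, 620.0, jobs_total=200, jobs_preempted=14,
                      completion_minutes=sample)
for key, value in report.items():
    print(key, value)
```

The mean of the sample would hide the straggler almost entirely, whereas the tail ratio and a look at the maximum make it visible, which is the point made above about P99 being more informative than the mean.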

The rollout plan that works

For teams that have grown organically and are now hitting the scheduling wall, a pragmatic 90-day sequence lands the upgrade cleanly most of the time. The win at the end of the cycle is usually substantial: utilisation up 15–25 percentage points, queue-wait complaints down sharply, and – most importantly – the cluster stops being the bottleneck for the ML teams that depend on it.

Weeks 1–2. Install the GPU operator if not already present, get DCGM metrics flowing into the observability stack, baseline current utilisation and wait-time numbers for comparison.
Weeks 3–4. Stand up Kueue, migrate one team's workload to a queue with explicit capacity allocation, leave the rest of the cluster on default scheduling as a baseline. The single-team pilot lets the platform team learn the operational pattern before scaling it.
Weeks 5–8. Expand Kueue coverage to the remaining teams, introduce priority classes, enable preemption for the lowest-priority queue only. Communicate the queue model to the teams so they understand the operational expectations.
Weeks 9–12. Add topology-aware placement for multi-GPU training, set up the observability dashboards on wait time and utilisation per queue, run a retrospective with the ML teams on how the new model has affected their workflow.

Common failure modes during the rollout

Recurring patterns in GPU-scheduling rollouts that look successful on the dashboard but frustrate ML teams on the ground:

Capacity guarantees set too high. If the per-team guaranteed floors sum to more than the cluster's actual capacity, the queueing system cannot honour the guarantees and the team that loses the race ends up worse off than under the original first-come-first-served model.
Borrowable capacity not configured. Without borrowable capacity above the floors, teams cannot burst even when capacity is idle. Utilisation stays low, and the platform team gets blamed because "the new queueing system is restrictive".
Preemption enabled too aggressively. If preemption is enabled for the standard priority class rather than only the low-priority class, production training runs lose hours of progress to casual experiments. The trust loss is hard to recover.
No communication to ML teams. The queueing model is the user-facing change; without explicit communication of how it works, ML teams interpret normal queue waits as the cluster being broken.
Topology-aware placement skipped. The queueing infrastructure gets configured correctly but the topology hints are not exposed, so multi-GPU training jobs end up scheduled across slow network fabric and run slower than they would have on the simpler topology.
Observability ships late. The team configures the scheduling without the dashboards that show whether it is working. Tuning blind is harder than tuning with measurement.
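The first two failure modes are cheap to catch before anything is applied to the cluster. A minimal sketch of such a pre-flight check follows; the function name, the plan structure, and the team names are hypothetical.

```python
from typing import Dict, List


def check_quota_plan(cluster_gpus: int, floors: Dict[str, int],
                     borrow_limits: Dict[str, int]) -> List[str]:
    """Return human-readable problems found in a per-team quota plan."""
    problems: List[str] = []
    total = sum(floors.values())
    if total > cluster_gpus:
        problems.append(
            f"guaranteed floors sum to {total} GPUs but the cluster has {cluster_gpus}")
    for team in sorted(floors):
        if borrow_limits.get(team, 0) == 0:
            problems.append(f"{team} has no borrowable capacity and cannot burst")
    return problems


plan_floors = {"team-a": 40, "team-b": 32}
plan_borrow = {"team-a": 8}
for problem in check_quota_plan(64, plan_floors, plan_borrow):
    print(problem)
```

Running a check like this in CI against the queue definitions keeps an oversubscribed plan from reaching the teams it would affect.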

Why this compounds across infrastructure quarters

The organisations that treat GPU scheduling as "platform plumbing" – an ongoing optimisation discipline rather than a one-time project – tend to compound the gains over subsequent quarters. Utilisation rises another 5–10 points in year two as the team tunes the queueing topology against actual workload patterns. Cost-per-training-run drops as the topology hints catch the cases where the original placement was sub-optimal. New workload patterns get absorbed without re-architecting the cluster.

The organisations that treat the upgrade as a one-time project drift back to default-like patterns over 12–18 months. The Kueue configuration gets less attention as the team rotates; the topology hints stop being checked; the dashboards stay up but nobody reads them. The cluster utilisation slowly slides back toward the pre-Kueue baseline, and the cost-per-GPU-hour creeps back up.

The structural difference between the two outcomes is whether the team treats GPU scheduling as an asset that needs operational ownership, or as a project that ships and then becomes someone else's problem. The asset framing is what produces durable utilisation gains.

Frequently asked questions

Common questions raised by platform teams scoping a GPU-scheduling upgrade for their AI workloads:

Should I use Kueue or Volcano? Kueue for most enterprise clusters in 2026 – it is Kubernetes-native, simpler to operate, and the feature set has matured to cover the common cases. Volcano if the workload mix is dominated by complex batch training and the team can absorb the additional operational complexity.
How do I handle a mixed workload of training, inference, and notebooks? Kueue for training admission, KServe or Ray Serve for inference, and per-user TTL policies for notebook capacity. The three workload classes have different scheduling requirements; conflating them in one queue produces sub-optimal outcomes for all three.
What is the right ratio of training to inference capacity on a shared cluster? Workload-dependent. Production AI shops typically run 60–80% inference and 20–40% training in steady state; research-heavy environments invert the ratio. Measure the actual demand and tune; the right answer is the empirical one for the specific team.
How does this interact with multi-cloud strategy? The Kubernetes scheduling primitives are largely portable across cloud providers. The GPU-specific configuration (operator versions, MIG support, topology reporting) varies by cloud and by region; budget for the cross-cloud variation in the platform engineering plan.
When does Kubernetes-on-GPU stop being the right answer? When the deployment is small enough (single-node, single-GPU) that the operational overhead of Kubernetes exceeds the benefit. Below that threshold, a simpler container orchestrator or even direct GPU scheduling can be appropriate. Above it, Kubernetes plus Kueue is the established default.
