
Why minDomains Is Critical for TopologySpreadConstraints on EKS with EBS

Updated April 10, 2026 · 4 min read

Background

When deploying a StatefulSet (e.g., a 3-replica etcd cluster) on EKS with Karpenter and EBS-backed PVCs, topologySpreadConstraints with DoNotSchedule may silently fail to distribute pods across Availability Zones. All pods and their EBS volumes can end up in a single AZ, completely defeating the high-availability intent.

The Configuration That Failed

topologySpreadConstraints:
  - labelSelector:
      matchLabels:
        app.kubernetes.io/name: etcd
    maxSkew: 1
    topologyKey: topology.kubernetes.io/zone
    whenUnsatisfiable: DoNotSchedule

Expected: 3 pods spread across 3 AZs (1-1-1). Actual: All 3 pods scheduled in us-east-1a.

Root Cause: minDomains Defaults to 1

The Kubernetes scheduler only counts topology domains from nodes that already exist. When Karpenter batch-creates nodes in a single AZ, the scheduler sees only one eligible domain.

With only one domain, the skew calculation reduces to:

skew = max_pods_in_any_zone - min_pods_in_any_zone = 3 - 3 = 0

A skew of 0 always satisfies maxSkew: 1, so DoNotSchedule never fires. The constraint is technically met — it just doesn't do what you expect.

This happens because an unset minDomains behaves as if it were 1. The scheduler only requires at least 1 topology domain to exist, and with all nodes in a single AZ, that requirement is trivially satisfied.

The Fix: Set minDomains Explicitly

topologySpreadConstraints:
  - labelSelector:
      matchLabels:
        app.kubernetes.io/name: etcd
    maxSkew: 1
    minDomains: 3
    topologyKey: topology.kubernetes.io/zone
    whenUnsatisfiable: DoNotSchedule

With minDomains: 3, the scheduler requires at least 3 zone domains with eligible nodes. When fewer exist, it treats the global minimum as 0, so a second pod in us-east-1a would produce a skew of 2 - 0 = 2, violating maxSkew: 1. That pod stays Pending, and Karpenter sees the Pending pod and is forced to launch a node in a different AZ.
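For Karpenter to act on that Pending pod, the NodePool doing the provisioning must be allowed to choose among multiple zones. A minimal sketch (the NodePool name, EC2NodeClass reference, and zone list here are assumptions for illustration):

```yaml
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: multi-az          # assumed name
spec:
  template:
    spec:
      nodeClassRef:
        group: karpenter.k8s.aws
        kind: EC2NodeClass
        name: default     # assumed EC2NodeClass
      requirements:
        # Permit all three zones so a pod Pending on a zone-spread
        # constraint can actually be satisfied by a new node.
        - key: topology.kubernetes.io/zone
          operator: In
          values: ["us-east-1a", "us-east-1b", "us-east-1c"]
```

If the NodePool is restricted to a single zone, minDomains: 3 will leave replicas Pending forever rather than forcing a spread.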

Why This Specifically Affects EBS + Karpenter

Three factors combine to create this problem:

  1. Karpenter batch-creates nodes. When multiple pods go Pending simultaneously, Karpenter may launch all nodes in whichever AZ is cheapest or has the most capacity.

  2. The scheduler only knows existing domains. Unlike podAntiAffinity, which rejects a node outright, topologySpreadConstraints calculates skew across known domains. No node in a zone = that zone doesn't exist to the scheduler.

  3. EBS volumes are zone-locked. Once a PVC binds to an EBS volume in us-east-1a, that pod can never move to another AZ. Even with WaitForFirstConsumer, if the first scheduling decision is wrong, the volume permanently pins the pod.
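Factor 3 is why the StorageClass should at minimum defer volume creation until after scheduling. A typical gp3 StorageClass for the AWS EBS CSI driver (the name is an assumption for this example):

```yaml
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: ebs-gp3-wffc             # assumed name
provisioner: ebs.csi.aws.com     # AWS EBS CSI driver
# Defer volume creation until a pod using the PVC is scheduled,
# so the EBS volume is created in the zone the scheduler chose.
volumeBindingMode: WaitForFirstConsumer
parameters:
  type: gp3
```

WaitForFirstConsumer makes the volume follow the scheduling decision rather than the reverse, but as noted above it cannot undo a wrong first decision — the spread constraints themselves still have to be correct.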

Alternative: Use podAntiAffinity with Zone TopologyKey

If minDomains is not available (it reached beta and became enabled by default in Kubernetes 1.25, and GA in 1.30), podAntiAffinity is a stronger and simpler guarantee:

podAntiAffinity:
  requiredDuringSchedulingIgnoredDuringExecution:
    - labelSelector:
        matchLabels:
          app.kubernetes.io/name: etcd
      topologyKey: topology.kubernetes.io/zone

This tells the scheduler: "no two etcd pods in the same zone, period." The scheduler will reject any node in an AZ that already has an etcd pod, which forces Karpenter to provision nodes in other AZs.

Comparison

topologySpreadConstraints without minDomains
  - Guarantees cross-AZ spread: No
  - With single-AZ nodes: silently allows all pods in 1 AZ
  - Flexibility for >3 replicas: allows uneven but bounded spread
  - Karpenter awareness: weak signal

topologySpreadConstraints with minDomains
  - Guarantees cross-AZ spread: Yes
  - With single-AZ nodes: pods stay Pending until more AZs are available
  - Flexibility for >3 replicas: allows uneven but bounded spread
  - Karpenter awareness: strong signal (Pending pods)

podAntiAffinity with zone topologyKey
  - Guarantees cross-AZ spread: Yes
  - With single-AZ nodes: pods stay Pending until more AZs are available
  - Flexibility for >3 replicas: strictly 1 pod per AZ (max replicas = number of AZs)
  - Karpenter awareness: strongest signal (hard node rejection)
Recommended: Combine Both

For a 3-replica StatefulSet, the strongest configuration pairs zone-level spread with hostname-level anti-affinity:
spec:
  template:
    spec:
      topologySpreadConstraints:
        - labelSelector:
            matchLabels:
              app.kubernetes.io/name: etcd
          maxSkew: 1
          minDomains: 3
          topologyKey: topology.kubernetes.io/zone
          whenUnsatisfiable: DoNotSchedule
      affinity:
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            - labelSelector:
                matchLabels:
                  app.kubernetes.io/name: etcd
              topologyKey: kubernetes.io/hostname

This combines minDomains: 3 for cross-AZ spread with hostname anti-affinity for cross-node spread, providing the strongest possible scheduling guarantee for EBS-backed StatefulSets.
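For completeness, the EBS-backed claims discussed throughout would typically come from a volumeClaimTemplates block like the following sketch (the storage class name and size are assumptions; the class should use WaitForFirstConsumer binding):

```yaml
volumeClaimTemplates:
  - metadata:
      name: data
    spec:
      accessModes: ["ReadWriteOnce"]    # EBS volumes attach to one node at a time
      storageClassName: ebs-gp3-wffc    # assumed WaitForFirstConsumer class
      resources:
        requests:
          storage: 10Gi                 # assumed size
```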