Engineering · July 22, 2024 · 12 min read

Scaling Kubernetes to 10,000 Nodes

Lessons learned from managing massive multi-tenant Kubernetes clusters and the optimizations that made it possible.

Sarah Kim
Principal Platform Engineer

Last month, we crossed a significant milestone: our largest regional cluster now manages over 10,000 nodes. Getting here wasn't straightforward. The standard Kubernetes deployment patterns that work at 100 nodes start breaking down at 1,000, and completely fall apart at 5,000. This post documents the journey and the hard-won lessons along the way.

Where Kubernetes Breaks

Kubernetes is officially supported for clusters of up to 5,000 nodes. But even before you hit that number, you'll run into problems.

We hit our first wall at around 2,000 nodes. API server latency spiked, etcd started falling behind, and the scheduler couldn't keep up with pod churn. Here's how we fixed it.

Etcd: The Foundation

Everything in Kubernetes flows through etcd. At scale, etcd performance is the single most important factor in cluster health. We made several optimizations:

Dedicated etcd Nodes

Our etcd clusters run on dedicated nodes with local NVMe SSDs. On AWS we keep the root volume on io2 EBS, but local NVMe gives 3-5x better latency for write-heavy workloads, so etcd's data directory lives on the local drive.

# etcd node configuration
spec:
  machineType: m5d.2xlarge  # Local NVMe
  rootVolume:
    size: 100Gi
    type: io2
    iops: 10000
  etcd:
    dataDir: /mnt/nvme/etcd  # Local NVMe mount
    quotaBackendBytes: 8589934592  # 8GB quota
    heartbeatInterval: 100ms
    electionTimeout: 1000ms

Tuning etcd Parameters

Default etcd settings are conservative. We tuned several of them:

# /etc/etcd/etcd.conf
ETCD_QUOTA_BACKEND_BYTES=8589934592
ETCD_SNAPSHOT_COUNT=10000
ETCD_HEARTBEAT_INTERVAL=100
ETCD_ELECTION_TIMEOUT=1000
ETCD_MAX_SNAPSHOTS=5
ETCD_MAX_WALS=5
ETCD_AUTO_COMPACTION_MODE=periodic
ETCD_AUTO_COMPACTION_RETENTION=1h

The auto-compaction settings are critical. Without them, etcd's database grows unbounded and performance degrades.

Event Segregation

Kubernetes Events are high-volume, low-value data that doesn't need the same durability as core resources. We run a separate etcd cluster just for Events:

# kube-apiserver configuration
apiVersion: v1
kind: Pod
metadata:
  name: kube-apiserver
spec:
  containers:
  - command:
    - kube-apiserver
    - --etcd-servers=https://etcd-main:2379
    - --etcd-servers-overrides=/events#https://etcd-events:2379

This single change reduced load on our main etcd cluster by 40%.

API Server Optimizations

The API server is the gateway to your cluster. At scale, it becomes a bottleneck for both reads and writes.

Horizontal Scaling

We run 5 API server replicas behind a load balancer. Kubernetes API servers are stateless and scale horizontally, but you need to be careful about watch connections:

# API server deployment
apiVersion: apps/v1
kind: Deployment
metadata:
  name: kube-apiserver
spec:
  replicas: 5
  template:
    spec:
      containers:
      - name: kube-apiserver
        args:
        - --max-requests-inflight=800
        - --max-mutating-requests-inflight=400
        - --watch-cache-sizes=pods#10000,nodes#1000
        - --default-watch-cache-size=500
        resources:
          requests:
            cpu: "4"
            memory: "16Gi"
          limits:
            cpu: "8"
            memory: "32Gi"

API Priority and Fairness

API Priority and Fairness (APF), which reached GA in Kubernetes 1.29, prevents a single misbehaving controller from overwhelming the API server:

apiVersion: flowcontrol.apiserver.k8s.io/v1beta3
kind: PriorityLevelConfiguration
metadata:
  name: system-critical
spec:
  type: Limited
  limited:
    nominalConcurrencyShares: 100
    limitResponse:
      type: Queue
      queuing:
        queues: 128
        handSize: 6
        queueLengthLimit: 50

We created custom FlowSchemas for our auction engine and provisioning controllers to ensure they get priority access.
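
Here's a sketch of what one of those FlowSchemas looks like; the controller's service account name and namespace are placeholders rather than our real identities:

# Illustrative FlowSchema: route a controller's requests to the
# system-critical priority level defined above
apiVersion: flowcontrol.apiserver.k8s.io/v1beta3
kind: FlowSchema
metadata:
  name: provisioning-controller
spec:
  priorityLevelConfiguration:
    name: system-critical
  matchingPrecedence: 500      # lower values are evaluated first
  distinguisherMethod:
    type: ByUser
  rules:
  - subjects:
    - kind: ServiceAccount
      serviceAccount:
        name: provisioning-controller   # placeholder service account
        namespace: platform-system      # placeholder namespace
    resourceRules:
    - verbs: ["*"]
      apiGroups: ["*"]
      resources: ["*"]
      clusterScope: true
      namespaces: ["*"]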

Scheduler Performance

The default scheduler evaluates every node for every pod. At 10,000 nodes, this becomes prohibitively slow.

Percentage of Nodes to Score

The percentageOfNodesToScore parameter limits how many feasible nodes the scheduler scores. By default it adapts to cluster size, shrinking from roughly 50% toward a 5% floor as the cluster grows, while always scoring at least 100 nodes. We pinned it to 10%:

# scheduler configuration
apiVersion: kubescheduler.config.k8s.io/v1
kind: KubeSchedulerConfiguration
percentageOfNodesToScore: 10
profiles:
- schedulerName: default-scheduler
  plugins:
    score:
      disabled:
      - name: PodTopologySpread  # Expensive at scale
      - name: InterPodAffinity   # Expensive at scale

Scheduler Sharding

For our multi-tenant setup, we run multiple scheduler instances, each responsible for a subset of namespaces. This provides both isolation and scale:

# Per-tenant scheduler
apiVersion: v1
kind: Pod
metadata:
  name: scheduler-tenant-a
spec:
  containers:
  - name: scheduler
    command:
    - kube-scheduler
    - --leader-elect=true
    - --leader-elect-resource-name=scheduler-tenant-a
    - --config=/etc/scheduler/config.yaml

Each tenant's pods specify their scheduler via the schedulerName field in the pod spec.
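
For example, a tenant's workload opts into its scheduler like this (the deployment, namespace, and image below are made up for illustration):

# Workload pinned to tenant A's scheduler via schedulerName
apiVersion: apps/v1
kind: Deployment
metadata:
  name: tenant-a-api          # illustrative name
  namespace: tenant-a
spec:
  replicas: 3
  selector:
    matchLabels:
      app: tenant-a-api
  template:
    metadata:
      labels:
        app: tenant-a-api
    spec:
      schedulerName: scheduler-tenant-a   # matches the scheduler above
      containers:
      - name: api
        image: registry.example.com/tenant-a/api:1.0   # placeholder image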

Network Scaling

Network control plane is often overlooked but critical at scale.

Ditching kube-proxy

kube-proxy in iptables mode programs a number of iptables rules that grows with services × endpoints. At scale, iptables updates become slow and CPU-intensive.

We switched to Cilium with eBPF-based service routing. The performance difference is dramatic:

Metric                        kube-proxy (iptables)    Cilium (eBPF)
Service update latency        2-5 seconds              10-50ms
CPU usage (10K endpoints)     40% of 1 core            5% of 1 core
Memory usage                  500MB                    150MB
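
If you're considering the same switch, the core of it is a values change in the upstream cilium/cilium Helm chart. The sketch below assumes Cilium 1.14+ and uses a placeholder API server endpoint; treat it as a starting point, not our production values:

# values.yaml sketch for the cilium/cilium Helm chart
kubeProxyReplacement: true
# With kube-proxy gone, Cilium needs a direct API server endpoint:
k8sServiceHost: k8s-api.internal.example.com   # placeholder
k8sServicePort: 6443
bpf:
  masquerade: true   # eBPF masquerading instead of iptables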

EndpointSlices

We ensure all services use EndpointSlices instead of the legacy Endpoints API. EndpointSlices chunk endpoint data into smaller pieces, reducing the size of individual API objects and watch events.
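
If you haven't looked at one, an EndpointSlice is just a small, separately-watchable chunk of a service's endpoints, created and managed by the control plane. The object below is illustrative, with made-up names and addresses:

# Controller-managed EndpointSlice (illustrative values)
apiVersion: discovery.k8s.io/v1
kind: EndpointSlice
metadata:
  name: auction-engine-x7k2p
  labels:
    kubernetes.io/service-name: auction-engine   # owning Service
addressType: IPv4
ports:
- name: http
  protocol: TCP
  port: 8080
endpoints:
- addresses: ["10.42.7.15"]
  conditions:
    ready: true
  nodeName: node-1234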

Monitoring at Scale

Standard Prometheus setups don't scale to 10,000 nodes. We use a federated architecture:

┌─────────────────────────────────────────────────────────┐
│                  Global Prometheus                       │
│             (aggregated metrics only)                    │
└─────────────────────────────────────────────────────────┘
                           ▲
           ┌───────────────┼───────────────┐
           │               │               │
    ┌──────┴─────┐  ┌──────┴─────┐  ┌──────┴─────┐
    │ Regional   │  │ Regional   │  │ Regional   │
    │ Prometheus │  │ Prometheus │  │ Prometheus │
    │ (us-west)  │  │ (us-east)  │  │ (eu-west)  │
    └────────────┘  └────────────┘  └────────────┘
           ▲               ▲               ▲
    ┌──────┴─────┐  ┌──────┴─────┐  ┌──────┴─────┐
    │ Shard 1-N  │  │ Shard 1-N  │  │ Shard 1-N  │
    │ (per-node) │  │ (per-node) │  │ (per-node) │
    └────────────┘  └────────────┘  └────────────┘

Each node runs a Prometheus agent that forwards metrics to a regional aggregator. The regional aggregators compute summaries and forward those to a global instance for dashboards and alerting.
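
The node-level agent config is minimal: scrape a handful of local targets and remote_write everything upstream. The sketch below uses a placeholder scrape target and aggregator URL:

# Prometheus agent-mode config on each node (sketch)
global:
  scrape_interval: 30s
scrape_configs:
- job_name: node
  static_configs:
  - targets: ["localhost:9100"]   # node_exporter on the same host
remote_write:
- url: https://prometheus-us-west.example.internal/api/v1/write   # regional aggregator (placeholder)
  queue_config:
    max_samples_per_send: 5000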

Key Metrics to Watch

These are the metrics that predict problems before they become outages:

# Critical alerts for large clusters
groups:
- name: cluster-scale
  rules:
  - alert: EtcdDatabaseSize
    expr: etcd_mvcc_db_total_size_in_bytes > 6e9
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: "etcd database approaching quota"

  - alert: APIServerLatency
    expr: histogram_quantile(0.99,
      rate(apiserver_request_duration_seconds_bucket[5m])) > 1
    for: 10m
    labels:
      severity: critical
    annotations:
      summary: "API server p99 latency > 1s"

  - alert: SchedulerBindingLatency
    expr: histogram_quantile(0.99,
      rate(scheduler_binding_duration_seconds_bucket[5m])) > 5
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: "Scheduler binding latency high"

Lessons Learned

After years of scaling Kubernetes, here's what we know for sure:

  1. Invest in etcd early. It's the foundation. By the time etcd becomes a problem, everything else is broken too.
  2. Measure everything. You can't optimize what you don't measure. Instrument your control plane thoroughly.
  3. Test at scale. Problems that don't appear at 100 nodes will appear at 1,000. Have a staging environment that mirrors production scale.
  4. Plan for failure. At 10,000 nodes, something is always failing. Design for graceful degradation.

If you're running Kubernetes at scale and want to learn more, feel free to reach out. And if you want to work on these problems, we're hiring.

Sarah Kim is a Principal Platform Engineer at KubeBid, where she leads the infrastructure team. Previously, she worked on Kubernetes at Google.
