Last month, we crossed a significant milestone: our largest regional cluster now manages over 10,000 nodes. Getting here wasn't straightforward. The standard Kubernetes deployment patterns that work at 100 nodes start breaking down at 1,000, and completely fall apart at 5,000. This post documents the journey and the hard-won lessons along the way.
Where Kubernetes Breaks
Kubernetes was designed for clusters of up to 5,000 nodes—that's the official "supported" limit. But even before you hit that number, you'll encounter problems:
- etcd performance: The control plane's data store becomes a bottleneck
- API server load: LIST and WATCH operations don't scale linearly
- Scheduler throughput: Scheduling decisions slow down dramatically
- Network control plane: kube-proxy and CNI plugins struggle with large endpoint lists
We hit our first wall at around 2,000 nodes. API server latency spiked, etcd started falling behind, and the scheduler couldn't keep up with pod churn. Here's how we fixed it.
Etcd: The Foundation
Everything in Kubernetes flows through etcd. At scale, etcd performance is the single most important factor in cluster health. We made several optimizations:
Dedicated etcd Nodes
Our etcd clusters run on dedicated nodes with local NVMe SSDs. We started on io2 EBS volumes on AWS, but local NVMe gave us 3-5x lower latency for etcd's write-heavy workload, so the etcd data directory now lives on the instance's NVMe drive while io2 backs only the root volume.
# etcd node configuration
spec:
  machineType: m5d.2xlarge        # Local NVMe
  rootVolume:
    size: 100Gi
    type: io2
    iops: 10000
  etcd:
    dataDir: /mnt/nvme/etcd       # Local NVMe mount
    quotaBackendBytes: 8589934592 # 8GB quota
    heartbeatInterval: 100ms
    electionTimeout: 1000ms
Tuning etcd Parameters
Default etcd settings are conservative. We increased several parameters:
# /etc/etcd/etcd.conf
ETCD_QUOTA_BACKEND_BYTES=8589934592
ETCD_SNAPSHOT_COUNT=10000
ETCD_HEARTBEAT_INTERVAL=100
ETCD_ELECTION_TIMEOUT=1000
ETCD_MAX_SNAPSHOTS=5
ETCD_MAX_WALS=5
ETCD_AUTO_COMPACTION_MODE=periodic
ETCD_AUTO_COMPACTION_RETENTION=1h
The auto-compaction settings are critical. Without them, old revisions accumulate until the database hits the backend quota, etcd raises a NOSPACE alarm, and the cluster goes read-only.
Event Segregation
Kubernetes Events are high-volume, low-value data that doesn't need the same durability as core resources. We run a separate etcd cluster just for Events:
# kube-apiserver configuration
apiVersion: v1
kind: Pod
metadata:
  name: kube-apiserver
spec:
  containers:
  - command:
    - kube-apiserver
    - --etcd-servers=https://etcd-main:2379
    - --etcd-servers-overrides=/events#https://etcd-events:2379
This single change reduced load on our main etcd cluster by 40%.
API Server Optimizations
The API server is the gateway to your cluster. At scale, it becomes a bottleneck for both reads and writes.
Horizontal Scaling
We run 5 API server replicas behind a load balancer. API servers keep no persistent state of their own, so they scale horizontally, but long-lived watch connections stick to whichever replica they first landed on, which can leave one replica carrying most of the load:
# API server deployment
apiVersion: apps/v1
kind: Deployment
metadata:
  name: kube-apiserver
spec:
  replicas: 5
  template:
    spec:
      containers:
      - name: kube-apiserver
        args:
        - --max-requests-inflight=800
        - --max-mutating-requests-inflight=400
        - --watch-cache-sizes=pods#10000,nodes#1000
        - --default-watch-cache-size=500
        resources:
          requests:
            cpu: "4"
            memory: "16Gi"
          limits:
            cpu: "8"
            memory: "32Gi"
Request Priority and Fairness
API Priority and Fairness (APF), which graduated to stable in Kubernetes 1.29, prevents a single misbehaving controller from overwhelming the API server:
apiVersion: flowcontrol.apiserver.k8s.io/v1
kind: PriorityLevelConfiguration
metadata:
  name: system-critical
spec:
  type: Limited
  limited:
    nominalConcurrencyShares: 100
    limitResponse:
      type: Queue
      queuing:
        queues: 128
        handSize: 6
        queueLengthLimit: 50
We created custom FlowSchemas for our auction engine and provisioning controllers to ensure they get priority access.
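For illustration, a FlowSchema along these lines routes a controller's requests to the priority level defined above; the service account name and namespace here are placeholders, not our actual manifests.

apiVersion: flowcontrol.apiserver.k8s.io/v1
kind: FlowSchema
metadata:
  name: auction-engine
spec:
  priorityLevelConfiguration:
    name: system-critical         # the PriorityLevelConfiguration above
  matchingPrecedence: 500         # lower values match first; stay below the catch-all
  distinguisherMethod:
    type: ByUser
  rules:
  - subjects:
    - kind: ServiceAccount
      serviceAccount:
        name: auction-engine      # placeholder service account
        namespace: auction        # placeholder namespace
    resourceRules:
    - verbs: ["*"]
      apiGroups: ["*"]
      resources: ["*"]
      namespaces: ["*"]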
Scheduler Performance
The default scheduler evaluates every node for every pod. At 10,000 nodes, this becomes prohibitively slow.
Percentage of Nodes to Score
The percentageOfNodesToScore parameter caps how many feasible nodes the scheduler scores for each pod. By default it adapts automatically: roughly 50% for small clusters, tapering toward 5% as the cluster grows, and never fewer than 100 nodes. We pinned it to 10%:
# scheduler configuration
apiVersion: kubescheduler.config.k8s.io/v1
kind: KubeSchedulerConfiguration
percentageOfNodesToScore: 10
profiles:
- schedulerName: default-scheduler
  plugins:
    score:
      disabled:
      - name: PodTopologySpread   # Expensive at scale
      - name: InterPodAffinity    # Expensive at scale
Scheduler Sharding
For our multi-tenant setup, we run multiple scheduler instances, each responsible for a subset of namespaces. This provides both isolation and scale:
# Per-tenant scheduler
apiVersion: v1
kind: Pod
metadata:
  name: scheduler-tenant-a
spec:
  containers:
  - name: scheduler
    command:
    - kube-scheduler
    - --leader-elect=true
    - --leader-elect-resource-name=scheduler-tenant-a
    - --config=/etc/scheduler/config.yaml
Each tenant's pods specify their scheduler via the schedulerName field in the pod spec.
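As a sketch, a tenant workload opts in like this; the tenant-a-scheduler name is illustrative and has to match whatever schedulerName the tenant scheduler's KubeSchedulerConfiguration profile declares in /etc/scheduler/config.yaml.

# Illustrative tenant pod; schedulerName must match the tenant scheduler's profile
apiVersion: v1
kind: Pod
metadata:
  name: bidder
  namespace: tenant-a
spec:
  schedulerName: tenant-a-scheduler                    # placeholder scheduler name
  containers:
  - name: bidder
    image: registry.example.com/tenant-a/bidder:1.0    # placeholder image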
Network Scaling
Network control plane is often overlooked but critical at scale.
Ditching kube-proxy
kube-proxy in iptables mode programs rule chains per service and per endpoint, so the rule count grows with services × endpoints and every change triggers a rewrite of the whole table. At scale, iptables updates become slow and CPU-intensive.
We switched to Cilium with eBPF-based service routing. The performance difference is dramatic:
| Metric | kube-proxy (iptables) | Cilium (eBPF) |
|---|---|---|
| Service update latency | 2-5 seconds | 10-50ms |
| CPU usage (10K endpoints) | 40% of 1 core | 5% of 1 core |
| Memory usage | 500MB | 150MB |
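The migration itself is mostly a matter of installing Cilium with kube-proxy replacement enabled. Here is a minimal sketch of the Helm values involved; exact keys vary by Cilium version, and the API server host and port are placeholders for your environment.

# values.yaml for the Cilium Helm chart (sketch; verify against your Cilium version)
kubeProxyReplacement: true                   # eBPF service routing instead of iptables
k8sServiceHost: api.k8s.example.internal     # placeholder: API server endpoint Cilium dials directly
k8sServicePort: 6443
bpf:
  masquerade: true                           # eBPF masquerading instead of iptables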
EndpointSlices
We ensure all services use EndpointSlices instead of the legacy Endpoints API. EndpointSlices chunk endpoint data into smaller pieces, reducing the size of individual API objects and watch events.
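For context, an EndpointSlice is an ordinary API object owned by its Service, and a large Service is split across many slices (by default up to 100 endpoints each), so watchers receive small deltas rather than one giant Endpoints object. A trimmed example of what the control plane generates; the names and addresses are illustrative.

apiVersion: discovery.k8s.io/v1
kind: EndpointSlice
metadata:
  name: bid-api-abc12                        # generated name; illustrative
  labels:
    kubernetes.io/service-name: bid-api      # ties the slice back to its Service
addressType: IPv4
ports:
- name: http
  port: 8080
  protocol: TCP
endpoints:
- addresses: ["10.4.12.7"]
  conditions:
    ready: true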
Monitoring at Scale
Standard Prometheus setups don't scale to 10,000 nodes. We use a federated architecture:
┌─────────────────────────────────────────────────────────┐
│ Global Prometheus │
│ (aggregated metrics only) │
└─────────────────────────────────────────────────────────┘
▲
┌───────────────┼───────────────┐
│ │ │
┌──────┴─────┐ ┌──────┴─────┐ ┌──────┴─────┐
│ Regional │ │ Regional │ │ Regional │
│ Prometheus │ │ Prometheus │ │ Prometheus │
│ (us-west) │ │ (us-east) │ │ (eu-west) │
└────────────┘ └────────────┘ └────────────┘
▲ ▲ ▲
┌──────┴─────┐ ┌──────┴─────┐ ┌──────┴─────┐
│ Shard 1-N │ │ Shard 1-N │ │ Shard 1-N │
│ (per-node) │ │ (per-node) │ │ (per-node) │
└────────────┘ └────────────┘ └────────────┘
Each node runs a Prometheus agent that forwards metrics to a regional aggregator. The regional aggregators compute summaries and forward those to a global instance for dashboards and alerting.
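A minimal sketch of the per-node agent configuration, assuming Prometheus runs in agent mode and the regional aggregator exposes a standard remote-write endpoint; the scrape target, labels, and URL are placeholders.

# prometheus-agent.yaml (run with --enable-feature=agent)
global:
  scrape_interval: 30s
  external_labels:
    region: us-west                          # set per region
scrape_configs:
- job_name: node
  static_configs:
  - targets: ["localhost:9100"]              # placeholder: node-exporter on this node
remote_write:
- url: https://prometheus-us-west.example.internal/api/v1/write   # placeholder aggregator endpoint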
Key Metrics to Watch
These are the metrics that predict problems before they become outages:
# Critical alerts for large clusters
groups:
- name: cluster-scale
  rules:
  - alert: EtcdDatabaseSize
    expr: etcd_mvcc_db_total_size_in_bytes > 6e9
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: "etcd database approaching quota"
  - alert: APIServerLatency
    # Exclude long-running verbs (WATCH, CONNECT), which would otherwise skew the p99
    expr: histogram_quantile(0.99, rate(apiserver_request_duration_seconds_bucket{verb!~"WATCH|CONNECT"}[5m])) > 1
    for: 10m
    labels:
      severity: critical
    annotations:
      summary: "API server p99 latency > 1s"
  - alert: SchedulerBindingLatency
    expr: histogram_quantile(0.99, rate(scheduler_binding_duration_seconds_bucket[5m])) > 5
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: "Scheduler binding latency high"
Lessons Learned
After years of scaling Kubernetes, here's what we know for sure:
- Invest in etcd early. It's the foundation. By the time etcd becomes a problem, everything else is broken too.
- Measure everything. You can't optimize what you don't measure. Instrument your control plane thoroughly.
- Test at scale. Problems that don't appear at 100 nodes will appear at 1,000. Have a staging environment that mirrors production scale.
- Plan for failure. At 10,000 nodes, something is always failing. Design for graceful degradation.
If you're running Kubernetes at scale and want to learn more, feel free to reach out. And if you want to work on these problems, we're hiring.
Sarah Kim is a Principal Platform Engineer at KubeBid, where she leads the infrastructure team. Previously, she worked on Kubernetes at Google.