99.99% uptime sounds impressive until you do the math: it still allows roughly 52 minutes of downtime per year. For a platform where customers run production workloads, even that feels like too much. This post explains how we designed KubeBid for high availability and the practices that keep us reliable.
Defining Reliability
First, what do we mean by "uptime"? We measure availability as:
- Control plane: Can users create/manage clusters via API?
- Data plane: Are customer workloads running?
- Auction engine: Can bids be placed and matched?
Our SLA covers all three. A failure in any component counts against our uptime budget.
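In practice we track each of these as its own SLI and treat the platform as only as available as its weakest component. A rough sketch of how that roll-up can be expressed as Prometheus recording rules (metric names and labels here are illustrative, not our production rules; the data-plane SLI follows the same pattern):
# Illustrative recording rules; metric names and labels are placeholders
groups:
  - name: kubebid-platform-availability
    rules:
      # Control plane: fraction of API requests that did not return a 5xx
      - record: sli:control_plane:availability
        expr: |
          sum(rate(http_requests_total{service="control-plane",code!~"5.."}[5m]))
            / sum(rate(http_requests_total{service="control-plane"}[5m]))
      # Auction engine: fraction of submitted bids that were matched
      - record: sli:auction:availability
        expr: |
          sum(rate(auction_matches_total[5m]))
            / sum(rate(auction_bids_total[5m]))
      # The platform is only as available as its worst component, so a
      # failure anywhere counts against the budget
      - record: sli:platform:availability
        expr: |
          min(
            label_replace(sli:control_plane:availability, "component", "control_plane", "", "")
              or label_replace(sli:auction:availability, "component", "auction", "", "")
          )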
Architecture for Availability
Multi-Region, Active-Active
KubeBid runs in 40 regions, but that's not just about being close to users. It's about fault isolation. Each region is an independent failure domain:
  ┌───────────────────────────────────────────────────────┐
  │                 Global Load Balancer                  │
  │                (Cloudflare / Route53)                 │
  └───────────────────────────────────────────────────────┘
                              │
         ┌────────────────────┼────────────────────┐
         ▼                    ▼                    ▼
  ┌─────────────┐      ┌─────────────┐      ┌─────────────┐
  │  us-west-2  │      │  us-east-1  │      │  eu-west-1  │
  │             │      │             │      │             │
  │  ┌───────┐  │      │  ┌───────┐  │      │  ┌───────┐  │
  │  │ API   │  │      │  │ API   │  │      │  │ API   │  │
  │  │ Server│  │      │  │ Server│  │      │  │ Server│  │
  │  └───────┘  │      │  └───────┘  │      │  └───────┘  │
  │  ┌───────┐  │      │  ┌───────┐  │      │  ┌───────┐  │
  │  │ etcd  │  │      │  │ etcd  │  │      │  │ etcd  │  │
  │  │cluster│  │      │  │cluster│  │      │  │cluster│  │
  │  └───────┘  │      │  └───────┘  │      │  └───────┘  │
  │  ┌───────┐  │      │  ┌───────┐  │      │  ┌───────┐  │
  │  │Auction│  │      │  │Auction│  │      │  │Auction│  │
  │  │Engine │  │      │  │Engine │  │      │  │Engine │  │
  │  └───────┘  │      │  └───────┘  │      │  └───────┘  │
  └─────────────┘      └─────────────┘      └─────────────┘
If us-west-2 goes down completely, traffic automatically routes to healthy regions. Customer clusters in that region would be affected, but the platform remains operational.
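The routing layer itself is deliberately boring: latency-based DNS records, each gated by a health check, so a failing region simply stops receiving traffic. A minimal CloudFormation-style sketch of the idea (hostnames, zone, and thresholds are placeholders, not our actual configuration):
# Sketch only: latency-based Route53 records gated by health checks
AWSTemplateFormatVersion: "2010-09-09"
Resources:
  UsWest2HealthCheck:
    Type: AWS::Route53::HealthCheck
    Properties:
      HealthCheckConfig:
        Type: HTTPS
        FullyQualifiedDomainName: api.us-west-2.kubebid.example
        ResourcePath: /healthz
        RequestInterval: 10    # probe every 10 seconds
        FailureThreshold: 3    # ~30s to declare the region unhealthy
  UsWest2Record:
    Type: AWS::Route53::RecordSet
    Properties:
      HostedZoneName: kubebid.example.
      Name: api.kubebid.example.
      Type: CNAME
      TTL: "60"
      SetIdentifier: us-west-2
      Region: us-west-2        # latency-based routing
      HealthCheckId: !Ref UsWest2HealthCheck
      ResourceRecords:
        - api.us-west-2.kubebid.example
  # us-east-1 and eu-west-1 follow the same pattern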
Cell-Based Architecture
Within each region, we use a cell-based architecture. Each cell is a self-contained unit that can serve a subset of customers:
- A cell failure affects only customers assigned to that cell
- We can migrate customers between cells without downtime
- New cells can be added for capacity without affecting existing ones
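We keep the customer-to-cell mapping as plain, versioned configuration so it is easy to audit and to change during a migration. A hypothetical sketch of what that routing table could look like (the real router and its schema are not shown here):
# Hypothetical cell routing table; cell and customer names are made up
apiVersion: v1
kind: ConfigMap
metadata:
  name: cell-routing
  namespace: cell-router
data:
  cells.yaml: |
    cells:
      - name: cell-01
        status: active
      - name: cell-02
        status: active
      - name: cell-03
        status: provisioning   # new capacity, no customers assigned yet
    assignments:
      # each customer is pinned to exactly one cell; a migration is an
      # update to this mapping plus a data copy behind the scenes
      acme-corp: cell-01
      globex: cell-02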
Redundancy at Every Layer
Every critical component has redundancy:
| Component | Redundancy | Recovery Time |
|---|---|---|
| API Servers | 5 replicas, 3 AZs | <1s (automatic) |
| etcd | 5 nodes, 3 AZs | <30s (leader election) |
| Database (Postgres) | Primary + 2 replicas | <60s (failover) |
| Auction Engine | 3 replicas per region | <5s (automatic) |
| Load Balancers | Multi-AZ, multi-region | <1s (automatic) |
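Five replicas across three AZs is enforced by the scheduler, not by convention. A simplified sketch of the pattern (names and image are placeholders): topology spread constraints keep replicas balanced across zones, and a PodDisruptionBudget keeps maintenance from evicting too many at once.
# Simplified sketch: zone-spread replicas plus a disruption budget
apiVersion: apps/v1
kind: Deployment
metadata:
  name: api-server
spec:
  replicas: 5
  selector:
    matchLabels:
      app: api-server
  template:
    metadata:
      labels:
        app: api-server
    spec:
      topologySpreadConstraints:
        - maxSkew: 1
          topologyKey: topology.kubernetes.io/zone   # spread across AZs
          whenUnsatisfiable: DoNotSchedule
          labelSelector:
            matchLabels:
              app: api-server
      containers:
        - name: api-server
          image: registry.example.com/kubebid/api-server:latest
---
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: api-server
spec:
  minAvailable: 3    # never drop below 3 of 5 replicas during maintenance
  selector:
    matchLabels:
      app: api-server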
Chaos Engineering
We don't wait for failures to happen—we cause them intentionally. Our chaos engineering program runs continuously in production (yes, production):
Chaos Experiments We Run
# Example: Kill random API server pods
apiVersion: chaos-mesh.org/v1alpha1
kind: PodChaos
metadata:
  name: api-server-failure
spec:
  action: pod-kill
  mode: one
  selector:
    namespaces:
      - kube-system
    labelSelectors:
      component: kube-apiserver
  scheduler:
    cron: "*/30 * * * *"  # Every 30 minutes
---
# Example: Network partition between AZs
apiVersion: chaos-mesh.org/v1alpha1
kind: NetworkChaos
metadata:
  name: az-partition
spec:
  action: partition
  mode: all
  selector:
    namespaces:
      - production
  direction: both
  target:
    mode: all
    selector:
      labelSelectors:
        zone: us-west-2b
  duration: "5m"
  scheduler:
    cron: "0 3 * * *"  # Daily at 3 AM
Game Days
Monthly, we run "game days" where we simulate major failures: full region outages, database corruption, DDoS attacks. The entire engineering team participates, and we use these to find gaps in our runbooks and automation.
Observability
You can't fix what you can't see. Our observability stack includes:
- Metrics: Prometheus + Thanos for long-term storage
- Logs: Vector + ClickHouse for high-cardinality analysis
- Traces: OpenTelemetry + Jaeger
- Synthetic monitoring: Global probes testing every endpoint
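The synthetic probes matter because they exercise the same path a customer does and catch failures our internal metrics can miss. A trimmed sketch of a Prometheus scrape job driving the blackbox exporter (the endpoints and exporter address are placeholders):
# Sketch: synthetic HTTP probes against each regional API endpoint
scrape_configs:
  - job_name: synthetic-api-probes
    metrics_path: /probe
    params:
      module: [http_2xx]        # expect an HTTP 2xx response
    static_configs:
      - targets:
          - https://api.us-west-2.kubebid.example/healthz
          - https://api.us-east-1.kubebid.example/healthz
          - https://api.eu-west-1.kubebid.example/healthz
    relabel_configs:
      - source_labels: [__address__]
        target_label: __param_target
      - source_labels: [__param_target]
        target_label: instance
      - target_label: __address__
        replacement: blackbox-exporter:9115   # probes go through the exporter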
Key SLIs We Monitor
# Service Level Indicators
- name: api_availability
  description: "Percentage of successful API requests"
  query: |
    sum(rate(http_requests_total{code!~"5.."}[5m])) /
    sum(rate(http_requests_total[5m]))
  target: 0.9999
- name: api_latency_p99
  description: "99th percentile API latency"
  query: |
    histogram_quantile(0.99,
      sum by (le) (rate(http_request_duration_seconds_bucket[5m])))
  target: 0.5  # 500ms
- name: auction_match_rate
  description: "Percentage of bids matched within SLA"
  query: |
    sum(rate(auction_matches_total{within_sla="true"}[5m])) /
    sum(rate(auction_bids_total[5m]))
  target: 0.999
- name: cluster_provision_time_p95
  description: "95th percentile cluster provisioning time"
  query: |
    histogram_quantile(0.95,
      sum by (le) (rate(cluster_provision_duration_seconds_bucket[1h])))
  target: 30  # 30 seconds
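Targets alone don't page anyone; what pages us is how fast an SLO's error budget is burning. A sketch of a fast-burn alert on the availability SLO above, using the common multi-window burn-rate pattern (a 14.4x burn rate spends a 30-day budget in about two days; the thresholds are illustrative):
# Sketch: multi-window fast-burn alert for the 99.99% availability SLO
groups:
  - name: slo-burn-alerts
    rules:
      - alert: APIErrorBudgetFastBurn
        expr: |
          (
            1 - (sum(rate(http_requests_total{code!~"5.."}[5m]))
                   / sum(rate(http_requests_total[5m])))
          ) > (14.4 * (1 - 0.9999))
          and
          (
            1 - (sum(rate(http_requests_total{code!~"5.."}[1h]))
                   / sum(rate(http_requests_total[1h])))
          ) > (14.4 * (1 - 0.9999))
        for: 2m
        labels:
          severity: page
        annotations:
          summary: "API error budget is burning more than 14x too fast"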
Incident Response
When things go wrong (and they do), speed matters. Our incident response process:
- Detection: Automated alerts fire within 60 seconds of anomaly
- Triage: On-call engineer assesses severity within 5 minutes
- Communication: Status page updated within 10 minutes
- Mitigation: Focus on restoring service, not root cause
- Resolution: Full fix deployed after service is stable
- Post-mortem: Blameless review within 48 hours
On-Call Culture
We rotate on-call weekly, with every engineer participating—including founders. This ensures everyone understands operational pain points and incentivizes building reliable systems.
Deployment Safety
Most outages are caused by changes. We've built multiple safety nets:
- Canary deployments: New code goes to 1% of traffic first
- Automated rollback: Error rate spike triggers automatic revert
- Feature flags: New features can be disabled without deploy
- Change freeze: No deploys during peak hours or holidays
# Argo Rollouts canary strategy
apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: api-server
spec:
  strategy:
    canary:
      steps:
        - setWeight: 1
        - pause: {duration: 5m}
        - analysis:
            templates:
              - templateName: success-rate
        - setWeight: 10
        - pause: {duration: 10m}
        - analysis:
            templates:
              - templateName: success-rate
        - setWeight: 50
        - pause: {duration: 10m}
        - setWeight: 100
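The "success-rate" analysis referenced in those steps is what turns the canary into an automatic rollback: if the measured success rate drops below the threshold, the rollout aborts and traffic returns to the stable version. A minimal sketch of such a template, assuming a Prometheus provider (the address, service label, and threshold are placeholders):
# Sketch: the analysis template the canary steps refer to
apiVersion: argoproj.io/v1alpha1
kind: AnalysisTemplate
metadata:
  name: success-rate
spec:
  metrics:
    - name: success-rate
      interval: 1m
      successCondition: result[0] >= 0.999   # abort below 99.9% success
      failureLimit: 3
      provider:
        prometheus:
          address: http://prometheus.monitoring.svc:9090
          query: |
            sum(rate(http_requests_total{service="api-server",code!~"5.."}[5m]))
              / sum(rate(http_requests_total{service="api-server"}[5m]))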
What We've Learned
After running at scale for over two years, here are our key learnings:
- Simple systems are reliable systems. Every component we add is a potential failure point. We ruthlessly remove complexity.
- Test your recovery, not just your systems. Backups you've never restored aren't backups. Failover you've never triggered isn't failover.
- Invest in observability early. You can't diagnose problems you can't see. We spent 6 months building observability before launching.
- Make the right thing easy. Engineers will take shortcuts. Build systems where the shortcut is also the safe path.
Our Uptime Record
We're not at 99.99% yet, but we're close. Every incident teaches us something, and we're continuously improving.
If you're interested in building reliable systems at scale, we're hiring SREs. And if you want to run your workloads on a platform built for reliability, try KubeBid.
Nina Patel is a Site Reliability Engineer at KubeBid. Previously, she was an SRE at Netflix and helped build their chaos engineering program.