99.99% uptime sounds impressive until you do the math: it still allows roughly 52 minutes of downtime per year. For a platform where customers run production workloads, even that feels like too much. This post explains how we designed KubeBid for high availability and the practices that keep us reliable.
Defining Reliability
First, what do we mean by "uptime"? We measure availability as:
- Control plane: Can users create/manage clusters via API?
- Data plane: Are customer workloads running?
- Auction engine: Can bids be placed and matched?
Our SLA covers all three. A failure in any component counts against our uptime budget.
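In practice we track each of these as its own SLI and treat the platform as only as available as its weakest component. A rough sketch of how that roll-up can be expressed as Prometheus recording rules (metric names and labels here are illustrative, not our production rules; the data-plane SLI follows the same pattern):
# Illustrative recording rules; metric names and labels are placeholders
groups:
  - name: kubebid-platform-availability
    rules:
      # Control plane: fraction of API requests that did not return a 5xx
      - record: sli:control_plane:availability
        expr: |
          sum(rate(http_requests_total{service="control-plane",code!~"5.."}[5m]))
            / sum(rate(http_requests_total{service="control-plane"}[5m]))
      # Auction engine: fraction of submitted bids that were matched
      - record: sli:auction:availability
        expr: |
          sum(rate(auction_matches_total[5m]))
            / sum(rate(auction_bids_total[5m]))
      # The platform is only as available as its worst component, so a
      # failure anywhere counts against the budget
      - record: sli:platform:availability
        expr: |
          min(
            label_replace(sli:control_plane:availability, "component", "control_plane", "", "")
              or label_replace(sli:auction:availability, "component", "auction", "", "")
          )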
Architecture for Availability
Multi-Region, Active-Active
KubeBid runs in 40 regions, but that's not just about being close to users. It's about fault isolation. Each region is an independent failure domain:
  ┌───────────────────────────────────────────────────────┐
  │                 Global Load Balancer                  │
  │                (Cloudflare / Route53)                 │
  └───────────────────────────────────────────────────────┘
                              │
         ┌────────────────────┼────────────────────┐
         ▼                    ▼                    ▼
  ┌─────────────┐      ┌─────────────┐      ┌─────────────┐
  │  us-west-2  │      │  us-east-1  │      │  eu-west-1  │
  │             │      │             │      │             │
  │  ┌───────┐  │      │  ┌───────┐  │      │  ┌───────┐  │
  │  │ API   │  │      │  │ API   │  │      │  │ API   │  │
  │  │ Server│  │      │  │ Server│  │      │  │ Server│  │
  │  └───────┘  │      │  └───────┘  │      │  └───────┘  │
  │  ┌───────┐  │      │  ┌───────┐  │      │  ┌───────┐  │
  │  │ etcd  │  │      │  │ etcd  │  │      │  │ etcd  │  │
  │  │cluster│  │      │  │cluster│  │      │  │cluster│  │
  │  └───────┘  │      │  └───────┘  │      │  └───────┘  │
  │  ┌───────┐  │      │  ┌───────┐  │      │  ┌───────┐  │
  │  │Auction│  │      │  │Auction│  │      │  │Auction│  │
  │  │Engine │  │      │  │Engine │  │      │  │Engine │  │
  │  └───────┘  │      │  └───────┘  │      │  └───────┘  │
  └─────────────┘      └─────────────┘      └─────────────┘
If us-west-2 goes down completely, traffic automatically routes to healthy regions. Customer clusters in that region would be affected, but the platform remains operational.
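The routing layer itself is deliberately boring: latency-based DNS records, each gated by a health check, so a failing region simply stops receiving traffic. A minimal CloudFormation-style sketch of the idea (hostnames, zone, and thresholds are placeholders, not our actual configuration):
# Sketch only: latency-based Route53 records gated by health checks
AWSTemplateFormatVersion: "2010-09-09"
Resources:
  UsWest2HealthCheck:
    Type: AWS::Route53::HealthCheck
    Properties:
      HealthCheckConfig:
        Type: HTTPS
        FullyQualifiedDomainName: api.us-west-2.kubebid.example
        ResourcePath: /healthz
        RequestInterval: 10    # probe every 10 seconds
        FailureThreshold: 3    # ~30s to declare the region unhealthy
  UsWest2Record:
    Type: AWS::Route53::RecordSet
    Properties:
      HostedZoneName: kubebid.example.
      Name: api.kubebid.example.
      Type: CNAME
      TTL: "60"
      SetIdentifier: us-west-2
      Region: us-west-2        # latency-based routing
      HealthCheckId: !Ref UsWest2HealthCheck
      ResourceRecords:
        - api.us-west-2.kubebid.example
  # us-east-1 and eu-west-1 follow the same pattern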
Cell-Based Architecture
Within each region, we use a cell-based architecture. Each cell is a self-contained unit that can serve a subset of customers:
- A cell failure affects only customers assigned to that cell
- We can migrate customers between cells without downtime
- New cells can be added for capacity without affecting existing ones
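We keep the customer-to-cell mapping as plain, versioned configuration so it is easy to audit and to change during a migration. A hypothetical sketch of what that routing table could look like (the real router and its schema are not shown here):
# Hypothetical cell routing table; cell and customer names are made up
apiVersion: v1
kind: ConfigMap
metadata:
  name: cell-routing
  namespace: cell-router
data:
  cells.yaml: |
    cells:
      - name: cell-01
        status: active
      - name: cell-02
        status: active
      - name: cell-03
        status: provisioning   # new capacity, no customers assigned yet
    assignments:
      # each customer is pinned to exactly one cell; a migration is an
      # update to this mapping plus a data copy behind the scenes
      acme-corp: cell-01
      globex: cell-02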
Redundancy at Every Layer
Every critical component has redundancy:
| Component | Redundancy | Recovery Time |
|---|---|---|
| API Servers | 5 replicas, 3 AZs | <1s (automatic) |
| etcd | 5 nodes, 3 AZs | <30s (leader election) |
| Database (Postgres) | Primary + 2 replicas | <60s (failover) |
| Auction Engine | 3 replicas per region | <5s (automatic) |
| Load Balancers | Multi-AZ, multi-region | <1s (automatic) |
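Five replicas across three AZs is enforced by the scheduler, not by convention. A simplified sketch of the pattern (names and image are placeholders): topology spread constraints keep replicas balanced across zones, and a PodDisruptionBudget keeps maintenance from evicting too many at once.
# Simplified sketch: zone-spread replicas plus a disruption budget
apiVersion: apps/v1
kind: Deployment
metadata:
  name: api-server
spec:
  replicas: 5
  selector:
    matchLabels:
      app: api-server
  template:
    metadata:
      labels:
        app: api-server
    spec:
      topologySpreadConstraints:
        - maxSkew: 1
          topologyKey: topology.kubernetes.io/zone   # spread across AZs
          whenUnsatisfiable: DoNotSchedule
          labelSelector:
            matchLabels:
              app: api-server
      containers:
        - name: api-server
          image: registry.example.com/kubebid/api-server:latest
---
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: api-server
spec:
  minAvailable: 3    # never drop below 3 of 5 replicas during maintenance
  selector:
    matchLabels:
      app: api-server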
Chaos Engineering
We don't wait for failures to happen—we cause them intentionally. Our chaos engineering program runs continuously in production (yes, production):
Chaos Experiments We Run
# Example: Kill random API server pods
apiVersion: chaos-mesh.org/v1alpha1
kind: PodChaos
metadata:
  name: api-server-failure
spec:
  action: pod-kill
  mode: one
  selector:
    namespaces:
      - kube-system
    labelSelectors:
      component: kube-apiserver
  scheduler:
    cron: "*/30 * * * *"  # Every 30 minutes
---
# Example: Network partition between AZs
apiVersion: chaos-mesh.org/v1alpha1
kind: NetworkChaos
metadata:
  name: az-partition
spec:
  action: partition
  mode: all
  selector:
    namespaces:
      - production
  direction: both
  target:
    mode: all
    selector:
      labelSelectors:
        zone: us-west-2b
  duration: "5m"
  scheduler:
    cron: "0 3 * * *"  # Daily at 3 AM
Game Days
Monthly, we run "game days" where we simulate major failures: full region outages, database corruption, DDoS attacks. The entire engineering team participates, and we use these to find gaps in our runbooks and automation.
Observability
You can't fix what you can't see. Our observability stack includes:
- Metrics: Prometheus + Thanos for long-term storage
- Logs: Vector + ClickHouse for high-cardinality analysis
- Traces: OpenTelemetry + Jaeger
- Synthetic monitoring: Global probes testing every endpoint
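The synthetic probes matter because they exercise the same path a customer does and catch failures our internal metrics can miss. A trimmed sketch of a Prometheus scrape job driving the blackbox exporter (the endpoints and exporter address are placeholders):
# Sketch: synthetic HTTP probes against each regional API endpoint
scrape_configs:
  - job_name: synthetic-api-probes
    metrics_path: /probe
    params:
      module: [http_2xx]        # expect an HTTP 2xx response
    static_configs:
      - targets:
          - https://api.us-west-2.kubebid.example/healthz
          - https://api.us-east-1.kubebid.example/healthz
          - https://api.eu-west-1.kubebid.example/healthz
    relabel_configs:
      - source_labels: [__address__]
        target_label: __param_target
      - source_labels: [__param_target]
        target_label: instance
      - target_label: __address__
        replacement: blackbox-exporter:9115   # probes go through the exporter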
Key SLIs We Monitor
# Service Level Indicators
- name: api_availability
  description: "Percentage of successful API requests"
  query: |
    sum(rate(http_requests_total{code!~"5.."}[5m])) /
    sum(rate(http_requests_total[5m]))
  target: 0.9999
- name: api_latency_p99
  description: "99th percentile API latency"
  query: |
    histogram_quantile(0.99,
      sum by (le) (rate(http_request_duration_seconds_bucket[5m])))
  target: 0.5  # 500ms
- name: auction_match_rate
  description: "Percentage of bids matched within SLA"
  query: |
    sum(rate(auction_matches_total{within_sla="true"}[5m])) /
    sum(rate(auction_bids_total[5m]))
  target: 0.999
- name: cluster_provision_time_p95
  description: "95th percentile cluster provisioning time"
  query: |
    histogram_quantile(0.95,
      sum by (le) (rate(cluster_provision_duration_seconds_bucket[1h])))
  target: 30  # 30 seconds
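Targets alone don't page anyone; what pages us is how fast an SLO's error budget is burning. A sketch of a fast-burn alert on the availability SLO above, using the common multi-window burn-rate pattern (a 14.4x burn rate spends a 30-day budget in about two days; the thresholds are illustrative):
# Sketch: multi-window fast-burn alert for the 99.99% availability SLO
groups:
  - name: slo-burn-alerts
    rules:
      - alert: APIErrorBudgetFastBurn
        expr: |
          (
            1 - (sum(rate(http_requests_total{code!~"5.."}[5m]))
                   / sum(rate(http_requests_total[5m])))
          ) > (14.4 * (1 - 0.9999))
          and
          (
            1 - (sum(rate(http_requests_total{code!~"5.."}[1h]))
                   / sum(rate(http_requests_total[1h])))
          ) > (14.4 * (1 - 0.9999))
        for: 2m
        labels:
          severity: page
        annotations:
          summary: "API error budget is burning more than 14x too fast"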
Incident Response
When things go wrong (and they do), speed matters. Our incident response process:
- Detection: Automated alerts fire within 60 seconds of anomaly
- Triage: On-call engineer assesses severity within 5 minutes
- Communication: Status page updated within 10 minutes
- Mitigation: Focus on restoring service, not root cause
- Resolution: Full fix deployed after service is stable
- Post-mortem: Blameless review within 48 hours
On-Call Culture
We rotate on-call weekly, with every engineer participating—including founders. This ensures everyone understands operational pain points and incentivizes building reliable systems.
Deployment Safety
Most outages are caused by changes. We've built multiple safety nets:
- Canary deployments: New code goes to 1% of traffic first
- Automated rollback: Error rate spike triggers automatic revert
- Feature flags: New features can be disabled without deploy
- Change freeze: No deploys during peak hours or holidays
# Argo Rollouts canary strategy
apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: api-server
spec:
  strategy:
    canary:
      steps:
        - setWeight: 1
        - pause: {duration: 5m}
        - analysis:
            templates:
              - templateName: success-rate
        - setWeight: 10
        - pause: {duration: 10m}
        - analysis:
            templates:
              - templateName: success-rate
        - setWeight: 50
        - pause: {duration: 10m}
        - setWeight: 100
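The "success-rate" analysis referenced in those steps is what turns the canary into an automatic rollback: if the measured success rate drops below the threshold, the rollout aborts and traffic returns to the stable version. A minimal sketch of such a template, assuming a Prometheus provider (the address, service label, and threshold are placeholders):
# Sketch: the analysis template the canary steps refer to
apiVersion: argoproj.io/v1alpha1
kind: AnalysisTemplate
metadata:
  name: success-rate
spec:
  metrics:
    - name: success-rate
      interval: 1m
      successCondition: result[0] >= 0.999   # abort below 99.9% success
      failureLimit: 3
      provider:
        prometheus:
          address: http://prometheus.monitoring.svc:9090
          query: |
            sum(rate(http_requests_total{service="api-server",code!~"5.."}[5m]))
              / sum(rate(http_requests_total{service="api-server"}[5m]))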
What We've Learned
After running at scale for over two years, here are our key learnings:
- Simple systems are reliable systems. Every component we add is a potential failure point. We ruthlessly remove complexity.
- Test your recovery, not just your systems. Backups you've never restored aren't backups. Failover you've never triggered isn't failover.
- Invest in observability early. You can't diagnose problems you can't see. We spent 6 months building observability before launching.
- Make the right thing easy. Engineers will take shortcuts. Build systems where the shortcut is also the safe path.
Our Uptime Record
We're not at 99.99% yet, but we're close. Every incident teaches us something, and we're continuously improving.
If you're interested in building reliable systems at scale, we're hiring SREs. And if you want to run your workloads on a platform built for reliability, try KubeBid.
Nina Patel is a Site Reliability Engineer at KubeBid. Previously, she was an SRE at Netflix and helped build their chaos engineering program.