Running ML inference can be expensive. GPU instances on major cloud providers cost $3-8/hour for capable hardware. But with KubeBid's auction pricing, you can often get the same GPUs for $1-3/hour. This tutorial shows you how to deploy a Llama 2 model served with vLLM and start saving.
What You'll Build
A scalable inference service for a Llama 2 model, running on A100 GPUs with automatic scaling based on request volume and GPU price.
Prerequisites
Before we start, make sure you have:
- A KubeBid account (sign up free)
- The KubeBid CLI installed
- Docker installed locally
- Basic familiarity with Kubernetes
Step 1: Set Up Your Cluster
First, let's create a GPU cluster with auction pricing. We'll set a maximum bid of $4.50/hour for A100 GPUs (typically $8.50 on-demand):
# Install/update the CLI
curl -sSL https://get.kubebid.io | bash
# Login with your API key
kubebid auth login
# Create a cluster with GPU nodes
kubebid cluster create \
--name ml-inference \
--region us-west-2 \
--node-type a100-1x \
--nodes 2 \
--max-bid 4.50 \
--bid-strategy balanced
# Get your kubeconfig
kubebid cluster kubeconfig ml-inference > ~/.kube/config
The cluster should be ready in about 30 seconds. Let's verify:
$ kubectl get nodes
NAME              STATUS   ROLES    AGE   VERSION
kb-gpu-node-001   Ready    worker   45s   v1.28.3
kb-gpu-node-002   Ready    worker   42s   v1.28.3
$ kubectl describe node kb-gpu-node-001 | grep nvidia
nvidia.com/gpu: 1
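If you'd rather script this check, here's a minimal sketch using the official Kubernetes Python client (my own addition, not part of the KubeBid tooling; it assumes pip install kubernetes and the kubeconfig written above):
# check_gpus.py - quick sanity check that the GPU nodes registered correctly
from kubernetes import client, config
config.load_kube_config()  # reads ~/.kube/config from the previous step
for node in client.CoreV1Api().list_node().items:
    gpus = node.status.allocatable.get("nvidia.com/gpu", "0")
    print(f"{node.metadata.name}: {gpus} allocatable GPU(s)")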
Step 2: Build the Inference Container
Let's create a simple inference server using FastAPI and vLLM for efficient Llama 2 serving:
# app/main.py
from fastapi import FastAPI
from pydantic import BaseModel
from vllm import LLM, SamplingParams
app = FastAPI()
# Load model on startup
llm = LLM(
model="meta-llama/Llama-2-7b-chat-hf",
tensor_parallel_size=1,
gpu_memory_utilization=0.9
)
class GenerateRequest(BaseModel):
prompt: str
max_tokens: int = 256
temperature: float = 0.7
class GenerateResponse(BaseModel):
text: str
tokens_generated: int
@app.post("/generate", response_model=GenerateResponse)
async def generate(request: GenerateRequest):
sampling_params = SamplingParams(
temperature=request.temperature,
max_tokens=request.max_tokens
)
outputs = llm.generate([request.prompt], sampling_params)
generated_text = outputs[0].outputs[0].text
return GenerateResponse(
text=generated_text,
tokens_generated=len(outputs[0].outputs[0].token_ids)
)
@app.get("/health")
async def health():
return {"status": "healthy"}
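Before containerizing, you can smoke-test the server on any GPU machine: run uvicorn app.main:app --port 8000, then hit it with a small client. A minimal sketch using the requests library (the prompt and timeout here are arbitrary):
# smoke_test.py - assumes the server is running locally on port 8000
import requests
resp = requests.post(
    "http://localhost:8000/generate",
    json={"prompt": "Say hello in one sentence.", "max_tokens": 32},
    timeout=120,
)
resp.raise_for_status()
print(resp.json()["text"])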
And the Dockerfile:
# Dockerfile
FROM nvidia/cuda:12.4.0-runtime-ubuntu22.04
RUN apt-get update && apt-get install -y python3 python3-pip
WORKDIR /app
COPY requirements.txt .
RUN pip3 install --no-cache-dir -r requirements.txt
COPY app/ ./app/
# Download model weights at build time. Llama 2 is a gated repo, so a
# HuggingFace token must be supplied as a build argument (note: build args
# are visible in the image history; use a CI secret for production builds)
ARG HUGGING_FACE_HUB_TOKEN
RUN python3 -c "from huggingface_hub import snapshot_download; \
    snapshot_download('meta-llama/Llama-2-7b-chat-hf')"
EXPOSE 8000
CMD ["uvicorn", "app.main:app", "--host", "0.0.0.0", "--port", "8000"]
# requirements.txt
fastapi==0.115.0
uvicorn==0.32.0
vllm==0.6.3
pydantic==2.9.2
Build and push to a registry. Since the Llama 2 weights are gated, accept Meta's license on the model's HuggingFace page and pass your access token as a build argument:
# Build the image (the token is needed to download the gated Llama 2 weights)
docker build -t myorg/llama2-inference:v1 \
  --build-arg HUGGING_FACE_HUB_TOKEN=$HUGGING_FACE_TOKEN .
# Push to your registry (or KubeBid's built-in registry)
docker push myorg/llama2-inference:v1
Step 3: Deploy to KubeBid
Create the Kubernetes manifests:
# deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
name: llama2-inference
annotations:
kubebid.io/bid-strategy: balanced
spec:
replicas: 2
selector:
matchLabels:
app: llama2-inference
template:
metadata:
labels:
app: llama2-inference
spec:
containers:
- name: inference
image: myorg/llama2-inference:v1
ports:
- containerPort: 8000
resources:
requests:
nvidia.com/gpu: 1
memory: "32Gi"
cpu: "4"
limits:
nvidia.com/gpu: 1
memory: "64Gi"
cpu: "8"
readinessProbe:
httpGet:
path: /health
port: 8000
initialDelaySeconds: 60
periodSeconds: 10
env:
- name: HUGGING_FACE_HUB_TOKEN
valueFrom:
secretKeyRef:
name: hf-token
key: token
---
apiVersion: v1
kind: Service
metadata:
name: llama2-inference
spec:
selector:
app: llama2-inference
ports:
- port: 80
targetPort: 8000
type: ClusterIP
---
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
name: llama2-inference
annotations:
kubernetes.io/ingress.class: kubebid
spec:
rules:
- host: llama2.your-domain.com
http:
paths:
- path: /
pathType: Prefix
backend:
service:
name: llama2-inference
port:
number: 80
Important: Llama 2 requires you to accept Meta's license agreement on HuggingFace. Visit the model page and request access before proceeding.
Deploy:
# Create the HuggingFace token secret
# Get your token from https://huggingface.co/settings/tokens
kubectl create secret generic hf-token \
--from-literal=token=$HUGGING_FACE_TOKEN
# Apply the manifests
kubectl apply -f deployment.yaml
# Watch the rollout (this may take a few minutes as the model loads)
kubectl rollout status deployment/llama2-inference
Step 4: Add Autoscaling
Let's add autoscaling that considers both request load and GPU prices:
# autoscaler.yaml
apiVersion: kubebid.io/v1
kind: BidAwareHPA
metadata:
name: llama2-autoscaler
spec:
scaleTargetRef:
apiVersion: apps/v1
kind: Deployment
name: llama2-inference
minReplicas: 1
maxReplicas: 10
metrics:
# Scale on GPU utilization
- type: Resource
resource:
name: nvidia.com/gpu
target:
type: Utilization
averageUtilization: 70
# Scale on request latency
- type: Pods
pods:
metric:
name: http_request_duration_seconds
target:
type: AverageValue
averageValue: "500m" # 500ms
# Price-aware scaling
bidBehavior:
scaleUpPriceThreshold: 3.00 # Scale up when price < $3
scaleDownPriceThreshold: 5.00 # Scale down when price > $5
priceCheckInterval: 60s
This autoscaler will:
- Scale up when GPU utilization exceeds 70%, average request latency exceeds 500ms, or prices drop below $3/hr
- Scale down when prices exceed $5/hr
- Maintain between 1 and 10 replicas
Step 5: Test Your Deployment
Let's test the inference endpoint:
# Get the ingress URL
ENDPOINT=$(kubectl get ingress llama2-inference -o jsonpath='{.spec.rules[0].host}')
# Test the endpoint
curl -X POST "https://$ENDPOINT/generate" \
-H "Content-Type: application/json" \
-d '{
"prompt": "Explain Kubernetes in simple terms:",
"max_tokens": 200,
"temperature": 0.7
}'
Expected response:
{
"text": "Kubernetes is like a smart manager for your applications...",
"tokens_generated": 156
}
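To watch the autoscaler react, you can generate a burst of parallel traffic. A rough sketch (not an official load-testing tool; it assumes the requests library and reads the $ENDPOINT hostname from above via an environment variable):
# load_test.py - fire concurrent requests to push up GPU utilization and latency
import os
from concurrent.futures import ThreadPoolExecutor
import requests
ENDPOINT = os.environ["ENDPOINT"]  # e.g. llama2.your-domain.com
def one_request(i: int) -> float:
    resp = requests.post(
        f"https://{ENDPOINT}/generate",
        json={"prompt": f"Write a haiku about cluster node {i}.", "max_tokens": 64},
        timeout=120,
    )
    resp.raise_for_status()
    return resp.elapsed.total_seconds()
with ThreadPoolExecutor(max_workers=16) as pool:
    latencies = list(pool.map(one_request, range(64)))
print(f"average latency: {sum(latencies) / len(latencies):.2f}s")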
Monitor Your Costs
Check your current spend and savings:
$ kubebid billing summary --cluster ml-inference
Cluster: ml-inference
Period: Last 24 hours
Instance Type    Hours    Avg Price    On-Demand    Savings
────────────────────────────────────────────────────────────
a100-1x          48.0     $3.12/hr     $8.50/hr     63.3%
Total Spend: $149.76
Compared to: $408.00 (on-demand)
You Saved: $258.24
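The numbers above are straightforward to sanity-check yourself: spend is hours times the average auction price, and savings compare that against the on-demand rate.
# savings_check.py - reproduce the billing summary arithmetic
hours, auction, on_demand = 48.0, 3.12, 8.50
spend = hours * auction              # $149.76
full_price = hours * on_demand       # $408.00
print(f"saved ${full_price - spend:.2f} ({1 - auction / on_demand:.1%})")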
Best Practices
A few tips for running ML workloads on KubeBid:
- Use model caching: Store model weights on a persistent volume to avoid re-downloading on pod restarts.
- Set appropriate bid strategies: Use "cost-optimized" for batch inference, "balanced" for interactive services.
- Implement graceful shutdown: Handle SIGTERM so in-flight requests complete before pre-emption (see the sketch after this list).
- Monitor GPU utilization: Under-utilized GPUs are wasted money, even at auction prices.
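For the graceful-shutdown tip, here is one possible sketch (illustrative only; the flag name and handler-chaining approach are my own, and the exact interaction with your ASGI server's signal handling is worth verifying). The idea: on SIGTERM, flip a flag so the readiness probe fails and Kubernetes stops routing new traffic while uvicorn drains in-flight requests; also make sure terminationGracePeriodSeconds covers your longest generation.
# graceful_shutdown_sketch.py - adapt into app/main.py, replacing the existing /health route
import signal
from contextlib import asynccontextmanager
from fastapi import FastAPI, Response
shutting_down = False
@asynccontextmanager
async def lifespan(app: FastAPI):
    # Register at startup so we wrap, rather than replace, the server's own SIGTERM handling
    previous = signal.getsignal(signal.SIGTERM)
    def handler(signum, frame):
        global shutting_down
        shutting_down = True  # readiness probe starts failing; no new traffic is routed here
        if callable(previous):
            previous(signum, frame)
    signal.signal(signal.SIGTERM, handler)
    yield
app = FastAPI(lifespan=lifespan)
@app.get("/health")
async def health(response: Response):
    if shutting_down:
        response.status_code = 503
        return {"status": "draining"}
    return {"status": "healthy"}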
Next Steps
Now that you have a basic inference service running, you might want to:
- Set up GitOps with ArgoCD for automated deployments
- Learn about advanced bid strategies to optimize costs further
- Add request batching for higher throughput
- Implement model versioning and A/B testing
Questions? Join our Discord community or reach out to support@kubebid.io.
David Wang is a Developer Advocate at KubeBid, focused on ML/AI infrastructure. Follow him on Twitter @davidwang_ml.