Running ML inference can be expensive. GPU instances on major cloud providers cost $3-8/hour for capable hardware. But with KubeBid's auction pricing, you can often get the same GPUs for $1-3/hour. This tutorial shows you how to deploy a Llama 2 model served with vLLM and start saving.
What You'll Build
A scalable inference service for a Llama 2 model, running on A100 GPUs with automatic scaling based on request volume and GPU price.
Prerequisites
Before we start, make sure you have:
- A KubeBid account (sign up free)
- The KubeBid CLI installed
- Docker installed locally
- Basic familiarity with Kubernetes
Step 1: Set Up Your Cluster
First, let's create a GPU cluster with auction pricing. We'll set a maximum bid of $4.50/hour for A100 GPUs (typically $8.50 on-demand):
# Install/update the CLI
curl -sSL https://get.kubebid.io | bash
# Login with your API key
kubebid auth login
# Create a cluster with GPU nodes
kubebid cluster create \
--name ml-inference \
--region us-west-2 \
--node-type a100-1x \
--nodes 2 \
--max-bid 4.50 \
--bid-strategy balanced
# Get your kubeconfig
kubebid cluster kubeconfig ml-inference > ~/.kube/config
The cluster should be ready in about 30 seconds. Let's verify:
$ kubectl get nodes
NAME              STATUS   ROLES    AGE   VERSION
kb-gpu-node-001   Ready    worker   45s   v1.28.3
kb-gpu-node-002   Ready    worker   42s   v1.28.3
$ kubectl describe node kb-gpu-node-001 | grep nvidia
nvidia.com/gpu: 1
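If you'd rather script this check, here's a minimal sketch using the official Kubernetes Python client (my own addition, not part of the KubeBid tooling; it assumes pip install kubernetes and the kubeconfig written above):
# check_gpus.py - quick sanity check that the GPU nodes registered correctly
from kubernetes import client, config
config.load_kube_config()  # reads ~/.kube/config from the previous step
for node in client.CoreV1Api().list_node().items:
    gpus = node.status.allocatable.get("nvidia.com/gpu", "0")
    print(f"{node.metadata.name}: {gpus} allocatable GPU(s)")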
Step 2: Build the Inference Container
Let's create a simple inference server using FastAPI and vLLM for efficient Llama 2 serving:
# app/main.py
from fastapi import FastAPI
from pydantic import BaseModel
from vllm import LLM, SamplingParams
app = FastAPI()
# Load model on startup
llm = LLM(
model="meta-llama/Llama-2-7b-chat-hf",
tensor_parallel_size=1,
gpu_memory_utilization=0.9
)
class GenerateRequest(BaseModel):
prompt: str
max_tokens: int = 256
temperature: float = 0.7
class GenerateResponse(BaseModel):
text: str
tokens_generated: int
@app.post("/generate", response_model=GenerateResponse)
async def generate(request: GenerateRequest):
sampling_params = SamplingParams(
temperature=request.temperature,
max_tokens=request.max_tokens
)
outputs = llm.generate([request.prompt], sampling_params)
generated_text = outputs[0].outputs[0].text
return GenerateResponse(
text=generated_text,
tokens_generated=len(outputs[0].outputs[0].token_ids)
)
@app.get("/health")
async def health():
return {"status": "healthy"}
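Before containerizing, you can smoke-test the server on any GPU machine: run uvicorn app.main:app --port 8000, then hit it with a small client. A minimal sketch using the requests library (the prompt and timeout here are arbitrary):
# smoke_test.py - assumes the server is running locally on port 8000
import requests
resp = requests.post(
    "http://localhost:8000/generate",
    json={"prompt": "Say hello in one sentence.", "max_tokens": 32},
    timeout=120,
)
resp.raise_for_status()
print(resp.json()["text"])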
And the Dockerfile:
# Dockerfile
FROM nvidia/cuda:12.4.0-runtime-ubuntu22.04
RUN apt-get update && apt-get install -y python3 python3-pip
WORKDIR /app
COPY requirements.txt .
RUN pip3 install --no-cache-dir -r requirements.txt
COPY app/ ./app/
# Download model weights at build time. Llama 2 is a gated repo, so a
# HuggingFace token must be supplied as a build argument (note: build args
# are visible in the image history; use a CI secret for production builds)
ARG HUGGING_FACE_HUB_TOKEN
RUN python3 -c "from huggingface_hub import snapshot_download; \
    snapshot_download('meta-llama/Llama-2-7b-chat-hf')"
EXPOSE 8000
CMD ["uvicorn", "app.main:app", "--host", "0.0.0.0", "--port", "8000"]
# requirements.txt
fastapi==0.115.0
uvicorn==0.32.0
vllm==0.6.3
pydantic==2.9.2
Build and push to a registry. Since the Llama 2 weights are gated, accept Meta's license on the model's HuggingFace page and pass your access token as a build argument:
# Build the image (the token is needed to download the gated Llama 2 weights)
docker build -t myorg/llama2-inference:v1 \
  --build-arg HUGGING_FACE_HUB_TOKEN=$HUGGING_FACE_TOKEN .
# Push to your registry (or KubeBid's built-in registry)
docker push myorg/llama2-inference:v1
Step 3: Deploy to KubeBid
Create the Kubernetes manifests:
# deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
name: llama2-inference
annotations:
kubebid.io/bid-strategy: balanced
spec:
replicas: 2
selector:
matchLabels:
app: llama2-inference
template:
metadata:
labels:
app: llama2-inference
spec:
containers:
- name: inference
image: myorg/llama2-inference:v1
ports:
- containerPort: 8000
resources:
requests:
nvidia.com/gpu: 1
memory: "32Gi"
cpu: "4"
limits:
nvidia.com/gpu: 1
memory: "64Gi"
cpu: "8"
readinessProbe:
httpGet:
path: /health
port: 8000
initialDelaySeconds: 60
periodSeconds: 10
env:
- name: HUGGING_FACE_HUB_TOKEN
valueFrom:
secretKeyRef:
name: hf-token
key: token
---
apiVersion: v1
kind: Service
metadata:
name: llama2-inference
spec:
selector:
app: llama2-inference
ports:
- port: 80
targetPort: 8000
type: ClusterIP
---
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
name: llama2-inference
annotations:
kubernetes.io/ingress.class: kubebid
spec:
rules:
- host: llama2.your-domain.com
http:
paths:
- path: /
pathType: Prefix
backend:
service:
name: llama2-inference
port:
number: 80
Important: Llama 2 requires you to accept Meta's license agreement on HuggingFace. Visit the model page and request access before proceeding.
Deploy:
# Create the HuggingFace token secret
# Get your token from https://huggingface.co/settings/tokens
kubectl create secret generic hf-token \
--from-literal=token=$HUGGING_FACE_TOKEN
# Apply the manifests
kubectl apply -f deployment.yaml
# Watch the rollout (this may take a few minutes as the model loads)
kubectl rollout status deployment/llama2-inference
Step 4: Add Autoscaling
Let's add autoscaling that considers both request load and GPU prices:
# autoscaler.yaml
apiVersion: kubebid.io/v1
kind: BidAwareHPA
metadata:
name: llama2-autoscaler
spec:
scaleTargetRef:
apiVersion: apps/v1
kind: Deployment
name: llama2-inference
minReplicas: 1
maxReplicas: 10
metrics:
# Scale on GPU utilization
- type: Resource
resource:
name: nvidia.com/gpu
target:
type: Utilization
averageUtilization: 70
# Scale on request latency
- type: Pods
pods:
metric:
name: http_request_duration_seconds
target:
type: AverageValue
averageValue: "500m" # 500ms
# Price-aware scaling
bidBehavior:
scaleUpPriceThreshold: 3.00 # Scale up when price < $3
scaleDownPriceThreshold: 5.00 # Scale down when price > $5
priceCheckInterval: 60s
This autoscaler will:
- Scale up when GPU utilization exceeds 70%, average request latency exceeds 500ms, or prices drop below $3/hr
- Scale down when prices exceed $5/hr
- Maintain between 1 and 10 replicas
Step 5: Test Your Deployment
Let's test the inference endpoint:
# Get the ingress URL
ENDPOINT=$(kubectl get ingress llama2-inference -o jsonpath='{.spec.rules[0].host}')
# Test the endpoint
curl -X POST "https://$ENDPOINT/generate" \
-H "Content-Type: application/json" \
-d '{
"prompt": "Explain Kubernetes in simple terms:",
"max_tokens": 200,
"temperature": 0.7
}'
Expected response:
{
"text": "Kubernetes is like a smart manager for your applications...",
"tokens_generated": 156
}
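To watch the autoscaler react, you can generate a burst of parallel traffic. A rough sketch (not an official load-testing tool; it assumes the requests library and reads the $ENDPOINT hostname from above via an environment variable):
# load_test.py - fire concurrent requests to push up GPU utilization and latency
import os
from concurrent.futures import ThreadPoolExecutor
import requests
ENDPOINT = os.environ["ENDPOINT"]  # e.g. llama2.your-domain.com
def one_request(i: int) -> float:
    resp = requests.post(
        f"https://{ENDPOINT}/generate",
        json={"prompt": f"Write a haiku about cluster node {i}.", "max_tokens": 64},
        timeout=120,
    )
    resp.raise_for_status()
    return resp.elapsed.total_seconds()
with ThreadPoolExecutor(max_workers=16) as pool:
    latencies = list(pool.map(one_request, range(64)))
print(f"average latency: {sum(latencies) / len(latencies):.2f}s")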
Monitor Your Costs
Check your current spend and savings:
$ kubebid billing summary --cluster ml-inference
Cluster: ml-inference
Period: Last 24 hours
Instance Type    Hours    Avg Price    On-Demand    Savings
────────────────────────────────────────────────────────────
a100-1x          48.0     $3.12/hr     $8.50/hr     63.3%
Total Spend: $149.76
Compared to: $408.00 (on-demand)
You Saved: $258.24
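The numbers above are straightforward to sanity-check yourself: spend is hours times the average auction price, and savings compare that against the on-demand rate.
# savings_check.py - reproduce the billing summary arithmetic
hours, auction, on_demand = 48.0, 3.12, 8.50
spend = hours * auction              # $149.76
full_price = hours * on_demand       # $408.00
print(f"saved ${full_price - spend:.2f} ({1 - auction / on_demand:.1%})")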
Best Practices
A few tips for running ML workloads on KubeBid:
- Use model caching: Store model weights on a persistent volume to avoid re-downloading on pod restarts.
- Set appropriate bid strategies: Use "cost-optimized" for batch inference, "balanced" for interactive services.
- Implement graceful shutdown: Handle SIGTERM so in-flight requests complete before pre-emption (see the sketch after this list).
- Monitor GPU utilization: Under-utilized GPUs are wasted money, even at auction prices.
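For the graceful-shutdown tip, here is one possible sketch (illustrative only; the flag name and handler-chaining approach are my own, and the exact interaction with your ASGI server's signal handling is worth verifying). The idea: on SIGTERM, flip a flag so the readiness probe fails and Kubernetes stops routing new traffic while uvicorn drains in-flight requests; also make sure terminationGracePeriodSeconds covers your longest generation.
# graceful_shutdown_sketch.py - adapt into app/main.py, replacing the existing /health route
import signal
from contextlib import asynccontextmanager
from fastapi import FastAPI, Response
shutting_down = False
@asynccontextmanager
async def lifespan(app: FastAPI):
    # Register at startup so we wrap, rather than replace, the server's own SIGTERM handling
    previous = signal.getsignal(signal.SIGTERM)
    def handler(signum, frame):
        global shutting_down
        shutting_down = True  # readiness probe starts failing; no new traffic is routed here
        if callable(previous):
            previous(signum, frame)
    signal.signal(signal.SIGTERM, handler)
    yield
app = FastAPI(lifespan=lifespan)
@app.get("/health")
async def health(response: Response):
    if shutting_down:
        response.status_code = 503
        return {"status": "draining"}
    return {"status": "healthy"}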
Next Steps
Now that you have a basic inference service running, you might want to:
- Set up GitOps with ArgoCD for automated deployments
- Learn about advanced bid strategies to optimize costs further
- Add request batching for higher throughput
- Implement model versioning and A/B testing
Questions? Join our Discord community or reach out to support@kubebid.io.
David Wang is a Developer Advocate at KubeBid, focused on ML/AI infrastructure. Follow him on Twitter @davidwang_ml.