Kubernetes Orchestration
Overview
Strata is designed to scale horizontally across Kubernetes clusters, supporting training jobs that span hundreds to thousands of workers. For production environments, we recommend using the provided Helm charts to manage the lifecycle of the Coordinator, Workers, and the Web Dashboard.
Kubernetes orchestration ensures high availability for the Coordinator and allows for dynamic scaling of worker pods based on the training job requirements.
Helm Chart Configuration
The deployment is managed via a unified Helm chart located in /deploy/charts/strata. The primary configuration is handled through values.yaml.
Core Configuration (values.yaml)
coordinator:
  replicaCount: 1        # Typically 1 for consistent state management
  service:
    type: ClusterIP
    grpcPort: 50051
    httpPort: 8080
  resources:
    limits:
      cpu: 2
      memory: 4Gi

workers:
  replicaCount: 4        # Initial number of training workers
  image: syrilj/strata-worker:latest
  env:
    - name: COORDINATOR_URL
      value: "strata-coordinator:50051"
  resources:
    limits:
      nvidia.com/gpu: 1  # Assigning GPU resources per worker

storage:
  backend: "s3"          # Options: "s3" or "local"
  s3:
    bucket: "my-ml-checkpoints"
    region: "us-east-1"
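In practice you rarely edit the chart's default values.yaml directly; instead you layer a small override file on top of it. A sketch of such a file for a larger production run — the file name values-prod.yaml, the image tag, and the bucket name are illustrative, but the key paths follow the chart layout above:

```yaml
# values-prod.yaml -- applied with:
#   helm install strata ./deploy/charts/strata -f values-prod.yaml --namespace ml-training
workers:
  replicaCount: 16
  image: syrilj/strata-worker:v1.2.0   # pin a version instead of "latest" in production
storage:
  backend: "s3"
  s3:
    bucket: "prod-ml-checkpoints"
    region: "us-east-1"
```

Keys not listed in the override file keep their defaults from values.yaml.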
Persistence and Storage
Strata supports two primary persistence models on Kubernetes:
1. Cloud-Native (S3/GCS/Azure Blob)
The recommended approach for large-scale training. Checkpoints are written directly to S3-compatible storage. This allows workers to be ephemeral (Spot Instances).
- Configuration: Set STORAGE_BACKEND=s3 in your environment.
- Security: Use Kubernetes Secrets or IAM Roles for Service Accounts (IRSA) to provide credentials.
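If IRSA is not available (for example, on a non-EKS cluster), credentials can be supplied through a standard Kubernetes Secret. The Secret name and key names below are illustrative, chosen to match the conventional AWS environment variables, not something the chart is documented to require:

```yaml
apiVersion: v1
kind: Secret
metadata:
  name: strata-s3-credentials
  namespace: ml-training
type: Opaque
stringData:
  AWS_ACCESS_KEY_ID: "REPLACE_ME"
  AWS_SECRET_ACCESS_KEY: "REPLACE_ME"
```

Reference it from the worker pod spec with `envFrom: [{secretRef: {name: strata-s3-credentials}}]` so the S3 client picks up the standard AWS variables.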
2. Local Persistence (PVCs)
For on-premise clusters or low-latency checkpointing, you can use Persistent Volume Claims.
persistence:
  enabled: true
  storageClass: "fast-nvme"
  accessMode: ReadWriteMany
  size: 500Gi
When using STORAGE_BACKEND=local, the Helm chart mounts the PVC to /data/checkpoints across all worker pods to ensure consistent state access during recovery.
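A hand-written equivalent of the claim the chart renders from the persistence block might look like the following. The PVC name strata-checkpoints is an assumption; note that ReadWriteMany requires a storage class whose backend supports shared access (e.g. NFS- or CephFS-backed), which not every "fast-nvme" class does:

```yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: strata-checkpoints
  namespace: ml-training
spec:
  accessModes:
    - ReadWriteMany
  storageClassName: fast-nvme
  resources:
    requests:
      storage: 500Gi
```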
Deploying to the Cluster
1. Add the repository (or use the local chart path):
   helm install strata ./deploy/charts/strata --namespace ml-training
2. Verify the deployment:
   kubectl get pods -n ml-training
   # Expected: 1 coordinator, N workers, 1 dashboard
3. Access the dashboard: The dashboard connects to the Coordinator's HTTP API. You can expose it via an Ingress or port-forward:
   kubectl port-forward svc/strata-dashboard 3000:3000
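If you prefer a stable URL over port-forwarding, a minimal Ingress for the dashboard could look like this sketch. The hostname, the nginx ingress class, and the service name strata-dashboard are assumptions to adapt to your cluster:

```yaml
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: strata-dashboard
  namespace: ml-training
spec:
  ingressClassName: nginx
  rules:
    - host: strata.example.com
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: strata-dashboard
                port:
                  number: 3000
```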
Scaling Workers
Scaling training capacity is managed by adjusting the replicaCount of the worker deployment. Strata's Consistent Hashing mechanism automatically rebalances data shards when new workers join the cluster.
Manual Scaling
kubectl scale deployment strata-worker --replicas=32 -n ml-training
Autoscaling
You can define a Horizontal Pod Autoscaler (HPA) to scale workers automatically. The example below scales on CPU utilization; with a custom-metrics adapter installed, you can instead scale on Strata-specific signals such as shard processing queue depth:
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: strata-worker-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: strata-worker
  minReplicas: 4
  maxReplicas: 100
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 80
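To scale on queue depth instead of CPU, you would expose that metric through a custom-metrics adapter (e.g. prometheus-adapter) and swap in a Pods metric. This is a sketch: the metric name strata_shard_queue_depth and the target value are assumptions, not metrics Strata is documented to export under these names:

```yaml
  metrics:
    - type: Pods
      pods:
        metric:
          name: strata_shard_queue_depth
        target:
          type: AverageValue
          averageValue: "50"
```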
Fault Tolerance & Self-Healing
The Kubernetes orchestration layer interacts with Strata's internal fault tolerance:
- Liveness/Readiness Probes: The Coordinator gRPC service includes health checks. If the Coordinator fails, Kubernetes restarts the pod, and workers will automatically re-register via their heartbeat loop.
- Node Eviction: If a worker pod is evicted, the Coordinator detects the missing heartbeat, marks the worker as FAILED, and triggers a shard re-assignment to the remaining healthy workers.
- Checkpoint Recovery: Upon pod restart, workers query the Coordinator for the latest globally consistent checkpoint stored in S3/PVC and resume training from that specific step/epoch.
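The probe behavior described above is wired into the Coordinator's pod spec. A sketch using the ports from values.yaml, assuming Kubernetes >= 1.24 (native gRPC liveness probes); the /healthz HTTP path is an assumption, not a documented Strata endpoint:

```yaml
livenessProbe:
  grpc:
    port: 50051
  initialDelaySeconds: 10
  periodSeconds: 15
readinessProbe:
  httpGet:
    path: /healthz
    port: 8080
  periodSeconds: 5
```

If the liveness probe fails repeatedly, Kubernetes restarts the Coordinator pod, and workers re-register via their heartbeat loop as described above.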