Kubernetes Orchestration
Overview
Strata is designed to scale horizontally across Kubernetes clusters, supporting training jobs that span hundreds to thousands of workers. For production environments, we recommend using the provided Helm charts to manage the lifecycle of the Coordinator, Workers, and the Web Dashboard.
Kubernetes orchestration ensures high availability for the Coordinator and allows for dynamic scaling of worker pods based on the training job requirements.
Helm Chart Configuration
The deployment is managed via a unified Helm chart located in /deploy/charts/strata. The primary configuration is handled through values.yaml.
Core Configuration (values.yaml)
coordinator:
  replicaCount: 1        # Typically 1 for consistent state management
  service:
    type: ClusterIP
    grpcPort: 50051
    httpPort: 8080
  resources:
    limits:
      cpu: 2
      memory: 4Gi

workers:
  replicaCount: 4        # Initial number of training workers
  image: syrilj/strata-worker:latest
  env:
    - name: COORDINATOR_URL
      value: "strata-coordinator:50051"
  resources:
    limits:
      nvidia.com/gpu: 1  # Assigning GPU resources per worker

storage:
  backend: "s3"          # Options: "s3" or "local"
  s3:
    bucket: "my-ml-checkpoints"
    region: "us-east-1"
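In practice you rarely edit the chart's default values.yaml directly; instead you layer a small override file on top of it. A sketch of such a file for a larger production run — the file name values-prod.yaml, the image tag, and the bucket name are illustrative, but the key paths follow the chart layout above:

```yaml
# values-prod.yaml -- applied with:
#   helm install strata ./deploy/charts/strata -f values-prod.yaml --namespace ml-training
workers:
  replicaCount: 16
  image: syrilj/strata-worker:v1.2.0   # pin a version instead of "latest" in production
storage:
  backend: "s3"
  s3:
    bucket: "prod-ml-checkpoints"
    region: "us-east-1"
```

Keys not listed in the override file keep their defaults from values.yaml.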
Persistence and Storage
Strata supports two primary persistence models on Kubernetes:
1. Cloud-Native (S3/GCS/Azure Blob)
The recommended approach for large-scale training. Checkpoints are written directly to S3-compatible storage. This allows workers to be ephemeral (Spot Instances).
- Configuration: Set STORAGE_BACKEND=s3 in your environment.
- Security: Use Kubernetes Secrets or IAM Roles for Service Accounts (IRSA) to provide credentials.
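If IRSA is not available (for example, on a non-EKS cluster), credentials can be supplied through a standard Kubernetes Secret. The Secret name and key names below are illustrative, chosen to match the conventional AWS environment variables, not something the chart is documented to require:

```yaml
apiVersion: v1
kind: Secret
metadata:
  name: strata-s3-credentials
  namespace: ml-training
type: Opaque
stringData:
  AWS_ACCESS_KEY_ID: "REPLACE_ME"
  AWS_SECRET_ACCESS_KEY: "REPLACE_ME"
```

Reference it from the worker pod spec with `envFrom: [{secretRef: {name: strata-s3-credentials}}]` so the S3 client picks up the standard AWS variables.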
2. Local Persistence (PVCs)
For on-premise clusters or low-latency checkpointing, you can use Persistent Volume Claims.
persistence:
  enabled: true
  storageClass: "fast-nvme"
  accessMode: ReadWriteMany
  size: 500Gi
When using STORAGE_BACKEND=local, the Helm chart mounts the PVC to /data/checkpoints across all worker pods to ensure consistent state access during recovery.
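A hand-written equivalent of the claim the chart renders from the persistence block might look like the following. The PVC name strata-checkpoints is an assumption; note that ReadWriteMany requires a storage class whose backend supports shared access (e.g. NFS- or CephFS-backed), which not every "fast-nvme" class does:

```yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: strata-checkpoints
  namespace: ml-training
spec:
  accessModes:
    - ReadWriteMany
  storageClassName: fast-nvme
  resources:
    requests:
      storage: 500Gi
```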
Deploying to the Cluster
1. Add the repository (or use the local chart path):
   helm install strata ./deploy/charts/strata --namespace ml-training
2. Verify the deployment:
   kubectl get pods -n ml-training
   # Expected: 1 coordinator, N workers, 1 dashboard
3. Access the dashboard: The dashboard connects to the Coordinator's HTTP API. You can expose it via an Ingress or port-forward:
   kubectl port-forward svc/strata-dashboard 3000:3000
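If you prefer a stable URL over port-forwarding, a minimal Ingress for the dashboard could look like this sketch. The hostname, the nginx ingress class, and the service name strata-dashboard are assumptions to adapt to your cluster:

```yaml
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: strata-dashboard
  namespace: ml-training
spec:
  ingressClassName: nginx
  rules:
    - host: strata.example.com
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: strata-dashboard
                port:
                  number: 3000
```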
Scaling Workers
Scaling training capacity is managed by adjusting the replicaCount of the worker deployment. Strata's Consistent Hashing mechanism automatically rebalances data shards when new workers join the cluster.
Manual Scaling
kubectl scale deployment strata-worker --replicas=32 -n ml-training
Autoscaling
You can define a Horizontal Pod Autoscaler (HPA) to scale workers automatically. The example below scales on CPU utilization; with a custom-metrics adapter installed, you can instead scale on Strata-specific signals such as shard processing queue depth:
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: strata-worker-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: strata-worker
  minReplicas: 4
  maxReplicas: 100
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 80
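To scale on queue depth instead of CPU, you would expose that metric through a custom-metrics adapter (e.g. prometheus-adapter) and swap in a Pods metric. This is a sketch: the metric name strata_shard_queue_depth and the target value are assumptions, not metrics Strata is documented to export under these names:

```yaml
  metrics:
    - type: Pods
      pods:
        metric:
          name: strata_shard_queue_depth
        target:
          type: AverageValue
          averageValue: "50"
```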
Fault Tolerance & Self-Healing
The Kubernetes orchestration layer interacts with Strata's internal fault tolerance:
- Liveness/Readiness Probes: The Coordinator gRPC service includes health checks. If the Coordinator fails, Kubernetes restarts the pod, and workers will automatically re-register via their heartbeat loop.
- Node Eviction: If a worker pod is evicted, the Coordinator detects the missing heartbeat, marks the worker as FAILED, and triggers a shard re-assignment to the remaining healthy workers.
- Checkpoint Recovery: Upon pod restart, workers query the Coordinator for the latest globally consistent checkpoint stored in S3/PVC and resume training from that specific step/epoch.
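The probe behavior described above is wired into the Coordinator's pod spec. A sketch using the ports from values.yaml, assuming Kubernetes >= 1.24 (native gRPC liveness probes); the /healthz HTTP path is an assumption, not a documented Strata endpoint:

```yaml
livenessProbe:
  grpc:
    port: 50051
  initialDelaySeconds: 10
  periodSeconds: 15
readinessProbe:
  httpGet:
    path: /healthz
    port: 8080
  periodSeconds: 5
```

If the liveness probe fails repeatedly, Kubernetes restarts the Coordinator pod, and workers re-register via their heartbeat loop as described above.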