Architecture Overview
Strata is designed as a high-performance, distributed hub-and-spoke system. At its core, a central Coordinator manages the global state, while distributed Workers (typically GPU-accelerated training processes) interact with the runtime via gRPC or Python bindings to handle data ingestion and state persistence.
Core Components
The Strata ecosystem is composed of several specialized crates designed for horizontal scalability and low-latency synchronization.
1. The Coordinator (crates/coordinator)
The Coordinator is the "brain" of the system. It is a high-performance gRPC server built with tonic and tokio. Its primary responsibilities include:
- Worker Lifecycle: Managing registration, health checks (heartbeats), and failure detection.
- Barrier Synchronization: Coordinating lock-step execution across thousands of workers.
- Global Metadata: Tracking dataset registration and checkpoint versions.
HTTP API for Monitoring
The coordinator also exposes a REST API (via axum) for dashboard integration.
- Endpoint: GET /api/dashboard
- Response: Returns the full system state, including workers, datasets, checkpoints, and metrics.
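A response might look like the following. The four top-level keys come from the endpoint description above; the nested fields are illustrative assumptions, not the exact schema:

```json
{
  "workers": [{ "id": "worker-01", "status": "Training" }],
  "datasets": [{ "name": "imagenet-1k", "total_samples": 1281167 }],
  "checkpoints": [{ "step": 5000, "epoch": 2 }],
  "metrics": { "rps": 1250.0, "checkpoint_mbps": 840.5 }
}
```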
2. Data Sharding (crates/data-shard)
Strata implements deterministic sharding logic to ensure that training data is distributed evenly and efficiently. It uses consistent hashing to minimize data movement when workers join or leave the cluster.
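To make the consistent-hashing idea concrete, here is a minimal std-only ring sketch. The `Ring` type, virtual-node count, and use of `DefaultHasher` are illustrative assumptions, not Strata's actual implementation:

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::BTreeMap;
use std::hash::{Hash, Hasher};

fn hash_key<T: Hash + ?Sized>(key: &T) -> u64 {
    let mut h = DefaultHasher::new();
    key.hash(&mut h);
    h.finish()
}

/// Minimal consistent-hash ring: each worker is placed on a u64 ring at
/// several "virtual node" positions; a shard maps to the first worker
/// found clockwise from the shard's own hash.
struct Ring {
    ring: BTreeMap<u64, String>,
    replicas: usize,
}

impl Ring {
    fn new(replicas: usize) -> Self {
        Ring { ring: BTreeMap::new(), replicas }
    }

    fn add_worker(&mut self, worker: &str) {
        for i in 0..self.replicas {
            self.ring.insert(hash_key(&format!("{worker}#{i}")), worker.to_string());
        }
    }

    fn remove_worker(&mut self, worker: &str) {
        for i in 0..self.replicas {
            self.ring.remove(&hash_key(&format!("{worker}#{i}")));
        }
    }

    /// First worker at or after the shard's ring position, wrapping around.
    fn lookup(&self, shard: &str) -> Option<&String> {
        let h = hash_key(shard);
        self.ring
            .range(h..)
            .next()
            .or_else(|| self.ring.iter().next())
            .map(|(_, w)| w)
    }
}
```

Because only the keys that hashed to a removed worker's ring positions move, most shard-to-worker assignments survive membership changes.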
Key Interface: ShardManager
The ShardManager allows you to register datasets and calculate which worker should process which shard for a specific epoch.
```rust
// Register a dataset for distributed loading
manager.register_dataset_params(
    "imagenet-1k",
    1281167, // total samples
    10000,   // samples per shard
    true,    // enable shuffling
    42,      // random seed
);

// Get assignments for a specific worker and epoch
let assignments = manager.get_shard_for_worker("imagenet-1k", "worker-01", 0);
```
3. Distributed Checkpointing (crates/checkpoint)
Checkpointing in Strata is designed to be non-blocking. The CheckpointManager handles the coordination of saving model weights and optimizer states across the fleet.
- Non-blocking I/O: Leverages async Rust to stream data to storage without stalling the training loop.
- Consistency: Uses the Coordinator's barrier sync to ensure all workers have reached the same global step before finalizing a checkpoint.
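To make the non-blocking hand-off concrete, here is a thread-based write-behind sketch. The `AsyncCheckpointer` type is hypothetical, and Strata itself uses async Rust rather than a dedicated thread, so treat this as an analogy for the control flow only:

```rust
use std::sync::mpsc;
use std::thread;

/// Write-behind checkpointer: the training loop hands off serialized state
/// and returns immediately; a background thread performs the slow write.
struct AsyncCheckpointer {
    tx: mpsc::Sender<(u64, Vec<u8>)>,
    handle: thread::JoinHandle<usize>,
}

impl AsyncCheckpointer {
    fn new() -> Self {
        let (tx, rx) = mpsc::channel::<(u64, Vec<u8>)>();
        let handle = thread::spawn(move || {
            let mut written = 0;
            for (_step, _bytes) in rx {
                // A real implementation would stream `_bytes` to the
                // storage backend here; this sketch just counts writes.
                written += 1;
            }
            written
        });
        AsyncCheckpointer { tx, handle }
    }

    /// Returns immediately; the write happens off the training thread.
    fn save(&self, step: u64, bytes: Vec<u8>) {
        self.tx.send((step, bytes)).expect("writer thread alive");
    }

    /// Drops the sender so the writer drains and exits, then joins it.
    fn shutdown(self) -> usize {
        drop(self.tx);
        self.handle.join().expect("writer thread panicked")
    }
}
```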
4. Pluggable Storage (crates/storage)
Strata abstracts storage through the StorageBackend trait, allowing seamless switching between local development and cloud production environments.
- Local Storage: For development or single-node testing.
- S3 Storage: High-throughput persistence for large-scale production clusters.
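As a rough sketch of what a trait-based storage abstraction can look like, here is a synchronous, in-memory version. The method names and the `MemoryBackend` type are assumptions for illustration; the real `StorageBackend` trait is presumably async and its exact signatures are not shown here:

```rust
use std::collections::HashMap;

/// Sketch of a pluggable storage abstraction: callers depend only on the
/// trait, so local-disk and S3 implementations are interchangeable.
trait StorageBackend {
    fn put(&mut self, key: &str, bytes: Vec<u8>) -> Result<(), String>;
    fn get(&self, key: &str) -> Option<&[u8]>;
}

/// In-memory backend, standing in for local-disk or S3 implementations.
struct MemoryBackend {
    objects: HashMap<String, Vec<u8>>,
}

impl StorageBackend for MemoryBackend {
    fn put(&mut self, key: &str, bytes: Vec<u8>) -> Result<(), String> {
        self.objects.insert(key.to_string(), bytes);
        Ok(())
    }

    fn get(&self, key: &str) -> Option<&[u8]> {
        self.objects.get(key).map(|v| v.as_slice())
    }
}
```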
Configuration Example:
```bash
# Set storage backend via environment variables
STORAGE_BACKEND=s3
CHECKPOINT_BUCKET=my-ml-checkpoints
AWS_REGION=us-west-2
```
5. Python Interop (crates/python-bindings)
To support the ML ecosystem (PyTorch, JAX, TensorFlow), Strata provides high-performance Python bindings authored in Rust using PyO3. This allows researchers to use Rust's safety and performance with standard Pythonic syntax.
```python
from strata import DatasetRegistry, CheckpointManager

# Initialize the checkpoint manager targeting S3
ckpt_manager = CheckpointManager(path="s3://my-bucket/models", keep_count=5)

# Save state from within the training loop
ckpt_manager.save(model_state_bytes, step=5000, epoch=2)
```
Communication Flow
- Registration: A worker starts and registers with the Coordinator via gRPC, reporting its GPU count and network address.
- Assignment: The worker requests a ShardAssignment. The Coordinator uses the ShardManager to return a list of data indices based on the current epoch and cluster topology.
- Synchronization: During training, workers hit a Barrier. The Coordinator holds the response until the expected_participants count is met.
- Persistence: When a checkpoint is triggered, workers stream data to the Storage Backend. Once successful, they send a CheckpointAck to the Coordinator to finalize the metadata.
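The hold-until-complete behavior in the Synchronization step can be sketched as a counting barrier. This `Barrier` type uses only standard-library primitives and is illustrative; the real Coordinator implements the same pattern over gRPC with tokio:

```rust
use std::sync::{Arc, Condvar, Mutex};
use std::thread;

/// Coordinator-side barrier sketch: each worker's "request" blocks until
/// `expected_participants` arrivals have been counted, mirroring how the
/// Coordinator holds gRPC responses until the barrier is complete.
struct Barrier {
    expected: usize,
    arrived: Mutex<usize>,
    cv: Condvar,
}

impl Barrier {
    fn new(expected_participants: usize) -> Self {
        Barrier {
            expected: expected_participants,
            arrived: Mutex::new(0),
            cv: Condvar::new(),
        }
    }

    /// Blocks the caller until all participants have arrived.
    fn wait(&self) {
        let mut arrived = self.arrived.lock().unwrap();
        *arrived += 1;
        if *arrived == self.expected {
            // Last arrival releases everyone.
            self.cv.notify_all();
        } else {
            while *arrived < self.expected {
                arrived = self.cv.wait(arrived).unwrap();
            }
        }
    }
}
```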
Observability
Strata includes a real-time React dashboard that connects to the Coordinator's HTTP API. It provides a visual overview of:
- System Throughput: Checkpoint MB/s and Coordinator Requests Per Second (RPS).
- Worker Health: Real-time status (Training, Checkpointing, Idle, or Failed).
- Data Distribution: Visual breakdown of shard assignments across the fleet.
- Barrier Latency: p99 metrics for synchronization bottlenecks.
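For reference, a common way to produce a p99 figure like the dashboard's barrier-latency metric is the nearest-rank method over raw samples. This helper is a sketch of that calculation, not code from Strata:

```rust
/// Nearest-rank p99: the value at index ceil(0.99 * n) - 1 in the sorted
/// sample set. Returns None for an empty sample set.
fn p99(samples: &mut [u64]) -> Option<u64> {
    if samples.is_empty() {
        return None;
    }
    samples.sort_unstable();
    let rank = ((samples.len() as f64) * 0.99).ceil() as usize;
    Some(samples[rank - 1])
}
```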