# Quick Start
Launch a local cluster and run your first distributed-training simulation in under five minutes.
## Prerequisites
Before starting, ensure you have the following installed:
- Docker and Docker Compose
- Rust 1.75+ (only required for local development)
- Python 3.9+ (for training script integration)
## 1. Launch the Cluster
The fastest way to get started is using Docker Compose. This command spins up a central Coordinator, four simulated Workers, and the Web Dashboard.
```bash
# Clone the repository
git clone https://github.com/syrilj/Strata.git
cd Strata

# Start the services
docker-compose up --build
```
Once the containers are running:
- Dashboard: View real-time cluster state at http://localhost:3000
- Coordinator API: REST endpoints available at http://localhost:8080
- gRPC Interface: Workers connect via localhost:50051
## 2. Run Your First Training Job
You can initiate a training task directly through the Dashboard UI or via the REST API. A task instructs the coordinator to assign data shards and coordinate synchronization barriers across the worker pool.
### Via REST API
```bash
curl -X POST http://localhost:8080/api/tasks \
  -H "Content-Type: application/json" \
  -d '{
    "name": "Quickstart-Simulation",
    "type": "training",
    "dataset_id": "imagenet-mock",
    "worker_count": 4,
    "config": {
      "epochs": 10,
      "batch_size": 32
    }
  }'
```
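The same request can be issued from a Python script using only the standard library. This sketch mirrors the curl payload above; the endpoint and fields are taken directly from that example, nothing new is assumed about the API.

```python
import json
import urllib.request

# Task payload mirroring the curl example above
payload = {
    "name": "Quickstart-Simulation",
    "type": "training",
    "dataset_id": "imagenet-mock",
    "worker_count": 4,
    "config": {"epochs": 10, "batch_size": 32},
}

def submit_task(base_url: str = "http://localhost:8080") -> urllib.request.Request:
    """Build a POST request for the coordinator's task endpoint."""
    return urllib.request.Request(
        f"{base_url}/api/tasks",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

# Once the cluster is up, send it with:
# with urllib.request.urlopen(submit_task()) as resp:
#     print(resp.status, resp.read().decode())
```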
What Happens Next:
- Shard Assignment: The `data-shard` crate uses consistent hashing to assign dataset segments to the 4 workers.
- Barrier Sync: Workers hit a synchronization barrier at the end of each simulated epoch.
- Checkpointing: The coordinator triggers a state save, visible in the "Checkpoints" tab of the dashboard.
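As a rough illustration of the shard-assignment step, here is a toy consistent-hash ring in Python. It is not the `data-shard` crate's actual implementation, just the general technique: each worker owns many virtual points on a ring, and a shard belongs to the first point clockwise of its hash, so removing a worker only remaps that worker's shards.

```python
import bisect
import hashlib

def _hash(key: str) -> int:
    return int(hashlib.md5(key.encode()).hexdigest(), 16)

class HashRing:
    """Toy consistent-hash ring: each worker owns several virtual points."""

    def __init__(self, workers, vnodes=64):
        self._points = sorted(
            (_hash(f"{w}#{i}"), w) for w in workers for i in range(vnodes)
        )
        self._keys = [p for p, _ in self._points]

    def owner(self, shard_id: str) -> str:
        # The first ring point clockwise of the shard's hash owns the shard
        idx = bisect.bisect(self._keys, _hash(shard_id)) % len(self._keys)
        return self._points[idx][1]

workers = [f"worker-{i:02d}" for i in range(1, 5)]
ring = HashRing(workers)
assignment = {f"shard-{n}": ring.owner(f"shard-{n}") for n in range(8)}
```

Dropping `worker-04` from the ring leaves every shard owned by the other workers untouched, which is the property that makes rebalancing cheap.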
## 3. Integrate with Python
To use Strata in your own training scripts, use the `dtruntime` Python bindings, which let your Python training loop interact with the high-performance Rust core.
```python
from dtruntime import TrainingOrchestrator, CheckpointManager

# Initialize the orchestrator (connects to the Rust coordinator)
orchestrator = TrainingOrchestrator(coordinator_url="localhost:50051")

# Register this worker
worker_info = orchestrator.register_worker(worker_id="worker-01")

# Get assigned data shards for the current epoch
shards = orchestrator.get_shards(dataset_id="my_dataset", epoch=0)

for shard in shards:
    print(f"Processing shard {shard.id} from {shard.start_index} to {shard.end_index}")
    # ... Your training loop here ...

# Synchronize with other workers
orchestrator.wait_at_barrier("epoch_sync", timeout_seconds=30)

# Save a checkpoint via the high-performance Rust backend
ckpt = CheckpointManager(storage_path="s3://my-bucket/checkpoints")
ckpt.save(model_state_bytes, step=1000, epoch=1)
```
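`wait_at_barrier` behaves like a distributed rendezvous: no worker proceeds past the epoch boundary until every worker has arrived (or the timeout fires). That semantics can be modeled locally with Python's `threading.Barrier`; this is a stand-in for intuition, not the Strata implementation.

```python
import threading

# Local stand-in for the coordinator's barrier: no worker is released
# past the epoch boundary until all four have arrived.
NUM_WORKERS = 4
barrier = threading.Barrier(NUM_WORKERS, timeout=30)
order = []

def worker(worker_id: int):
    order.append(("arrived", worker_id))
    barrier.wait()                      # blocks until all workers arrive
    order.append(("released", worker_id))

threads = [threading.Thread(target=worker, args=(i,)) for i in range(NUM_WORKERS)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

Every "arrived" entry lands in `order` before any "released" entry, which is exactly the guarantee the epoch-sync barrier gives the worker pool.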
## 4. Development Mode (Optional)
If you want to run the coordinator locally for debugging without Docker:
```bash
# Start the coordinator service
cargo run -p coordinator

# In a separate terminal, start the dashboard
cd dashboard
npm install
npm run dev
```
## Environment Configuration
Runtime behavior can be customized via a `.env` file. Copy the example to get started:

```bash
cp .env.example .env
```
| Variable | Default | Description |
| :--- | :--- | :--- |
| `STORAGE_BACKEND` | `local` | Set to `s3` for AWS/MinIO integration. |
| `CHECKPOINT_BUCKET` | - | The S3 bucket name for state persistence. |
| `RUST_LOG` | `info` | Adjust log verbosity (`debug`, `info`, `warn`, `error`). |
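If your Python training script needs the same settings, a `.env` file can be read without extra dependencies. This is a minimal illustrative parser for `KEY=VALUE` lines; Strata's runtime may load the file differently.

```python
from pathlib import Path

def load_env(path: str = ".env") -> dict:
    """Minimal .env parser: KEY=VALUE lines; blanks and '#' comments ignored."""
    env = {}
    for line in Path(path).read_text().splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, _, value = line.partition("=")
        env[key.strip()] = value.strip()
    return env
```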