# Quick Start
Launch a local cluster and run your first distributed-training simulation in under five minutes.
## Prerequisites
Before starting, ensure you have the following installed:
- Docker and Docker Compose
- Rust 1.75+ (only required for local development)
- Python 3.9+ (for training script integration)
## 1. Launch the Cluster
The fastest way to get started is using Docker Compose. This command spins up a central Coordinator, four simulated Workers, and the Web Dashboard.
```bash
# Clone the repository
git clone https://github.com/syrilj/Strata.git
cd Strata

# Start the services
docker-compose up --build
```
Once the containers are running:
- Dashboard: View real-time cluster state at http://localhost:3000
- Coordinator API: REST endpoints available at http://localhost:8080
- gRPC Interface: Workers connect via localhost:50051
## 2. Run Your First Training Job
You can initiate a training task directly through the Dashboard UI or via the REST API. A task instructs the coordinator to assign data shards and coordinate synchronization barriers across the worker pool.
### Via REST API
```bash
curl -X POST http://localhost:8080/api/tasks \
  -H "Content-Type: application/json" \
  -d '{
    "name": "Quickstart-Simulation",
    "type": "training",
    "dataset_id": "imagenet-mock",
    "worker_count": 4,
    "config": {
      "epochs": 10,
      "batch_size": 32
    }
  }'
```
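The same request can be issued from a Python script using only the standard library. This sketch mirrors the curl payload above; the endpoint and fields are taken directly from that example, nothing new is assumed about the API.

```python
import json
import urllib.request

# Task payload mirroring the curl example above
payload = {
    "name": "Quickstart-Simulation",
    "type": "training",
    "dataset_id": "imagenet-mock",
    "worker_count": 4,
    "config": {"epochs": 10, "batch_size": 32},
}

def submit_task(base_url: str = "http://localhost:8080") -> urllib.request.Request:
    """Build a POST request for the coordinator's task endpoint."""
    return urllib.request.Request(
        f"{base_url}/api/tasks",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

# Once the cluster is up, send it with:
# with urllib.request.urlopen(submit_task()) as resp:
#     print(resp.status, resp.read().decode())
```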
What Happens Next:
- Shard Assignment: The `data-shard` crate uses consistent hashing to assign dataset segments to the 4 workers.
- Barrier Sync: Workers hit a synchronization barrier at the end of each simulated epoch.
- Checkpointing: The coordinator triggers a state save, visible in the "Checkpoints" tab of the dashboard.
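As a rough illustration of the shard-assignment step, here is a toy consistent-hash ring in Python. It is not the `data-shard` crate's actual implementation, just the general technique: each worker owns many virtual points on a ring, and a shard belongs to the first point clockwise of its hash, so removing a worker only remaps that worker's shards.

```python
import bisect
import hashlib

def _hash(key: str) -> int:
    return int(hashlib.md5(key.encode()).hexdigest(), 16)

class HashRing:
    """Toy consistent-hash ring: each worker owns several virtual points."""

    def __init__(self, workers, vnodes=64):
        self._points = sorted(
            (_hash(f"{w}#{i}"), w) for w in workers for i in range(vnodes)
        )
        self._keys = [p for p, _ in self._points]

    def owner(self, shard_id: str) -> str:
        # The first ring point clockwise of the shard's hash owns the shard
        idx = bisect.bisect(self._keys, _hash(shard_id)) % len(self._keys)
        return self._points[idx][1]

workers = [f"worker-{i:02d}" for i in range(1, 5)]
ring = HashRing(workers)
assignment = {f"shard-{n}": ring.owner(f"shard-{n}") for n in range(8)}
```

Dropping `worker-04` from the ring leaves every shard owned by the other workers untouched, which is the property that makes rebalancing cheap.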
## 3. Integrate with Python
To use Strata in your own training scripts, use the `dtruntime` Python bindings, which let your Python training loop interact with the high-performance Rust core.
```python
from dtruntime import TrainingOrchestrator, CheckpointManager

# Initialize the orchestrator (connects to the Rust coordinator)
orchestrator = TrainingOrchestrator(coordinator_url="localhost:50051")

# Register this worker
worker_info = orchestrator.register_worker(worker_id="worker-01")

# Get assigned data shards for the current epoch
shards = orchestrator.get_shards(dataset_id="my_dataset", epoch=0)

for shard in shards:
    print(f"Processing shard {shard.id} from {shard.start_index} to {shard.end_index}")
    # ... Your training loop here ...

# Synchronize with other workers
orchestrator.wait_at_barrier("epoch_sync", timeout_seconds=30)

# Save a checkpoint via the high-performance Rust backend
ckpt = CheckpointManager(storage_path="s3://my-bucket/checkpoints")
ckpt.save(model_state_bytes, step=1000, epoch=1)
```
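`wait_at_barrier` behaves like a distributed rendezvous: no worker proceeds past the epoch boundary until every worker has arrived (or the timeout fires). That semantics can be modeled locally with Python's `threading.Barrier`; this is a stand-in for intuition, not the Strata implementation.

```python
import threading

# Local stand-in for the coordinator's barrier: no worker is released
# past the epoch boundary until all four have arrived.
NUM_WORKERS = 4
barrier = threading.Barrier(NUM_WORKERS, timeout=30)
order = []

def worker(worker_id: int):
    order.append(("arrived", worker_id))
    barrier.wait()                      # blocks until all workers arrive
    order.append(("released", worker_id))

threads = [threading.Thread(target=worker, args=(i,)) for i in range(NUM_WORKERS)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

Every "arrived" entry lands in `order` before any "released" entry, which is exactly the guarantee the epoch-sync barrier gives the worker pool.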
## 4. Development Mode (Optional)
If you want to run the coordinator locally for debugging without Docker:
```bash
# Start the coordinator service
cargo run -p coordinator

# In a separate terminal, start the dashboard
cd dashboard
npm install
npm run dev
```
## Environment Configuration
Runtime behavior can be customized via a `.env` file. Copy the example to get started:

```bash
cp .env.example .env
```
| Variable | Default | Description |
| :--- | :--- | :--- |
| `STORAGE_BACKEND` | `local` | Set to `s3` for AWS/MinIO integration. |
| `CHECKPOINT_BUCKET` | - | The S3 bucket name for state persistence. |
| `RUST_LOG` | `info` | Adjust log verbosity (`debug`, `info`, `warn`, `error`). |
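If your Python training script needs the same settings, a `.env` file can be read without extra dependencies. This is a minimal illustrative parser for `KEY=VALUE` lines; Strata's runtime may load the file differently.

```python
from pathlib import Path

def load_env(path: str = ".env") -> dict:
    """Minimal .env parser: KEY=VALUE lines; blanks and '#' comments ignored."""
    env = {}
    for line in Path(path).read_text().splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, _, value = line.partition("=")
        env[key.strip()] = value.strip()
    return env
```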