Synchronization Barriers

Overview

Synchronization barriers in Strata ensure that distributed workers maintain alignment during training—typically at the end of an epoch or a specific training step. The Coordinator service acts as the central authority, managing barrier states and releasing workers only when the required number of participants have "arrived."

Strata provides a low-latency implementation using gRPC and asynchronous notification primitives to achieve a p99 latency of <50ms for 100 workers.

Worker Synchronization (gRPC)

Workers synchronize by calling the Barrier method on the Coordinator gRPC service. This is a blocking operation from the worker's perspective; the Coordinator holds the request until the barrier condition is met.

Barrier Request

To join a barrier, a worker sends a BarrierRequest containing:

Barrier Response

Once the barrier is released, the Coordinator returns:

Example: Worker Implementation (Python)

Using the Python bindings, workers typically wrap barrier calls around training loops:

from strata import TrainingOrchestrator

orchestrator = TrainingOrchestrator(coordinator_url="localhost:50051")

for epoch in range(total_epochs):
    # ... training logic ...
    
    # Synchronize all workers before starting next epoch
    print(f"Worker {worker_id} reached end of epoch {epoch}")
    sync_info = orchestrator.wait_at_barrier(
        barrier_id=f"epoch_{epoch}",
        step=current_step
    )
    
    print(f"Barrier released! We were worker #{sync_info.arrival_order} to arrive.")

Internal Core Logic

While workers interact via gRPC, the runtime-core crate defines the logic for barrier state transitions.

`BarrierState`

The BarrierState tracks the lifecycle of a single synchronization point:

pub struct BarrierState {
    pub id: BarrierId,
    pub step: Step,
    pub expected_participants: usize,
    pub arrived_workers: Vec<WorkerId>,
    pub released: bool,
    pub created_at: DateTime<Utc>,
    pub released_at: Option<DateTime<Utc>>,
}

Key behaviors:

Arrival: When a worker arrives, they are added to arrived_workers. If the count matches expected_participants, released becomes true.
Deterministic Order: The arrival_order can be used by workers to elect a "leader" for specific tasks (e.g., only the first worker to arrive performs an expensive logging operation).

Monitoring and Observability

You can monitor active and historical barriers through the Dashboard or the HTTP REST API.

REST API Endpoint

GET /api/barriers

Response Example:

[
  {
    "id": "epoch_12",
    "name": "Global Step Sync",
    "arrived": 8,
    "total": 10,
    "status": "waiting",
    "created_at": 1715600000
  }
]

Metrics

The Coordinator tracks the following barrier-related performance metrics:

barrier_latency_p99: The time elapsed between the first worker's arrival and the final worker's arrival (release).
active_barriers: Number of synchronization points currently awaiting participants.

Configuration

Barrier behavior is influenced by the Coordinator's worker registry settings. Ensure the max_workers and heartbeat_timeout are configured correctly in the CoordinatorService to prevent "stale" workers from hanging a barrier indefinitely.

If a worker fails (stops sending heartbeats), the Coordinator's fault tolerance logic can be configured to release the barrier with an error or rebalance the expected participant count.