Synchronization Barriers
Overview
Synchronization barriers in Strata ensure that distributed workers maintain alignment during training—typically at the end of an epoch or a specific training step. The Coordinator service acts as the central authority, managing barrier states and releasing workers only when the required number of participants have "arrived."
Strata provides a low-latency implementation using gRPC and asynchronous notification primitives to achieve a p99 latency of <50ms for 100 workers.
Worker Synchronization (gRPC)
Workers synchronize by calling the Barrier method on the Coordinator gRPC service. This is a blocking operation from the worker's perspective; the Coordinator holds the request until the barrier condition is met.
Barrier Request
To join a barrier, a worker sends a BarrierRequest containing:
| Field | Type | Description |
| :--- | :--- | :--- |
| barrier_id | string | Unique identifier for the sync point (e.g., "epoch_5_end"). |
| worker_id | string | The unique ID of the worker arriving at the barrier. |
| step | uint64 | The training step associated with this synchronization. |
Barrier Response
Once the barrier is released, the Coordinator returns:
| Field | Type | Description |
| :--- | :--- | :--- |
| success | bool | Whether the synchronization was successful. |
| arrival_order | uint32 | The sequence number in which this worker arrived (1-indexed). |
Example: Worker Implementation (Python)
Using the Python bindings, workers typically wrap barrier calls around training loops:
from strata import TrainingOrchestrator
orchestrator = TrainingOrchestrator(coordinator_url="localhost:50051")
for epoch in range(total_epochs):
# ... training logic ...
# Synchronize all workers before starting next epoch
print(f"Worker {worker_id} reached end of epoch {epoch}")
sync_info = orchestrator.wait_at_barrier(
barrier_id=f"epoch_{epoch}",
step=current_step
)
print(f"Barrier released! We were worker #{sync_info.arrival_order} to arrive.")
Internal Core Logic
While workers interact via gRPC, the runtime-core crate defines the logic for barrier state transitions.
BarrierState
The BarrierState tracks the lifecycle of a single synchronization point:
pub struct BarrierState {
pub id: BarrierId,
pub step: Step,
pub expected_participants: usize,
pub arrived_workers: Vec<WorkerId>,
pub released: bool,
pub created_at: DateTime<Utc>,
pub released_at: Option<DateTime<Utc>>,
}
Key behaviors:
- Arrival: When a worker arrives, they are added to
arrived_workers. If the count matchesexpected_participants,releasedbecomestrue. - Deterministic Order: The
arrival_ordercan be used by workers to elect a "leader" for specific tasks (e.g., only the first worker to arrive performs an expensive logging operation).
Monitoring and Observability
You can monitor active and historical barriers through the Dashboard or the HTTP REST API.
REST API Endpoint
GET /api/barriers
Response Example:
[
{
"id": "epoch_12",
"name": "Global Step Sync",
"arrived": 8,
"total": 10,
"status": "waiting",
"created_at": 1715600000
}
]
Metrics
The Coordinator tracks the following barrier-related performance metrics:
barrier_latency_p99: The time elapsed between the first worker's arrival and the final worker's arrival (release).active_barriers: Number of synchronization points currently awaiting participants.
Configuration
Barrier behavior is influenced by the Coordinator's worker registry settings. Ensure the max_workers and heartbeat_timeout are configured correctly in the CoordinatorService to prevent "stale" workers from hanging a barrier indefinitely.
If a worker fails (stops sending heartbeats), the Coordinator's fault tolerance logic can be configured to release the barrier with an error or rebalance the expected participant count.