gRPC Protocol Definition
Overview
The Strata Coordinator uses gRPC as its primary communication layer for worker orchestration. This high-performance, bidirectional interface handles worker lifecycle management, data shard distribution, and synchronization barriers.
For developers integrating new worker types (e.g., custom C++ trainers or specialized hardware runners), the gRPC protocol defines the contract for interacting with the Strata ecosystem.
Service Definition
The Coordinator service is the central entry point for all training workers.
service Coordinator {
// Lifecycle Management
rpc RegisterWorker(WorkerInfo) returns (WorkerConfig);
rpc Heartbeat(HeartbeatRequest) returns (HeartbeatResponse);
// Data Orchestration
rpc GetNextShards(ShardRequest) returns (ShardAssignment);
rpc AckDataset(DatasetInfo) returns (DatasetAck);
// Synchronization & State
rpc Barrier(BarrierRequest) returns (BarrierResponse);
rpc ReportCheckpoint(CheckpointInfo) returns (CheckpointAck);
rpc RequestRecovery(RecoveryRequest) returns (RecoveryResponse);
}
Lifecycle RPCs
RegisterWorker
Initializes the handshake between a training worker and the coordinator. This must be called before any other RPC.
- Input:
WorkerInfo(IP, Port, GPU count, and unique Worker ID). - Output:
WorkerConfig(Assigned heartbeat intervals and system-wide training parameters).
Heartbeat
Workers must send heartbeats at the interval defined during registration (default: 5s). This RPC provides telemetry and health status to the coordinator.
- Input:
HeartbeatRequestworker_id: Unique identifier.status: CurrentWorkerStatus(Training, Idle, etc.).resources:ResourceUsagemetrics (CPU, Memory, Network, GPU).
- Output:
HeartbeatResponsecommand: Optional instruction from the coordinator (e.g., "Stop", "Rebalance").
Data & Sharding RPCs
GetNextShards
Used by workers to request their next batch of data work-units based on the consistent hashing logic.
- Input:
ShardRequestdataset_id: The registered dataset name.epoch: The current training epoch.
- Output:
ShardAssignmentshard_id: Unique ID of the shard.file_paths: List of physical file locations (S3 or Local).start_index/end_index: Range within the dataset.
AckDataset
Confirms that a worker has successfully pre-fetched or validated a dataset's metadata.
Synchronization RPCs
Barrier
Implements a distributed barrier for synchronous SGD or epoch-level coordination.
- Input:
BarrierRequestbarrier_id: Unique string (usually "epoch_N" or "step_M").step: Current global training step.
- Output:
BarrierResponse- The call blocks (or returns a status) until the
expected_participantsthreshold is met.
- The call blocks (or returns a status) until the
ReportCheckpoint
Notifies the coordinator that a local checkpoint has been successfully persisted to the storage backend (S3/Local).
- Input:
CheckpointInfocheckpoint_id: Unique UUID.path: The URI where the checkpoint is stored.size_bytes: Total size of the saved state.
- Output:
CheckpointAck
Core Data Schemas
WorkerStatus
The state machine of a worker as perceived by the coordinator and the dashboard.
| State | Description |
| :--- | :--- |
| INITIALIZING | Worker is starting up and loading libraries. |
| IDLE | Worker is connected but not yet assigned to a task. |
| LOADING_DATA | Worker is pulling shards from the coordinator. |
| TRAINING | Active compute phase. |
| CHECKPOINTING | Currently writing state to the storage backend. |
| RECOVERING | Reloading state after a failure. |
| ERROR | Fatal failure state; requires manual or automated restart. |
ResourceUsage
Telemetry data sent in every heartbeat for real-time monitoring.
message ResourceUsage {
double cpu_percent = 1;
uint64 memory_used_bytes = 2;
uint64 disk_read_bytes = 3;
uint64 disk_write_bytes = 4;
uint64 network_rx_bytes = 5;
uint64 network_tx_bytes = 6;
repeated GpuMetrics gpu_metrics = 7;
}
Error Handling
Strata uses standard gRPC status codes to communicate infrastructure-level issues:
FAILED_PRECONDITION: Attempting to request shards before registration.UNAVAILABLE: Coordinator is in the middle of a rebalance or failover.NOT_FOUND: Worker ID is unknown (likely due to a timeout/eviction from the registry).DEADLINE_EXCEEDED: Barrier synchronization took longer than the configured timeout.