gRPC Protocol Definition

Overview

The Strata Coordinator uses gRPC as its primary communication layer for worker orchestration. This high-performance, bidirectional interface handles worker lifecycle management, data shard distribution, and synchronization barriers.

For developers integrating new worker types (e.g., custom C++ trainers or specialized hardware runners), the gRPC protocol defines the contract for interacting with the Strata ecosystem.

Service Definition

The Coordinator service is the central entry point for all training workers.

service Coordinator {
    // Lifecycle Management
    rpc RegisterWorker(WorkerInfo) returns (WorkerConfig);
    rpc Heartbeat(HeartbeatRequest) returns (HeartbeatResponse);

    // Data Orchestration
    rpc GetNextShards(ShardRequest) returns (ShardAssignment);
    rpc AckDataset(DatasetInfo) returns (DatasetAck);

    // Synchronization & State
    rpc Barrier(BarrierRequest) returns (BarrierResponse);
    rpc ReportCheckpoint(CheckpointInfo) returns (CheckpointAck);
    rpc RequestRecovery(RecoveryRequest) returns (RecoveryResponse);
}

Lifecycle RPCs

RegisterWorker

Initializes the handshake between a training worker and the coordinator. This must be called before any other RPC.

Input: WorkerInfo (IP, Port, GPU count, and unique Worker ID).
Output: WorkerConfig (Assigned heartbeat intervals and system-wide training parameters).

Heartbeat

Workers must send heartbeats at the interval defined during registration (default: 5s). This RPC provides telemetry and health status to the coordinator.

Input: HeartbeatRequest
- worker_id: Unique identifier.
- status: Current WorkerStatus (Training, Idle, etc.).
- resources: ResourceUsage metrics (CPU, Memory, Network, GPU).
Output: HeartbeatResponse
- command: Optional instruction from the coordinator (e.g., "Stop", "Rebalance").

Data & Sharding RPCs

GetNextShards

Used by workers to request their next batch of data work-units based on the consistent hashing logic.

Input: ShardRequest
- dataset_id: The registered dataset name.
- epoch: The current training epoch.
Output: ShardAssignment
- shard_id: Unique ID of the shard.
- file_paths: List of physical file locations (S3 or Local).
- start_index / end_index: Range within the dataset.

AckDataset

Confirms that a worker has successfully pre-fetched or validated a dataset's metadata.

Synchronization RPCs

Barrier

Implements a distributed barrier for synchronous SGD or epoch-level coordination.

Input: BarrierRequest
- barrier_id: Unique string (usually "epoch_N" or "step_M").
- step: Current global training step.
Output: BarrierResponse
- The call blocks (or returns a status) until the expected_participants threshold is met.

ReportCheckpoint

Notifies the coordinator that a local checkpoint has been successfully persisted to the storage backend (S3/Local).

Input: CheckpointInfo
- checkpoint_id: Unique UUID.
- path: The URI where the checkpoint is stored.
- size_bytes: Total size of the saved state.
Output: CheckpointAck

Core Data Schemas

WorkerStatus

The state machine of a worker as perceived by the coordinator and the dashboard.

ResourceUsage

Telemetry data sent in every heartbeat for real-time monitoring.

message ResourceUsage {
    double cpu_percent = 1;
    uint64 memory_used_bytes = 2;
    uint64 disk_read_bytes = 3;
    uint64 disk_write_bytes = 4;
    uint64 network_rx_bytes = 5;
    uint64 network_tx_bytes = 6;
    repeated GpuMetrics gpu_metrics = 7;
}

Error Handling

Strata uses standard gRPC status codes to communicate infrastructure-level issues:

FAILED_PRECONDITION: Attempting to request shards before registration.
UNAVAILABLE: Coordinator is in the middle of a rebalance or failover.
NOT_FOUND: Worker ID is unknown (likely due to a timeout/eviction from the registry).
DEADLINE_EXCEEDED: Barrier synchronization took longer than the configured timeout.