# Troubleshooting & Recovery

## Worker Failures and Health Monitoring
The Strata runtime uses a heartbeat-based system to track worker health. Workers are expected to check in with the Coordinator at regular intervals.
### Identifying Failed Workers

You can monitor worker status via the Dashboard or the HTTP API. A worker is marked as `failed` if it misses heartbeats for a duration exceeding the `heartbeat_timeout` (default: 30 seconds).
Via HTTP API:

```shell
curl http://localhost:3000/api/workers
```
Common statuses:

- `active`: Worker is sending heartbeats and performing tasks.
- `failed`: Heartbeat timed out; shards are being reassigned.
- `recovering`: Worker is pulling the latest checkpoint from storage.
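The failure rule above can be expressed in a few lines. This is an illustrative sketch, not the Coordinator's actual code; the timestamps and the `classify_worker` helper are assumptions for the example:

```python
HEARTBEAT_TIMEOUT = 30.0  # seconds; the default heartbeat_timeout

def classify_worker(last_heartbeat_ts: float, now: float) -> str:
    """Mirror the Coordinator's failure rule: a worker whose last
    heartbeat is older than the timeout is marked failed."""
    return "failed" if (now - last_heartbeat_ts) > HEARTBEAT_TIMEOUT else "active"

# Example: one worker heartbeated 5s ago, another 45s ago
now = 1_000_000.0
print(classify_worker(now - 5.0, now))   # active
print(classify_worker(now - 45.0, now))  # failed
```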
### Automatic Shard Rebalancing
When a worker is marked as failed, the Coordinator uses Consistent Hashing to reassign its shards to the remaining healthy workers. This ensures training continues without manual intervention.
- **Impact:** You will see `assigned_shards` increase on healthy nodes.
- **Performance:** Shard reassignment typically completes in <10ms for 1,000 workers.
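To see why consistent hashing keeps reassignment cheap, here is a toy hash ring — not Strata's implementation. Removing a worker only moves the shards that worker owned; everything else stays put:

```python
import hashlib
from bisect import bisect_right

def _hash(key: str) -> int:
    return int(hashlib.md5(key.encode()).hexdigest(), 16)

class HashRing:
    """Toy consistent-hash ring: each shard maps to the first worker
    clockwise of its hash point on the ring."""
    def __init__(self, workers, vnodes=64):
        self.vnodes = vnodes
        self.ring = []  # sorted list of (point, worker)
        for w in workers:
            self.add(w)

    def add(self, worker):
        for i in range(self.vnodes):
            self.ring.append((_hash(f"{worker}#{i}"), worker))
        self.ring.sort()

    def remove(self, worker):
        self.ring = [(p, w) for p, w in self.ring if w != worker]

    def owner(self, shard: str) -> str:
        points = [p for p, _ in self.ring]
        idx = bisect_right(points, _hash(shard)) % len(self.ring)
        return self.ring[idx][1]

ring = HashRing(["worker-a", "worker-b", "worker-c"])
before = {s: ring.owner(s) for s in (f"shard-{i}" for i in range(100))}
ring.remove("worker-b")  # simulate a failed worker
after = {s: ring.owner(s) for s in before}
moved = [s for s in before if before[s] != after[s]]
# Only shards previously owned by worker-b were reassigned:
assert all(before[s] == "worker-b" for s in moved)
```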
## Barrier Synchronization Issues
Barrier synchronization ensures all workers reach the same training step before proceeding. If one worker hangs or crashes, the entire job may stall at a barrier.
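The stall behavior is easy to reproduce with a standard-library barrier. In this sketch, three workers are expected but only two arrive, so the healthy workers time out waiting — which is what a stalled barrier looks like from their side:

```python
import threading

# Three parties expected at the barrier, but one worker "crashed"
barrier = threading.Barrier(parties=3)
timed_out = []

def worker():
    try:
        barrier.wait(timeout=0.5)  # healthy workers wait, then give up
    except threading.BrokenBarrierError:
        timed_out.append(True)

threads = [threading.Thread(target=worker) for _ in range(2)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(len(timed_out))  # 2: both healthy workers hit the broken barrier
```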
### Troubleshooting Stalled Barriers
If the Dashboard shows a barrier in a waiting state for an extended period:
- **Check worker logs:** Identify whether a specific `worker_id` has failed to reach the barrier.
- **Verify network latency:** p99 latency should be <50ms. High latency usually indicates network congestion between the worker and the Coordinator.
- **Manual task reset:** If a worker cannot be recovered, use the Task API to stop and restart the training task:

```shell
# Stop the stalled task
curl -X POST http://localhost:3000/api/tasks/{task_id}/stop
```
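The stop call can also be issued from Python. Note the assumptions here: only the `stop` action is documented above, so the `start` action and the `task_action_url` helper are hypothetical, shown purely to illustrate the URL shape:

```python
from urllib import request

API = "http://localhost:3000/api"

def task_action_url(task_id: str, action: str) -> str:
    """Build a Task API endpoint. 'stop' is documented; 'start' is an
    assumed counterpart for the restart step."""
    return f"{API}/tasks/{task_id}/{action}"

def post(url: str) -> None:
    # Fire a POST and ignore the response body
    request.urlopen(request.Request(url, method="POST"))

# post(task_action_url("task-42", "stop"))
# post(task_action_url("task-42", "start"))  # hypothetical endpoint
```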
## Checkpoint Recovery
Strata supports automatic recovery of training state. When a task restarts or a worker joins a running job, it must synchronize its state with the latest checkpoint.
### Manual Recovery via Python
If you need to manually trigger a recovery within a training script using the Python bindings:
```python
import torch
from dtruntime import CheckpointManager, TrainingOrchestrator

# Initialize the manager pointing at your S3/local storage
ckpt_manager = CheckpointManager(path="s3://my-checkpoint-bucket/model-alpha")

# Retrieve the latest valid checkpoint metadata
latest_info = ckpt_manager.get_latest()
if latest_info:
    print(f"Recovering from Epoch {latest_info.epoch}, Step {latest_info.step}")
    # load_state_dict expects a state dict, not a path
    model.load_state_dict(torch.load(latest_info.path))
```
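The semantics of `get_latest()` — newest *valid* checkpoint by step — can be sketched in plain Python. The `CheckpointInfo` fields below mirror the attributes used above; the `valid` flag and `pick_latest` helper are illustrative assumptions, not part of the `dtruntime` API:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class CheckpointInfo:
    epoch: int
    step: int
    path: str
    valid: bool  # e.g. upload completed and metadata was committed

def pick_latest(checkpoints) -> Optional[CheckpointInfo]:
    """Sketch of get_latest() semantics: the newest valid checkpoint
    by training step, or None if nothing usable exists."""
    usable = [c for c in checkpoints if c.valid]
    return max(usable, key=lambda c: c.step, default=None)

infos = [
    CheckpointInfo(1, 1000, "s3://bkt/ckpt-1000", True),
    CheckpointInfo(2, 2000, "s3://bkt/ckpt-2000", False),  # interrupted upload
    CheckpointInfo(1, 1500, "s3://bkt/ckpt-1500", True),
]
print(pick_latest(infos).step)  # 1500 — step 2000 is skipped as invalid
```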
### Recovery gRPC Failures

If a worker fails to receive a `RecoveryResponse`, check the following:
- **Storage connectivity:** Ensure the worker has `AWS_ACCESS_KEY_ID` and `AWS_SECRET_ACCESS_KEY` configured if using S3.
- **Bucket permissions:** The worker node requires `s3:GetObject` and `s3:ListBucket` permissions.
- **Coordinator state:** Ensure the Coordinator has registered the checkpoint. You can verify this at `/api/checkpoints`.
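A quick preflight for the first item above — checking that the worker's environment actually carries the S3 credentials — can be scripted. The `missing_s3_credentials` helper is an illustrative sketch, not part of Strata:

```python
import os

REQUIRED_S3_VARS = ("AWS_ACCESS_KEY_ID", "AWS_SECRET_ACCESS_KEY")

def missing_s3_credentials(env=os.environ):
    """Return the S3 credential variables the environment is missing.
    An empty list means the credentials look complete."""
    return [v for v in REQUIRED_S3_VARS if not env.get(v)]

# Example: only the access key ID is set
print(missing_s3_credentials({"AWS_ACCESS_KEY_ID": "AKIA..."}))
# ['AWS_SECRET_ACCESS_KEY']
```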
## Storage Bottlenecks
Strata provides high-performance async I/O, but underlying storage can still become a bottleneck.
### S3 Throughput Issues

If `checkpoint_throughput` drops below expected levels (e.g., <200 MB/s):
- **Multipart uploads:** Ensure large checkpoints are being written using the `AsyncCheckpointWriter`, which handles buffered streaming.
- **Region mismatch:** Verify that `AWS_REGION` matches the location of your S3 bucket to minimize cross-region latency.
- **Local disk cache:** If local I/O is slow, check the `disk_write_bytes` metric in the Dashboard to identify whether the worker's local NVMe is saturated.
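The reason multipart uploads help is that a checkpoint is streamed as independent parts instead of one giant PUT. This sketch shows the chunking arithmetic only; the 64 MiB part size is illustrative, not Strata's or S3's default:

```python
def multipart_chunks(total_bytes: int, part_size: int = 64 * 1024 * 1024):
    """Split a blob into (offset, length) parts, the way a buffered
    multipart uploader streams a checkpoint to S3."""
    offset = 0
    while offset < total_bytes:
        length = min(part_size, total_bytes - offset)
        yield (offset, length)
        offset += length

# A 200 MiB checkpoint becomes three 64 MiB parts plus an 8 MiB tail:
parts = list(multipart_chunks(200 * 1024 * 1024))
print(len(parts))  # 4
```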
## Environment & Configuration Errors
| Symptom | Possible Cause | Resolution |
| :--- | :--- | :--- |
| `ConnectionRefused` on worker startup | Coordinator is down or unreachable. | Check `COORDINATOR_ADDR` and ensure the gRPC port (50051) is open. |
| `AccessDenied` for S3 | Invalid AWS credentials. | Verify `.env` or IAM role permissions. |
| Shards not assigning | No dataset registered. | Ensure `dataset_id` is registered via the API/Dashboard before starting workers. |
| Dashboard shows "Disconnected" | HTTP API is unreachable. | Ensure the Coordinator is running and `VITE_API_URL` points to the correct address. |
## Logging for Debugging
Increase the log verbosity of the Rust crates to capture detailed trace events:
```shell
# Set log level to debug for the coordinator and storage crates
export RUST_LOG=coordinator=debug,storage=debug,runtime_core=info
cargo run -p coordinator
```