# Troubleshooting & Recovery

## Worker Failures and Health Monitoring
The Strata runtime uses a heartbeat-based system to track worker health. Workers are expected to check in with the Coordinator at regular intervals.
### Identifying Failed Workers

You can monitor worker status via the Dashboard or the HTTP API. A worker is marked as `failed` if it misses heartbeats for a duration exceeding the `heartbeat_timeout` (default: 30 seconds).
Via HTTP API:

```shell
curl http://localhost:3000/api/workers
```
Common statuses:

- `active`: Worker is sending heartbeats and performing tasks.
- `failed`: Heartbeat timed out; shards are being reassigned.
- `recovering`: Worker is pulling the latest checkpoint from storage.
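The failure rule above can be expressed in a few lines. This is an illustrative sketch, not the Coordinator's actual code; the timestamps and the `classify_worker` helper are assumptions for the example:

```python
HEARTBEAT_TIMEOUT = 30.0  # seconds; the default heartbeat_timeout

def classify_worker(last_heartbeat_ts: float, now: float) -> str:
    """Mirror the Coordinator's failure rule: a worker whose last
    heartbeat is older than the timeout is marked failed."""
    return "failed" if (now - last_heartbeat_ts) > HEARTBEAT_TIMEOUT else "active"

# Example: one worker heartbeated 5s ago, another 45s ago
now = 1_000_000.0
print(classify_worker(now - 5.0, now))   # active
print(classify_worker(now - 45.0, now))  # failed
```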
### Automatic Shard Rebalancing
When a worker is marked as failed, the Coordinator uses Consistent Hashing to reassign its shards to the remaining healthy workers. This ensures training continues without manual intervention.
- **Impact:** You will see `assigned_shards` increase on healthy nodes.
- **Performance:** Shard reassignment typically completes in <10ms for 1,000 workers.
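To see why consistent hashing keeps reassignment cheap, here is a toy hash ring — not Strata's implementation. Removing a worker only moves the shards that worker owned; everything else stays put:

```python
import hashlib
from bisect import bisect_right

def _hash(key: str) -> int:
    return int(hashlib.md5(key.encode()).hexdigest(), 16)

class HashRing:
    """Toy consistent-hash ring: each shard maps to the first worker
    clockwise of its hash point on the ring."""
    def __init__(self, workers, vnodes=64):
        self.vnodes = vnodes
        self.ring = []  # sorted list of (point, worker)
        for w in workers:
            self.add(w)

    def add(self, worker):
        for i in range(self.vnodes):
            self.ring.append((_hash(f"{worker}#{i}"), worker))
        self.ring.sort()

    def remove(self, worker):
        self.ring = [(p, w) for p, w in self.ring if w != worker]

    def owner(self, shard: str) -> str:
        points = [p for p, _ in self.ring]
        idx = bisect_right(points, _hash(shard)) % len(self.ring)
        return self.ring[idx][1]

ring = HashRing(["worker-a", "worker-b", "worker-c"])
before = {s: ring.owner(s) for s in (f"shard-{i}" for i in range(100))}
ring.remove("worker-b")  # simulate a failed worker
after = {s: ring.owner(s) for s in before}
moved = [s for s in before if before[s] != after[s]]
# Only shards previously owned by worker-b were reassigned:
assert all(before[s] == "worker-b" for s in moved)
```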
## Barrier Synchronization Issues
Barrier synchronization ensures all workers reach the same training step before proceeding. If one worker hangs or crashes, the entire job may stall at a barrier.
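The stall behavior is easy to reproduce with a standard-library barrier. In this sketch, three workers are expected but only two arrive, so the healthy workers time out waiting — which is what a stalled barrier looks like from their side:

```python
import threading

# Three parties expected at the barrier, but one worker "crashed"
barrier = threading.Barrier(parties=3)
timed_out = []

def worker():
    try:
        barrier.wait(timeout=0.5)  # healthy workers wait, then give up
    except threading.BrokenBarrierError:
        timed_out.append(True)

threads = [threading.Thread(target=worker) for _ in range(2)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(len(timed_out))  # 2: both healthy workers hit the broken barrier
```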
### Troubleshooting Stalled Barriers
If the Dashboard shows a barrier in a waiting state for an extended period:
- **Check worker logs:** Identify whether a specific `worker_id` has failed to reach the barrier.
- **Verify network latency:** p99 latency should be <50ms. High latency usually indicates network congestion between the worker and the Coordinator.
- **Manual task reset:** If a worker cannot be recovered, use the Task API to stop and restart the training task:

```shell
# Stop the stalled task
curl -X POST http://localhost:3000/api/tasks/{task_id}/stop
```
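The stop call can also be issued from Python. Note the assumptions here: only the `stop` action is documented above, so the `start` action and the `task_action_url` helper are hypothetical, shown purely to illustrate the URL shape:

```python
from urllib import request

API = "http://localhost:3000/api"

def task_action_url(task_id: str, action: str) -> str:
    """Build a Task API endpoint. 'stop' is documented; 'start' is an
    assumed counterpart for the restart step."""
    return f"{API}/tasks/{task_id}/{action}"

def post(url: str) -> None:
    # Fire a POST and ignore the response body
    request.urlopen(request.Request(url, method="POST"))

# post(task_action_url("task-42", "stop"))
# post(task_action_url("task-42", "start"))  # hypothetical endpoint
```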
## Checkpoint Recovery
Strata supports automatic recovery of training state. When a task restarts or a worker joins a running job, it must synchronize its state with the latest checkpoint.
### Manual Recovery via Python
If you need to manually trigger a recovery within a training script using the Python bindings:
```python
import torch
from dtruntime import CheckpointManager, TrainingOrchestrator

# Initialize the manager pointing at your S3/local storage
ckpt_manager = CheckpointManager(path="s3://my-checkpoint-bucket/model-alpha")

# Retrieve the latest valid checkpoint metadata
latest_info = ckpt_manager.get_latest()
if latest_info:
    print(f"Recovering from Epoch {latest_info.epoch}, Step {latest_info.step}")
    # load_state_dict expects a state dict, not a path
    model.load_state_dict(torch.load(latest_info.path))
```
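The semantics of `get_latest()` — newest *valid* checkpoint by step — can be sketched in plain Python. The `CheckpointInfo` fields below mirror the attributes used above; the `valid` flag and `pick_latest` helper are illustrative assumptions, not part of the `dtruntime` API:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class CheckpointInfo:
    epoch: int
    step: int
    path: str
    valid: bool  # e.g. upload completed and metadata was committed

def pick_latest(checkpoints) -> Optional[CheckpointInfo]:
    """Sketch of get_latest() semantics: the newest valid checkpoint
    by training step, or None if nothing usable exists."""
    usable = [c for c in checkpoints if c.valid]
    return max(usable, key=lambda c: c.step, default=None)

infos = [
    CheckpointInfo(1, 1000, "s3://bkt/ckpt-1000", True),
    CheckpointInfo(2, 2000, "s3://bkt/ckpt-2000", False),  # interrupted upload
    CheckpointInfo(1, 1500, "s3://bkt/ckpt-1500", True),
]
print(pick_latest(infos).step)  # 1500 — step 2000 is skipped as invalid
```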
### Recovery gRPC Failures

If a worker fails to receive a `RecoveryResponse`, check the following:
- **Storage connectivity:** Ensure the worker has `AWS_ACCESS_KEY_ID` and `AWS_SECRET_ACCESS_KEY` configured if using S3.
- **Bucket permissions:** The worker node requires `s3:GetObject` and `s3:ListBucket` permissions.
- **Coordinator state:** Ensure the Coordinator has registered the checkpoint. You can verify this at `/api/checkpoints`.
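A quick preflight for the first item above — checking that the worker's environment actually carries the S3 credentials — can be scripted. The `missing_s3_credentials` helper is an illustrative sketch, not part of Strata:

```python
import os

REQUIRED_S3_VARS = ("AWS_ACCESS_KEY_ID", "AWS_SECRET_ACCESS_KEY")

def missing_s3_credentials(env=os.environ):
    """Return the S3 credential variables the environment is missing.
    An empty list means the credentials look complete."""
    return [v for v in REQUIRED_S3_VARS if not env.get(v)]

# Example: only the access key ID is set
print(missing_s3_credentials({"AWS_ACCESS_KEY_ID": "AKIA..."}))
# ['AWS_SECRET_ACCESS_KEY']
```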
## Storage Bottlenecks
Strata provides high-performance async I/O, but underlying storage can still become a bottleneck.
### S3 Throughput Issues

If `checkpoint_throughput` drops below expected levels (e.g., <200 MB/s):
- **Multipart uploads:** Ensure large checkpoints are being written using the `AsyncCheckpointWriter`, which handles buffered streaming.
- **Region mismatch:** Verify that `AWS_REGION` matches the location of your S3 bucket to minimize cross-region latency.
- **Local disk cache:** If local I/O is slow, check the `disk_write_bytes` metric in the Dashboard to identify whether the worker's local NVMe is saturated.
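The reason multipart uploads help is that a checkpoint is streamed as independent parts instead of one giant PUT. This sketch shows the chunking arithmetic only; the 64 MiB part size is illustrative, not Strata's or S3's default:

```python
def multipart_chunks(total_bytes: int, part_size: int = 64 * 1024 * 1024):
    """Split a blob into (offset, length) parts, the way a buffered
    multipart uploader streams a checkpoint to S3."""
    offset = 0
    while offset < total_bytes:
        length = min(part_size, total_bytes - offset)
        yield (offset, length)
        offset += length

# A 200 MiB checkpoint becomes three 64 MiB parts plus an 8 MiB tail:
parts = list(multipart_chunks(200 * 1024 * 1024))
print(len(parts))  # 4
```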
## Environment & Configuration Errors
| Symptom | Possible Cause | Resolution |
| :--- | :--- | :--- |
| `ConnectionRefused` on worker startup | Coordinator is down or unreachable. | Check `COORDINATOR_ADDR` and ensure the gRPC port (50051) is open. |
| `AccessDenied` for S3 | Invalid AWS credentials. | Verify `.env` or IAM role permissions. |
| Shards not assigning | No dataset registered. | Ensure `dataset_id` is registered via the API/Dashboard before starting workers. |
| Dashboard shows "Disconnected" | HTTP API is unreachable. | Ensure the Coordinator is running and `VITE_API_URL` points to the correct address. |
## Logging for Debugging
Increase the log verbosity of the Rust crates to capture detailed trace events:
```shell
# Set log level to debug for the coordinator and storage crates
export RUST_LOG=coordinator=debug,storage=debug,runtime_core=info
cargo run -p coordinator
```