Metrics & Log Aggregation
Strata provides built-in observability for distributed training runs by aggregating system-level metrics and structured logs from the coordinator and all connected workers. These are exposed via a REST API for integration with the React dashboard or external monitoring stacks (e.g., Prometheus/Grafana).
Performance Metrics
The system tracks high-frequency performance data and resource utilization to identify bottlenecks in data loading or checkpointing.
System-Wide Metrics
The coordinator aggregates global performance data, accessible via the GET /api/metrics endpoint.
| Metric | Description |
|--------|-------------|
| checkpoint_throughput | Aggregated write speed to storage backends (MB/s). |
| coordinator_rps | Number of gRPC requests handled by the coordinator per second. |
| active_workers | Count of workers currently in Training or Checkpointing states. |
| barrier_latency_p99 | 99th percentile latency for worker synchronization barriers. |
| shard_assignment_time | Latency for the consistent hashing algorithm to rebalance data shards. |
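As an illustration, a response from GET /api/metrics could look like the following. The field names mirror the table above, but the exact shape and units of MetricsResponse are assumptions, not a documented contract:

```json
{
  "checkpoint_throughput": 812.4,
  "coordinator_rps": 1450.0,
  "active_workers": 32,
  "barrier_latency_p99": 18.7,
  "shard_assignment_time": 4.2
}
```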
Worker Resource Monitoring
Each worker reports its local resource consumption during heartbeats. This data is available in the WorkerResponse object.
```rust
// Captured per-worker and aggregated by the coordinator
pub struct ResourceMetrics {
    pub cpu_percent: f64,
    pub memory_used_bytes: u64,
    pub gpu_metrics: Vec<GpuMetrics>, // Includes utilization and memory per card
    pub disk_read_bytes: u64,
    pub disk_write_bytes: u64,
    pub network_rx_bytes: u64,
    pub network_tx_bytes: u64,
}
```
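To make the aggregation concrete, here is a minimal sketch of how the coordinator might roll per-worker reports up into a cluster-wide GPU utilization figure. The GpuMetrics fields and the cluster_gpu_utilization helper are illustrative assumptions; only ResourceMetrics itself appears in the definition above:

```rust
// Assumed shape of GpuMetrics; the real type may expose more fields.
pub struct GpuMetrics {
    pub utilization_percent: f64,
    pub memory_used_bytes: u64,
}

pub struct ResourceMetrics {
    pub cpu_percent: f64,
    pub memory_used_bytes: u64,
    pub gpu_metrics: Vec<GpuMetrics>,
    pub disk_read_bytes: u64,
    pub disk_write_bytes: u64,
    pub network_rx_bytes: u64,
    pub network_tx_bytes: u64,
}

/// Mean GPU utilization across every card reported by every worker.
/// Returns None when no worker reported any GPUs.
fn cluster_gpu_utilization(workers: &[ResourceMetrics]) -> Option<f64> {
    let gpus: Vec<&GpuMetrics> = workers
        .iter()
        .flat_map(|w| w.gpu_metrics.iter())
        .collect();
    if gpus.is_empty() {
        return None;
    }
    let total: f64 = gpus.iter().map(|g| g.utilization_percent).sum();
    Some(total / gpus.len() as f64)
}
```

Averaging per card rather than per worker keeps heterogeneous nodes (e.g., 1-GPU and 8-GPU hosts) from being weighted equally.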
Log Aggregation
Strata supports centralized logging, allowing you to view stdout/stderr from across the cluster in a single stream. Logs are categorized by their source and can be filtered by specific training tasks.
Log Levels
Logging is controlled by the RUST_LOG environment variable. Standard levels include:
- error: Critical failures (e.g., S3 connection lost, worker crash).
- warn: Recoverable issues (e.g., retrying a failed shard download).
- info: Standard lifecycle events (e.g., barrier reached, checkpoint saved).
- debug: Detailed internal state (e.g., consistent hash ring updates).
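RUST_LOG accepts standard per-module filter directives. As a hedged example (the strata crate name and coordinator binary name are assumptions):

```shell
# Info-level logging globally, debug-level for the strata modules
RUST_LOG=info,strata=debug ./coordinator
```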
Structured Log Format
Logs retrieved via the API follow a structured JSON format to simplify parsing:
```json
{
  "id": "log_8f2d1b",
  "timestamp": 1715832000,
  "level": "info",
  "message": "Checkpoint 'ckpt_v12' persisted to S3 in 450ms",
  "source": "worker",
  "task_id": "training_run_001",
  "worker_id": "gpu-node-05"
}
```
API Access
You can programmatically query the state of the system using the following HTTP endpoints on the coordinator:
GET /api/metrics
Returns current system throughput and coordinator health.
- Response Type: MetricsResponse

GET /api/logs
Returns a list of recent log entries from all sources.
- Query Parameters: limit (default: 100)
- Response Type: Vec<LogResponse>

GET /api/tasks/:task_id/logs
Filters logs specifically for a single training job or data sharding task.
- Response Type: Vec<LogResponse>
GET /api/dashboard
A "God-view" endpoint that returns the complete system state, including metrics, active workers, current checkpoints, and recent logs in a single payload. Used primarily by the frontend to minimize network round-trips.
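Assuming the coordinator serves its REST API over plain HTTP on localhost port 8080 (the host and port are assumptions, not documented here), the endpoints above can be exercised directly with curl:

```shell
# Fetch the 50 most recent log entries from all sources
curl "http://localhost:8080/api/logs?limit=50"

# Logs scoped to one training task
curl "http://localhost:8080/api/tasks/training_run_001/logs"

# Single-payload snapshot of the full system state
curl "http://localhost:8080/api/dashboard"
```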
Dashboard Visualization
The included React dashboard provides real-time visualization of these metrics:
- Throughput Graphs: Live charts showing checkpointing speed and API requests per second.
- Worker Heatmap: A visual grid of worker status (Active, Idle, Failed) and their current GPU utilization.
- Live Log Tail: A searchable console that streams logs from the coordinator and workers simultaneously.
- Data Preview: A visualization of sample features and labels currently being processed by specific workers to verify data sharding logic.