Metrics & Log Aggregation
Strata provides built-in observability for distributed training runs by aggregating system-level metrics and structured logs from the coordinator and all connected workers. These are exposed via a REST API for integration with the React dashboard or external monitoring stacks (e.g., Prometheus/Grafana).
Performance Metrics
The system tracks high-frequency performance data and resource utilization to identify bottlenecks in data loading or checkpointing.
System-Wide Metrics
The coordinator aggregates global performance data, accessible via the GET /api/metrics endpoint.
| Metric | Description |
|--------|-------------|
| checkpoint_throughput | Aggregated write speed to storage backends (MB/s). |
| coordinator_rps | Number of gRPC requests handled by the coordinator per second. |
| active_workers | Count of workers currently in Training or Checkpointing states. |
| barrier_latency_p99 | 99th percentile latency for worker synchronization barriers. |
| shard_assignment_time | Latency for the consistent hashing algorithm to rebalance data shards. |
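As an illustration, a response from GET /api/metrics could look like the following. The field names mirror the table above, but the exact shape and units of MetricsResponse are assumptions, not a documented contract:

```json
{
  "checkpoint_throughput": 812.4,
  "coordinator_rps": 1450.0,
  "active_workers": 32,
  "barrier_latency_p99": 18.7,
  "shard_assignment_time": 4.2
}
```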
Worker Resource Monitoring
Each worker reports its local resource consumption during heartbeats. This data is available in the WorkerResponse object.
```rust
// Captured per-worker and aggregated by the coordinator
pub struct ResourceMetrics {
    pub cpu_percent: f64,
    pub memory_used_bytes: u64,
    pub gpu_metrics: Vec<GpuMetrics>, // Includes utilization and memory per card
    pub disk_read_bytes: u64,
    pub disk_write_bytes: u64,
    pub network_rx_bytes: u64,
    pub network_tx_bytes: u64,
}
```
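To make the aggregation concrete, here is a minimal sketch of how the coordinator might roll per-worker reports up into a cluster-wide GPU utilization figure. The GpuMetrics fields and the cluster_gpu_utilization helper are illustrative assumptions; only ResourceMetrics itself appears in the definition above:

```rust
// Assumed shape of GpuMetrics; the real type may expose more fields.
pub struct GpuMetrics {
    pub utilization_percent: f64,
    pub memory_used_bytes: u64,
}

pub struct ResourceMetrics {
    pub cpu_percent: f64,
    pub memory_used_bytes: u64,
    pub gpu_metrics: Vec<GpuMetrics>,
    pub disk_read_bytes: u64,
    pub disk_write_bytes: u64,
    pub network_rx_bytes: u64,
    pub network_tx_bytes: u64,
}

/// Mean GPU utilization across every card reported by every worker.
/// Returns None when no worker reported any GPUs.
fn cluster_gpu_utilization(workers: &[ResourceMetrics]) -> Option<f64> {
    let gpus: Vec<&GpuMetrics> = workers
        .iter()
        .flat_map(|w| w.gpu_metrics.iter())
        .collect();
    if gpus.is_empty() {
        return None;
    }
    let total: f64 = gpus.iter().map(|g| g.utilization_percent).sum();
    Some(total / gpus.len() as f64)
}
```

Averaging per card rather than per worker keeps heterogeneous nodes (e.g., 1-GPU and 8-GPU hosts) from being weighted equally.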
Log Aggregation
Strata supports centralized logging, allowing you to view stdout/stderr from across the cluster in a single stream. Logs are categorized by their source and can be filtered by specific training tasks.
Log Levels
Logging is controlled by the RUST_LOG environment variable. Standard levels include:
- error: Critical failures (e.g., S3 connection lost, worker crash).
- warn: Recoverable issues (e.g., retrying a failed shard download).
- info: Standard lifecycle events (e.g., barrier reached, checkpoint saved).
- debug: Detailed internal state (e.g., consistent hash ring updates).
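RUST_LOG accepts standard per-module filter directives. As a hedged example (the strata crate name and coordinator binary name are assumptions):

```shell
# Info-level logging globally, debug-level for the strata modules
RUST_LOG=info,strata=debug ./coordinator
```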
Structured Log Format
Logs retrieved via the API follow a structured JSON format to simplify parsing:
```json
{
  "id": "log_8f2d1b",
  "timestamp": 1715832000,
  "level": "info",
  "message": "Checkpoint 'ckpt_v12' persisted to S3 in 450ms",
  "source": "worker",
  "task_id": "training_run_001",
  "worker_id": "gpu-node-05"
}
```
API Access
You can programmatically query the state of the system using the following HTTP endpoints on the coordinator:
GET /api/metrics
Returns current system throughput and coordinator health.
- Response Type: MetricsResponse

GET /api/logs
Returns a list of recent log entries from all sources.
- Query Parameters: limit (default: 100)
- Response Type: Vec<LogResponse>

GET /api/tasks/:task_id/logs
Filters logs specifically for a single training job or data sharding task.
- Response Type: Vec<LogResponse>
GET /api/dashboard
A "God-view" endpoint that returns the complete system state, including metrics, active workers, current checkpoints, and recent logs in a single payload. Used primarily by the frontend to minimize network round-trips.
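Assuming the coordinator serves its REST API over plain HTTP on localhost port 8080 (the host and port are assumptions, not documented here), the endpoints above can be exercised directly with curl:

```shell
# Fetch the 50 most recent log entries from all sources
curl "http://localhost:8080/api/logs?limit=50"

# Logs scoped to one training task
curl "http://localhost:8080/api/tasks/training_run_001/logs"

# Single-payload snapshot of the full system state
curl "http://localhost:8080/api/dashboard"
```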
Dashboard Visualization
The included React dashboard provides real-time visualization of these metrics:
- Throughput Graphs: Live charts showing checkpointing speed and API requests per second.
- Worker Heatmap: A visual grid of worker status (Active, Idle, Failed) and their current GPU utilization.
- Live Log Tail: A searchable console that streams logs from the coordinator and workers simultaneously.
- Data Preview: A visualization of sample features and labels currently being processed by specific workers to verify data sharding logic.