Real-time Dashboard
The Strata Real-time Dashboard is a centralized observability platform for monitoring and managing distributed training jobs. Built with React and Tailwind CSS, it provides a high-fidelity view of the coordinator’s state, worker health, and data distribution metrics.
Accessing the Dashboard
By default, the dashboard is served alongside the coordinator.
- Start the Coordinator: Run `cargo run -p coordinator` or use the provided Docker Compose setup.
- Open URL: Navigate to `http://localhost:3000` in your browser.
Dashboard Modes
- Demo Mode: Uses simulated data to showcase UI capabilities without a running cluster.
- Live Mode: Connects directly to the Coordinator's REST API to display real-time cluster state.
Key Observability Modules
1. System Metrics
The header and main overview panel display aggregate performance indicators derived from the ApiMetrics interface:
- Checkpoint Throughput: Real-time write speeds to the storage backend (Local or S3).
- Coordinator RPS: The number of gRPC requests handled by the coordinator per second.
- Barrier Latency (p99): The tail latency for worker synchronization, critical for identifying stragglers.
- Shard Assignment Time: Latency for the consistent hashing algorithm to distribute data.
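To make the p99 figure concrete, here is a minimal sketch of a nearest-rank percentile over barrier wait samples. The helper name `percentile` and the sample values are illustrative assumptions; the coordinator may compute its tail latency differently (for example with a histogram).

```typescript
// Nearest-rank percentile: the smallest sample such that at least p% of
// the samples are less than or equal to it. Hypothetical helper, not Strata code.
function percentile(samples: number[], p: number): number {
  if (samples.length === 0) {
    throw new Error("percentile of an empty sample set is undefined");
  }
  if (p <= 0 || p > 100) {
    throw new Error("p must be in the range (0, 100]");
  }
  const sorted = [...samples].sort((a, b) => a - b);
  const rank = Math.ceil((p / 100) * sorted.length); // 1-based rank
  return sorted[rank - 1];
}

// Seven fast workers and one straggler (made-up numbers, in milliseconds).
const barrierWaitMs = [12, 11, 13, 12, 14, 11, 12, 240];
const p50 = percentile(barrierWaitMs, 50); // 12
const p99 = percentile(barrierWaitMs, 99); // 240
```

The median stays at 12 ms while the p99 jumps to 240 ms, which is why the tail value is the one to watch when hunting stragglers.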
2. Worker Fleet Management
This view tracks the lifecycle of every worker registered in the WorkerRegistry.
- Status Tracking: Monitors whether workers are `Active`, `Idle`, `Checkpointing`, or `Recovering`.
- Resource Utilization: Displays GPU count and hardware metrics reported via heartbeats.
- Heartbeat Monitor: Visualizes the `last_heartbeat` timestamp to help identify network partitions or worker crashes.
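The heartbeat check can be sketched as a simple staleness test. Only `last_heartbeat` is named above; the `WorkerHeartbeat` shape, the ISO 8601 timestamp format, and the 30-second timeout are assumptions made for illustration.

```typescript
// Hypothetical shape: only `last_heartbeat` appears in the description above.
interface WorkerHeartbeat {
  worker_id: string;
  last_heartbeat: string; // assumed to be an ISO 8601 timestamp
}

// Returns the IDs of workers whose last heartbeat is older than `timeoutMs`.
function findStaleWorkers(
  workers: WorkerHeartbeat[],
  nowMs: number,
  timeoutMs: number,
): string[] {
  return workers
    .filter((w) => nowMs - Date.parse(w.last_heartbeat) > timeoutMs)
    .map((w) => w.worker_id);
}

const checkedAt = Date.parse("2024-01-01T00:01:00Z");
const staleWorkers = findStaleWorkers(
  [
    { worker_id: "worker-0", last_heartbeat: "2024-01-01T00:00:55Z" }, // 5 s old
    { worker_id: "worker-1", last_heartbeat: "2024-01-01T00:00:10Z" }, // 50 s old
  ],
  checkedAt,
  30000, // assumed timeout: 30 seconds
);
// staleWorkers is ["worker-1"]
```

Passing the current time in as a parameter keeps the function deterministic, which makes it easy to test without mocking the clock.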
3. Data Distribution & Sharding
The dashboard visualizes how the ShardManager partitions datasets across the cluster:
- Shard Assignments: Shows which shards (IDs and sample ranges) are assigned to specific workers.
- Epoch Progress: Tracks the current epoch and step for each worker.
- Data Preview: A specialized component that provides a representative view of the dataset (e.g., ImageNet labels or time-series features) currently being processed.
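For readers unfamiliar with consistent hashing, the sketch below shows the general technique on a hash ring with virtual nodes. It is not the ShardManager implementation: the FNV-1a hash, the virtual-node count, and the names `buildRing` and `ownerOf` are all assumptions chosen to keep the example small.

```typescript
// FNV-1a 32-bit hash over UTF-16 code units; any stable hash would do here.
function fnv1a(key: string): number {
  let hash = 0x811c9dc5;
  for (let i = 0; i < key.length; i++) {
    hash ^= key.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193);
  }
  return hash >>> 0; // force an unsigned 32-bit result
}

type RingNode = [number, string]; // [position on the ring, worker ID]

// Places `vnodes` virtual nodes per worker on the ring, sorted by position.
function buildRing(workers: string[], vnodes: number): RingNode[] {
  const ring: RingNode[] = [];
  for (const worker of workers) {
    for (let v = 0; v < vnodes; v++) {
      ring.push([fnv1a(`${worker}#${v}`), worker]);
    }
  }
  return ring.sort((a, b) => a[0] - b[0]);
}

// A shard belongs to the first virtual node at or after its hash,
// wrapping around to the start of the ring if none is found.
function ownerOf(ring: RingNode[], shardId: string): string {
  if (ring.length === 0) {
    throw new Error("cannot assign shards to an empty ring");
  }
  const h = fnv1a(shardId);
  const node = ring.find(([pos]) => pos >= h) ?? ring[0];
  return node[1];
}

const ring = buildRing(["worker-0", "worker-1", "worker-2"], 16);
const assignments = ["shard-0", "shard-1", "shard-2", "shard-3"].map(
  (id) => [id, ownerOf(ring, id)] as [string, string],
);
```

The property that matters for a training cluster is that removing one worker only reassigns the shards that worker owned; shards held by the remaining workers stay where they are.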
4. Checkpoint History
Monitor the persistence layer as training progresses:
- Persistence Status: Tracks checkpoints as they transition from `in_progress` to `completed`.
- Storage Metadata: Displays the S3/Local path, file size, and the specific worker that authored the checkpoint.
Task Management
The dashboard allows for remote orchestration of training tasks through the Tasks panel.
Starting a Task
You can submit a CreateTaskRequest via the UI by specifying:
{
  "name": "imagenet-resnet50",
  "type": "training",
  "dataset_id": "ds-001",
  "worker_count": 8,
  "config": {
    "learning_rate": 0.001,
    "batch_size": 32
  }
}
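If you build the same payload in code, a typed object plus a few client-side checks can catch obvious mistakes before submission. The field names below follow the example above; the `validateTask` helper and its rules are illustrative assumptions, not the coordinator's actual validation.

```typescript
// Field names follow the example payload; the authoritative schema lives in
// the coordinator and may include additional fields.
interface CreateTaskRequest {
  name: string;
  type: string;
  dataset_id: string;
  worker_count: number;
  config: Record<string, number | string | boolean>;
}

// Returns a list of problems; an empty list means the request looks sane.
// These checks are assumptions made for illustration.
function validateTask(req: CreateTaskRequest): string[] {
  const problems: string[] = [];
  if (req.name.trim() === "") problems.push("name must not be empty");
  if (req.dataset_id.trim() === "") problems.push("dataset_id must not be empty");
  if (!Number.isInteger(req.worker_count) || req.worker_count < 1) {
    problems.push("worker_count must be a positive integer");
  }
  return problems;
}

const task: CreateTaskRequest = {
  name: "imagenet-resnet50",
  type: "training",
  dataset_id: "ds-001",
  worker_count: 8,
  config: { learning_rate: 0.001, batch_size: 32 },
};

const taskProblems = validateTask(task); // empty for the example payload
const requestBody = JSON.stringify(task); // JSON text to send as the POST body
```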
Log Streaming
The dashboard provides a real-time log aggregator that filters messages by task_id or worker_id. Levels (INFO, WARN, ERROR) are color-coded for rapid debugging of distributed failures.
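The filtering behaviour can be sketched as a pure function over log entries. The `LogEntry` and `LogFilter` shapes are hypothetical; only `task_id`, `worker_id`, and the three levels come from the description above.

```typescript
type LogLevel = "INFO" | "WARN" | "ERROR";

// Hypothetical entry shape for illustration.
interface LogEntry {
  level: LogLevel;
  task_id: string;
  worker_id: string;
  message: string;
}

// Any field left undefined is treated as "match everything".
interface LogFilter {
  task_id?: string;
  worker_id?: string;
}

function filterLogs(entries: LogEntry[], filter: LogFilter): LogEntry[] {
  return entries.filter(
    (e) =>
      (filter.task_id === undefined || e.task_id === filter.task_id) &&
      (filter.worker_id === undefined || e.worker_id === filter.worker_id),
  );
}

const sampleLogs: LogEntry[] = [
  { level: "INFO", task_id: "task-1", worker_id: "worker-0", message: "step 100" },
  { level: "ERROR", task_id: "task-1", worker_id: "worker-1", message: "barrier timeout" },
  { level: "WARN", task_id: "task-2", worker_id: "worker-1", message: "slow checkpoint" },
];

const taskOneLogs = filterLogs(sampleLogs, { task_id: "task-1" }); // first two entries
const workerOneTaskOne = filterLogs(sampleLogs, { task_id: "task-1", worker_id: "worker-1" }); // the ERROR entry
```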
Configuration & API
The dashboard communicates with the coordinator via a RESTful API (defined in crates/coordinator/src/http_api.rs).
Environment Variables
To point the dashboard to a specific coordinator instance, set the following variable during build or deployment:
| Variable | Description | Default |
| :--- | :--- | :--- |
| VITE_API_URL | The base URL for the Coordinator REST API | /api |
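A minimal sketch of how a client might resolve the base URL with the documented default. The helper names `resolveApiBase` and `apiUrl` and the host `coordinator.internal` are hypothetical; the dashboard's own code may resolve the variable differently.

```typescript
// Uses VITE_API_URL when it is set and non-empty, otherwise falls back to
// the documented default "/api". Trailing slashes are removed.
function resolveApiBase(env: Record<string, string | undefined>): string {
  const raw = env["VITE_API_URL"];
  const base = raw !== undefined && raw.trim() !== "" ? raw.trim() : "/api";
  return base.replace(/\/+$/, "");
}

// Joins the base with a path relative to it, avoiding doubled slashes.
function apiUrl(base: string, path: string): string {
  return `${base}/${path.replace(/^\/+/, "")}`;
}

const defaultBase = resolveApiBase({}); // "/api"
const customBase = resolveApiBase({
  VITE_API_URL: "http://coordinator.internal:3000/api/", // hypothetical host
}); // "http://coordinator.internal:3000/api"
const workersUrl = apiUrl(defaultBase, "/workers"); // "/api/workers"
```

In a Vite application the environment object would be `import.meta.env`, where Vite exposes variables prefixed with `VITE_` at build time.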
Primary API Endpoints
If you wish to consume the dashboard data programmatically, the following endpoints are available:
| Endpoint | Method | Description |
| :--- | :--- | :--- |
| /api/status | GET | Overall coordinator health and uptime. |
| /api/dashboard | GET | Returns the full DashboardState (metrics, workers, tasks). |
| /api/workers | GET | Detailed list of all registered workers. |
| /api/tasks | POST | Submit a new training task to the cluster. |
| /api/tasks/:id/stop | POST | Gracefully terminate a running task. |
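The endpoints above can be called from any HTTP client. The sketch below wraps two of them behind an injected fetch-like function so the logic can be exercised without a running coordinator. The `Fetcher` type and the helper names are assumptions, and the response body is left as `unknown` because the DashboardState schema is not reproduced here.

```typescript
// Minimal subset of the Fetch API used here, so a stub can be injected.
interface HttpResponse {
  ok: boolean;
  status: number;
  json(): Promise<unknown>;
}
type Fetcher = (url: string, init?: { method?: string }) => Promise<HttpResponse>;

// GET {base}/dashboard and return the parsed body (the DashboardState).
async function getDashboard(fetcher: Fetcher, base: string): Promise<unknown> {
  const url = `${base}/dashboard`;
  const res = await fetcher(url);
  if (!res.ok) throw new Error(`GET ${url} failed with status ${res.status}`);
  return res.json();
}

// POST {base}/tasks/:id/stop for the given task ID.
async function stopTask(fetcher: Fetcher, base: string, taskId: string): Promise<void> {
  const url = `${base}/tasks/${encodeURIComponent(taskId)}/stop`;
  const res = await fetcher(url, { method: "POST" });
  if (!res.ok) throw new Error(`POST ${url} failed with status ${res.status}`);
}
```

In a browser or a recent Node.js runtime, pass `(url, init) => fetch(url, init)` as the fetcher and `"/api"` (or your `VITE_API_URL` value) as the base.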