Dashboard REST API
The Strata Coordinator provides a RESTful HTTP API primarily designed for the web dashboard, but it is also available for custom monitoring, CLI tools, and automation scripts. The API allows you to query the state of the distributed cluster, monitor performance metrics, and manage training tasks.
Base Configuration
By default, the REST API is served by the coordinator. In a local development environment, the base URL is:
http://localhost:3000/api
All responses are returned in JSON format.
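As a minimal sketch of how a script might address the API, the helper below joins the base URL with an endpoint path and decodes the JSON body. The names `BASE_URL`, `endpoint`, and `get_json` are hypothetical, and the sketch assumes the endpoints need no authentication, which this section does not specify.

```python
import json
import urllib.request

# Local development base URL from this section; other deployments will differ.
BASE_URL = "http://localhost:3000/api"


def endpoint(path: str, base_url: str = BASE_URL) -> str:
    """Join the base URL and an endpoint path with exactly one slash between them."""
    return base_url.rstrip("/") + "/" + path.lstrip("/")


def get_json(path: str, base_url: str = BASE_URL, timeout: float = 5.0):
    """Send a GET request and decode the JSON response body.

    Assumption: no authentication headers are required.
    """
    with urllib.request.urlopen(endpoint(path, base_url), timeout=timeout) as resp:
        return json.load(resp)
```

With a coordinator running locally, `get_json("/health")` would fetch the health check described below.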
System Endpoints
Health Check
GET /api/health
Returns the connectivity status of the coordinator.
Response:
{
  "status": "ok"
}
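A startup script may want to wait for the coordinator before doing anything else. The sketch below retries the health check a few times; `is_healthy` and `wait_until_healthy` are hypothetical names, and `fetch` stands for any callable that returns the decoded JSON of `GET /api/health`.

```python
import time
from typing import Callable


def is_healthy(payload: dict) -> bool:
    """True when the health payload reports the documented "ok" status."""
    return payload.get("status") == "ok"


def wait_until_healthy(fetch: Callable[[], dict], attempts: int = 5, delay: float = 1.0) -> bool:
    """Call fetch() up to `attempts` times, sleeping `delay` seconds between tries."""
    for attempt in range(attempts):
        try:
            if is_healthy(fetch()):
                return True
        except OSError:
            # Connection failures (including urllib's URLError) are OSError subclasses.
            pass
        if attempt < attempts - 1:
            time.sleep(delay)
    return False
```

Injecting `fetch` keeps the retry logic independent of the HTTP library and easy to exercise offline.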
Coordinator Status
GET /api/status
Returns the coordinator node's connection state, address, uptime, and version.
Response:
{
  "connected": true,
  "address": "0.0.0.0:50051",
  "uptime": 3600,
  "version": "0.1.0"
}
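For a CLI tool, the status payload can be condensed into one line. This sketch assumes `uptime` is expressed in seconds, which the section does not state; `summarize_status` is a hypothetical helper.

```python
from datetime import timedelta


def summarize_status(status: dict) -> str:
    """Render the /api/status payload as a single human-readable line.

    Assumption: "uptime" is a number of seconds.
    """
    uptime = timedelta(seconds=status["uptime"])
    state = "connected" if status["connected"] else "disconnected"
    return f"coordinator {status['version']} at {status['address']}: {state}, up {uptime}"


# Offline example using the sample response shown above.
sample_status = {"connected": True, "address": "0.0.0.0:50051", "uptime": 3600, "version": "0.1.0"}
print(summarize_status(sample_status))
```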
Cluster & Data Monitoring
Worker Inventory
GET /api/workers
Returns a list of all workers currently registered with the coordinator, including their hardware specs and current training progress.
Response Schema (WorkerResponse):
| Field | Type | Description |
| :--- | :--- | :--- |
| id | string | Unique worker identifier |
| status | string | active, idle, failed, or recovering |
| gpu_count | integer | Number of GPUs available on this node |
| assigned_shards | integer | Number of data shards currently held |
| current_step | integer | Last reported training step |
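The schema above maps naturally onto a small typed record. The sketch below assumes the endpoint returns a JSON array of such objects; `Worker`, `parse_workers`, `count_by_status`, and `stragglers` are hypothetical names, and the notion of a "straggler" is an illustration, not something the API defines.

```python
from collections import Counter
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class Worker:
    id: str
    status: str
    gpu_count: int
    assigned_shards: int
    current_step: int


def parse_workers(payload: list) -> list:
    """Build Worker records from decoded JSON, ignoring any undocumented extra keys."""
    names = [f.name for f in fields(Worker)]
    return [Worker(**{name: item[name] for name in names}) for item in payload]


def count_by_status(workers: list) -> Counter:
    """Count workers per status value (active, idle, failed, recovering)."""
    return Counter(w.status for w in workers)


def stragglers(workers: list, max_lag: int) -> list:
    """Active workers more than `max_lag` steps behind the leading active worker."""
    active = [w for w in workers if w.status == "active"]
    if not active:
        return []
    lead = max(w.current_step for w in active)
    return [w for w in active if lead - w.current_step > max_lag]
```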
Dataset Registry
GET /api/datasets
Lists all datasets registered for sharding and distribution.
Response Schema (DatasetResponse):
| Field | Type | Description |
| :--- | :--- | :--- |
| id | string | Dataset identifier |
| total_samples | integer | Total number of records in the dataset |
| shard_count | integer | Number of logical shards created |
| format | string | e.g., parquet, tfrecord, webdataset |
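A monitoring script might use these fields to estimate shard sizes. The sketch below computes only an average; it does not assume that shards are actually equal in size, which this section does not say. `Dataset` and `parse_datasets` are hypothetical names, and the endpoint is assumed to return a JSON array.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Dataset:
    id: str
    total_samples: int
    shard_count: int
    format: str

    @property
    def mean_samples_per_shard(self) -> float:
        """Average records per shard; 0.0 when no shards exist yet."""
        if self.shard_count == 0:
            return 0.0
        return self.total_samples / self.shard_count


def parse_datasets(payload: list) -> list:
    """Build Dataset records from the decoded JSON array."""
    return [
        Dataset(item["id"], item["total_samples"], item["shard_count"], item["format"])
        for item in payload
    ]
```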
Checkpoint History
GET /api/checkpoints
Returns a list of saved model checkpoints across all workers and storage backends.
Task Management
List Tasks
GET /api/tasks
Returns all active and completed training tasks.
Create Task
POST /api/tasks
Submit a new training job to the cluster.
Request Body:
{
  "name": "ResNet-50-Imagenet",
  "type": "training",
  "dataset_id": "imagenet-2024",
  "worker_count": 8,
  "config": {
    "learning_rate": 0.001,
    "batch_size": 32
  }
}
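One way an automation script could assemble this request is sketched below. It only builds the request object and sends nothing. The `Content-Type: application/json` header is an assumption, since the section does not list required headers, and `build_create_task_request` is a hypothetical helper.

```python
import json
import urllib.request


def build_create_task_request(base_url: str, name: str, dataset_id: str,
                              worker_count: int, config: dict) -> urllib.request.Request:
    """Assemble a POST /api/tasks request mirroring the documented body."""
    body = {
        "name": name,
        "type": "training",
        "dataset_id": dataset_id,
        "worker_count": worker_count,
        "config": config,
    }
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/tasks",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},  # assumed, not documented
        method="POST",
    )


request = build_create_task_request(
    "http://localhost:3000/api",
    name="ResNet-50-Imagenet",
    dataset_id="imagenet-2024",
    worker_count=8,
    config={"learning_rate": 0.001, "batch_size": 32},
)
```

Passing `request` to `urllib.request.urlopen` would submit the job. The shape of the response is not documented here, so the sketch stops short of parsing it.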
Stop Task
POST /api/tasks/:task_id/stop
Signals a running task to stop gracefully at the next synchronization barrier.
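Because `:task_id` is a path parameter, a client should percent-encode it before building the URL. The sketch below shows one way to do that; `stop_task_url` is a hypothetical helper, and the format of real task identifiers is not specified in this section.

```python
from urllib.parse import quote


def stop_task_url(base_url: str, task_id: str) -> str:
    """Build the POST target for stopping a task, escaping the id as one path segment."""
    # safe="" also escapes "/", so an unusual id cannot alter the path structure.
    return f"{base_url.rstrip('/')}/tasks/{quote(task_id, safe='')}/stop"


print(stop_task_url("http://localhost:3000/api", "task-42"))
```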
Metrics & Observability
System Metrics
GET /api/metrics
Provides real-time performance telemetry for the entire runtime.
Response:
{
  "checkpoint_throughput": 450,
  "coordinator_rps": 1250,
  "active_workers": 64,
  "total_workers": 64,
  "barrier_latency_p99": 42,
  "shard_assignment_time": 8
}
- checkpoint_throughput: Measured in MB/s.
- barrier_latency_p99: The 99th percentile latency for worker synchronization (ms).
- shard_assignment_time: Time taken to calculate consistent hashing for the cluster (ms).
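A custom monitor could turn these numbers into simple alerts. The sketch below is illustrative only: `check_metrics` is a hypothetical helper and the 100 ms latency threshold is an arbitrary assumption, not a recommended value.

```python
def check_metrics(metrics: dict, max_barrier_ms: float = 100) -> list:
    """Return human-readable warnings derived from the /api/metrics payload."""
    warnings = []
    missing = metrics["total_workers"] - metrics["active_workers"]
    if missing > 0:
        warnings.append(f"{missing} of {metrics['total_workers']} workers are not active")
    if metrics["barrier_latency_p99"] > max_barrier_ms:
        warnings.append(
            f"barrier p99 latency {metrics['barrier_latency_p99']} ms exceeds {max_barrier_ms} ms"
        )
    return warnings
```

Applied to the sample response above, with all 64 workers active and a 42 ms barrier latency, the function returns an empty list.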
Global Logs
GET /api/logs?limit=100
Retrieves the latest system-wide log entries from the coordinator and workers.
Task-Specific Logs
GET /api/tasks/:task_id/logs
Returns logs filtered specifically for a single training job.
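The two log endpoints differ only in how they are addressed: the global one takes a `limit` query parameter, and the task-specific one takes the task id in the path. A sketch of building both URLs follows. The helper names are hypothetical, and rejecting a `limit` below 1 is a client-side assumption rather than a documented server rule.

```python
from urllib.parse import quote, urlencode


def global_logs_url(base_url: str, limit: int = 100) -> str:
    """URL for the latest system-wide log entries."""
    if limit < 1:
        raise ValueError("limit must be a positive integer")
    return f"{base_url.rstrip('/')}/logs?{urlencode({'limit': limit})}"


def task_logs_url(base_url: str, task_id: str) -> str:
    """URL for the logs of a single training task."""
    return f"{base_url.rstrip('/')}/tasks/{quote(task_id, safe='')}/logs"
```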
Aggregate State
Dashboard State
GET /api/dashboard
A "fat" endpoint that returns the complete system state (coordinator, workers, datasets, checkpoints, barriers, metrics, tasks, and logs) in a single request. It is the recommended endpoint for high-frequency dashboard polling, because one request replaces a separate call per resource and so reduces network overhead.
Response Structure:
{
  "coordinator": { ... },
  "workers": [ ... ],
  "datasets": [ ... ],
  "checkpoints": [ ... ],
  "barriers": [ ... ],
  "metrics": { ... },
  "tasks": [ ... ],
  "logs": [ ... ]
}
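A polling client built on this endpoint might look like the sketch below. `poll_dashboard` and `summarize_snapshot` are hypothetical names, `fetch` stands for any callable that returns the decoded JSON of `GET /api/dashboard`, and a suitable polling interval is left to the caller because this section does not recommend one.

```python
import time
from typing import Callable, Iterator

# Top-level keys documented above that hold lists.
LIST_SECTIONS = ("workers", "datasets", "checkpoints", "barriers", "tasks", "logs")


def poll_dashboard(fetch: Callable[[], dict], interval: float, max_polls: int) -> Iterator[dict]:
    """Yield up to `max_polls` snapshots, sleeping `interval` seconds between requests."""
    for i in range(max_polls):
        yield fetch()
        if i < max_polls - 1:
            time.sleep(interval)


def summarize_snapshot(snapshot: dict) -> dict:
    """Count the entries in each list-valued section; missing sections count as 0."""
    return {key: len(snapshot.get(key, [])) for key in LIST_SECTIONS}
```

Writing the poller as a generator lets the caller stop early, for example when a task of interest finishes, without threading a cancellation flag through the loop.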