Project Roadmap
Strata is evolving from a distributed coordination core into a comprehensive runtime for planet-scale machine learning. The following roadmap outlines the key performance and feature milestones planned for upcoming releases.
Phase 1: Zero-Copy Networking & RDMA Support
To support the next generation of LLMs where model weights exceed hundreds of gigabytes, we are moving beyond standard TCP/IP for internal data transfer.
RDMA (Remote Direct Memory Access)
We plan to introduce a dedicated RDMA transport layer using the ibverbs ecosystem. This will allow workers to read/write checkpoints directly from the memory of other nodes or high-performance storage arrays without involving the CPU.
- User Impact: Reduced CPU overhead during checkpointing and <5ms barrier latency across thousands of nodes.
- Proposed Configuration:
```yaml
# coordinator.yaml
network:
  transport: "rdma"
  device: "mlx5_0"
  port: 1
```
Shared Memory Sharding
For multi-GPU single-node setups, we are implementing a shared-memory provider for the data-shard crate. This will allow multiple local worker processes to access the same data buffer, eliminating redundant I/O.
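To make the mechanism concrete, here is a minimal sketch using Python's standard-library shared memory; the segment name `strata_shard_demo` is illustrative, and the actual data-shard API is not yet designed:

```python
from multiprocessing import shared_memory
import struct

# Producer: write four float32 records into a named shared-memory segment.
shm = shared_memory.SharedMemory(create=True, size=4 * 4, name="strata_shard_demo")
struct.pack_into("4f", shm.buf, 0, 1.0, 2.0, 3.0, 4.0)

# Consumer (conceptually another local worker process): attach to the same
# segment by name -- no copy, no redundant I/O.
reader = shared_memory.SharedMemory(name="strata_shard_demo")
values = struct.unpack_from("4f", reader.buf, 0)
print(values)  # (1.0, 2.0, 3.0, 4.0)

reader.close()
shm.close()
shm.unlink()
```

The same principle applies when the buffer holds an entire data shard: each local worker maps the segment once instead of re-reading the shard from disk.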
Phase 2: Automated Hyperparameter Tuning (HPT)
The Strata Coordinator is uniquely positioned to manage HPT sweeps because it already tracks worker lifecycle and dataset state.
Integrated Tuning Orchestrator
We are extending the TrainingOrchestrator in the Python bindings to support common tuning algorithms (e.g., Bayesian Optimization, Hyperband) directly within the runtime.
- Proposed Usage:
```python
from strata import TrainingOrchestrator, SearchSpace

# Define a search space for the coordinator to manage
space = SearchSpace()
space.add_float("learning_rate", min=1e-5, max=1e-2, log=True)
space.add_int("batch_size", values=[32, 64, 128])

# The coordinator will automatically assign different configs to idle workers
orchestrator = TrainingOrchestrator(hpt_mode="hyperband")
orchestrator.tune(my_train_fn, space=space, num_trials=100)
```
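For reference, the bracket schedule behind `hpt_mode="hyperband"` follows the published Hyperband algorithm (Li et al., 2018). The standalone sketch below enumerates each bracket's successive-halving stages and is independent of Strata's eventual API:

```python
import math

def hyperband_brackets(max_resource, eta=3):
    """Stage plan for Hyperband: each bracket is a list of
    (num_configs, resource_per_config) successive-halving stages."""
    # Largest s such that eta**s <= max_resource, via integer arithmetic.
    s_max, r = 0, max_resource
    while r >= eta:
        r //= eta
        s_max += 1
    budget = (s_max + 1) * max_resource
    brackets = []
    for s in range(s_max, -1, -1):
        n = math.ceil((budget / max_resource) * eta ** s / (s + 1))
        stages = [(n // eta ** i, max_resource // eta ** (s - i))
                  for i in range(s + 1)]
        brackets.append(stages)
    return brackets

for stages in hyperband_brackets(81):
    print(stages)  # first bracket: [(81, 1), (27, 3), (9, 9), (3, 27), (1, 81)]
```

The coordinator's job in this mode is simply to hand each stage's surviving configurations to idle workers with the stage's resource budget.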
Early Stopping & Pruning
The Coordinator will gain the ability to monitor "heartbeat metrics" (e.g., validation loss) and issue STOP commands to workers running sub-optimal trials, immediately re-assigning those resources to more promising shards.
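One plausible pruning rule is median pruning, sketched below; `should_prune` is a hypothetical helper for illustration, not part of the Coordinator API:

```python
from statistics import median

def should_prune(trial_losses, peer_losses_at_step):
    """Median pruning rule (illustrative): stop a trial whose latest
    validation loss is worse than the median of its peers at the same step."""
    if not peer_losses_at_step:
        return False  # nothing to compare against yet
    return trial_losses[-1] > median(peer_losses_at_step)

# Worker reports loss 0.9 at a step where peers reported 0.4, 0.5, 0.6.
print(should_prune([0.9], [0.4, 0.5, 0.6]))  # True -> coordinator issues STOP
```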
Phase 3: Native WebDataset & Streaming Support
While current versions support local and S3 files, Phase 3 focuses on native support for web-scale data formats.
- Native WebDataset Parsing: Moving `.tar` shard extraction into the Rust core to maximize throughput during the data-loading phase.
- S3 Select Integration: Utilizing AWS S3 Select to filter and sample data directly on the storage layer before it reaches the worker nodes, significantly reducing egress costs.
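For context, a WebDataset shard is a plain tar archive whose member names encode sample keys. The pure-Python reference below shows the extraction step the Rust core would take over (illustrative only; `iter_tar_shard` is not a Strata API):

```python
import io
import tarfile

def iter_tar_shard(fileobj):
    """Yield (key, payload) pairs from a WebDataset-style .tar shard,
    where the key is the member name with its extension stripped."""
    with tarfile.open(fileobj=fileobj, mode="r|") as tar:  # streaming read
        for member in tar:
            if member.isfile():
                key, _, _ext = member.name.rpartition(".")
                yield key or member.name, tar.extractfile(member).read()

# Build a tiny in-memory shard with one sample for demonstration.
buf = io.BytesIO()
with tarfile.open(fileobj=buf, mode="w") as tar:
    info = tarfile.TarInfo("sample0.txt")
    payload = b"hello"
    info.size = len(payload)
    tar.addfile(info, io.BytesIO(payload))
buf.seek(0)

print(list(iter_tar_shard(buf)))  # [('sample0', b'hello')]
```

The `mode="r|"` streaming mode matters here: it lets a worker consume shards sequentially from a network socket or S3 stream without seeking.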
Phase 4: Observability & Resilience
- Interactive Checkpoint Explorer: A dashboard update to allow users to "peek" inside saved checkpoints, inspecting model metadata or weight distributions without a full restore.
- Elastic Scaling (K8s Operator): A dedicated Kubernetes operator that monitors `coordinator-rps` and `barrier-latency` to automatically scale the number of workers based on training throughput targets.
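A hypothetical sketch of the scaling decision such an operator might make; the function name, thresholds, and 2x growth cap are assumptions, not the operator's actual interface:

```python
import math

def desired_workers(current, samples_per_sec, target_samples_per_sec,
                    barrier_latency_ms, max_barrier_ms=5.0):
    """Illustrative scaling policy: scale the worker count toward the
    throughput target, but never grow while barrier latency is over budget."""
    if barrier_latency_ms > max_barrier_ms:
        return max(1, current - 1)  # coordination is the bottleneck: contract
    wanted = math.ceil(current * target_samples_per_sec / samples_per_sec)
    return max(1, min(wanted, 2 * current))  # cap growth at 2x per decision

print(desired_workers(8, 400.0, 1000.0, 2.0))  # 16 (capped from 20)
print(desired_workers(8, 400.0, 1000.0, 9.0))  # 7 (barrier over budget)
```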
Community & Feedback
As an open-source portfolio project, roadmap priorities are influenced by community interest. If you are interested in a particular feature (such as an additional storage backend or a framework integration), please open an issue in the repository.