Project Roadmap
Strata is evolving from a distributed coordination core into a comprehensive runtime for planet-scale machine learning. The following roadmap outlines the key performance and feature milestones planned for upcoming releases.
Phase 1: Zero-Copy Networking & RDMA Support
To support the next generation of LLMs where model weights exceed hundreds of gigabytes, we are moving beyond standard TCP/IP for internal data transfer.
RDMA (Remote Direct Memory Access)
We plan to introduce a dedicated RDMA transport layer using the ibverbs ecosystem. This will allow workers to read/write checkpoints directly from the memory of other nodes or high-performance storage arrays without involving the CPU.
- User Impact: Reduced CPU overhead during checkpointing and <5ms barrier latency across thousands of nodes.
- Proposed Configuration:
```yaml
# coordinator.yaml
network:
  transport: "rdma"
  device: "mlx5_0"
  port: 1
```
Shared Memory Sharding
For multi-GPU single-node setups, we are implementing a shared-memory provider for the data-shard crate. This will allow multiple local worker processes to access the same data buffer, eliminating redundant I/O.
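To make the mechanism concrete, here is a minimal sketch using Python's standard-library shared memory; the segment name `strata_shard_demo` is illustrative, and the actual data-shard API is not yet designed:

```python
from multiprocessing import shared_memory
import struct

# Producer: write four float32 records into a named shared-memory segment.
shm = shared_memory.SharedMemory(create=True, size=4 * 4, name="strata_shard_demo")
struct.pack_into("4f", shm.buf, 0, 1.0, 2.0, 3.0, 4.0)

# Consumer (conceptually another local worker process): attach to the same
# segment by name -- no copy, no redundant I/O.
reader = shared_memory.SharedMemory(name="strata_shard_demo")
values = struct.unpack_from("4f", reader.buf, 0)
print(values)  # (1.0, 2.0, 3.0, 4.0)

reader.close()
shm.close()
shm.unlink()
```

The same principle applies when the buffer holds an entire data shard: each local worker maps the segment once instead of re-reading the shard from disk.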
Phase 2: Automated Hyperparameter Tuning (HPT)
The Strata Coordinator is uniquely positioned to manage HPT sweeps because it already tracks worker lifecycle and dataset state.
Integrated Tuning Orchestrator
We are extending the TrainingOrchestrator in the Python bindings to support common tuning algorithms (e.g., Bayesian Optimization, Hyperband) directly within the runtime.
- Proposed Usage:
```python
from strata import TrainingOrchestrator, SearchSpace

# Define a search space for the coordinator to manage
space = SearchSpace()
space.add_float("learning_rate", min=1e-5, max=1e-2, log=True)
space.add_int("batch_size", values=[32, 64, 128])

# The coordinator will automatically assign different configs to idle workers
orchestrator = TrainingOrchestrator(hpt_mode="hyperband")
orchestrator.tune(my_train_fn, space=space, num_trials=100)
```
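For reference, the bracket schedule behind `hpt_mode="hyperband"` follows the published Hyperband algorithm (Li et al., 2018). The standalone sketch below enumerates each bracket's successive-halving stages and is independent of Strata's eventual API:

```python
import math

def hyperband_brackets(max_resource, eta=3):
    """Stage plan for Hyperband: each bracket is a list of
    (num_configs, resource_per_config) successive-halving stages."""
    # Largest s such that eta**s <= max_resource, via integer arithmetic.
    s_max, r = 0, max_resource
    while r >= eta:
        r //= eta
        s_max += 1
    budget = (s_max + 1) * max_resource
    brackets = []
    for s in range(s_max, -1, -1):
        n = math.ceil((budget / max_resource) * eta ** s / (s + 1))
        stages = [(n // eta ** i, max_resource // eta ** (s - i))
                  for i in range(s + 1)]
        brackets.append(stages)
    return brackets

for stages in hyperband_brackets(81):
    print(stages)  # first bracket: [(81, 1), (27, 3), (9, 9), (3, 27), (1, 81)]
```

The coordinator's job in this mode is simply to hand each stage's surviving configurations to idle workers with the stage's resource budget.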
Early Stopping & Pruning
The Coordinator will gain the ability to monitor "heartbeat metrics" (e.g., validation loss) and issue STOP commands to workers running sub-optimal trials, immediately re-assigning those resources to more promising shards.
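One plausible pruning rule is median pruning, sketched below; `should_prune` is a hypothetical helper for illustration, not part of the Coordinator API:

```python
from statistics import median

def should_prune(trial_losses, peer_losses_at_step):
    """Median pruning rule (illustrative): stop a trial whose latest
    validation loss is worse than the median of its peers at the same step."""
    if not peer_losses_at_step:
        return False  # nothing to compare against yet
    return trial_losses[-1] > median(peer_losses_at_step)

# Worker reports loss 0.9 at a step where peers reported 0.4, 0.5, 0.6.
print(should_prune([0.9], [0.4, 0.5, 0.6]))  # True -> coordinator issues STOP
```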
Phase 3: Native WebDataset & Streaming Support
While current versions support local and S3 files, Phase 3 focuses on native support for web-scale data formats.
- Native WebDataset Parsing: Moving `.tar` shard extraction into the Rust core to maximize throughput during the data-loading phase.
- S3 Select Integration: Utilizing AWS S3 Select to filter and sample data directly on the storage layer before it reaches the worker nodes, significantly reducing egress costs.
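For context, a WebDataset shard is a plain tar archive whose member names encode sample keys. The pure-Python reference below shows the extraction step the Rust core would take over (illustrative only; `iter_tar_shard` is not a Strata API):

```python
import io
import tarfile

def iter_tar_shard(fileobj):
    """Yield (key, payload) pairs from a WebDataset-style .tar shard,
    where the key is the member name with its extension stripped."""
    with tarfile.open(fileobj=fileobj, mode="r|") as tar:  # streaming read
        for member in tar:
            if member.isfile():
                key, _, _ext = member.name.rpartition(".")
                yield key or member.name, tar.extractfile(member).read()

# Build a tiny in-memory shard with one sample for demonstration.
buf = io.BytesIO()
with tarfile.open(fileobj=buf, mode="w") as tar:
    info = tarfile.TarInfo("sample0.txt")
    payload = b"hello"
    info.size = len(payload)
    tar.addfile(info, io.BytesIO(payload))
buf.seek(0)

print(list(iter_tar_shard(buf)))  # [('sample0', b'hello')]
```

The `mode="r|"` streaming mode matters here: it lets a worker consume shards sequentially from a network socket or S3 stream without seeking.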
Phase 4: Observability & Resilience
- Interactive Checkpoint Explorer: A dashboard update to allow users to "peek" inside saved checkpoints, inspecting model metadata or weight distributions without a full restore.
- Elastic Scaling (K8s Operator): A dedicated Kubernetes operator that monitors `coordinator-rps` and `barrier-latency` to automatically scale the number of workers based on training throughput targets.
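A hypothetical sketch of the scaling decision such an operator might make; the function name, thresholds, and 2x growth cap are assumptions, not the operator's actual interface:

```python
import math

def desired_workers(current, samples_per_sec, target_samples_per_sec,
                    barrier_latency_ms, max_barrier_ms=5.0):
    """Illustrative scaling policy: scale the worker count toward the
    throughput target, but never grow while barrier latency is over budget."""
    if barrier_latency_ms > max_barrier_ms:
        return max(1, current - 1)  # coordination is the bottleneck: contract
    wanted = math.ceil(current * target_samples_per_sec / samples_per_sec)
    return max(1, min(wanted, 2 * current))  # cap growth at 2x per decision

print(desired_workers(8, 400.0, 1000.0, 2.0))  # 16 (capped from 20)
print(desired_workers(8, 400.0, 1000.0, 9.0))  # 7 (barrier over budget)
```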
Community & Feedback
As an open-source portfolio project, roadmap priorities are influenced by community interest. If you are interested in a particular feature (such as an additional storage backend or a framework integration), please open an issue in the repository.