# Storage Backends
Strata provides a pluggable storage layer designed for high-throughput persistence of training checkpoints and dataset shards. A unified `StorageBackend` trait lets the runtime abstract over the differences between storage providers while maintaining high-performance asynchronous I/O.
## Available Backends
Strata currently supports two primary storage backends:
| Backend | Feature Flag | Use Case |
|---------|--------------|----------|
| Local | Default | Development, single-node training, and high-speed local NVMe caching. |
| S3 | `s3` | Distributed production training, multi-worker checkpointing, and long-term persistence. |
## Configuration
The storage backend is configured primarily through environment variables or via the `CheckpointManagerConfig` in Rust.
### Environment Variables
To switch between backends in a deployed environment (e.g., Docker or Kubernetes), set the following variables:
```bash
# Set the backend type: 'local' or 's3'
STORAGE_BACKEND=s3

# S3-specific configuration
AWS_ACCESS_KEY_ID=your_access_key
AWS_SECRET_ACCESS_KEY=your_secret_key
AWS_REGION=us-east-1
CHECKPOINT_BUCKET=my-training-checkpoints

# Local-specific configuration
# Checkpoints will be saved relative to this path
LOCAL_STORAGE_PATH=/tmp/strata/checkpoints
```
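The snippet below sketches how a service might act on `STORAGE_BACKEND` at startup. The `BackendKind` enum and `backend_from` helper are illustrative stand-ins for this sketch, not part of Strata's actual API:

```rust
use std::env;

// Hypothetical enum standing in for whatever backend handle Strata constructs.
#[derive(Debug, PartialEq)]
enum BackendKind {
    Local,
    S3,
}

// Treat anything other than an explicit "s3" as local, matching the
// documented default of the Local backend.
fn backend_from(value: Option<&str>) -> BackendKind {
    match value {
        Some("s3") => BackendKind::S3,
        _ => BackendKind::Local,
    }
}

fn main() {
    // In a deployment this reads STORAGE_BACKEND from the environment.
    let kind = backend_from(env::var("STORAGE_BACKEND").ok().as_deref());
    println!("selected backend: {kind:?}");
}
```

Keeping the parsing in a small pure function like `backend_from` makes the selection logic easy to unit-test without mutating process environment.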
### Rust Feature Flags
If you are using the `storage` crate as a dependency, enable the `s3` feature in your `Cargo.toml` when you need cloud storage:
```toml
[dependencies]
storage = { version = "0.1.0", features = ["s3"] }
```
## Usage in Rust
The `StorageBackend` trait provides a consistent async interface for file operations.
### The `StorageBackend` Trait
All providers implement the following core interface:
```rust
pub trait StorageBackend: Send + Sync {
    /// Write data to a specific path
    async fn write(&self, path: &str, data: Bytes) -> Result<()>;

    /// Read data from a specific path
    async fn read(&self, path: &str) -> Result<Bytes>;

    /// Check if a file exists
    async fn exists(&self, path: &str) -> Result<bool>;

    /// Delete a file
    async fn delete(&self, path: &str) -> Result<()>;
}
```
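To illustrate the contract, here is a dependency-free sketch of a toy in-memory implementation, the kind of thing that is handy in unit tests. It substitutes `Vec<u8>` for `bytes::Bytes`, a `String` error for the crate's real error type, and a hand-rolled `block_on` for tokio so it runs on plain `std` (Rust 1.75+ for `async fn` in traits); `MemoryStorage` is hypothetical, not part of the crate:

```rust
use std::collections::HashMap;
use std::future::Future;
use std::pin::pin;
use std::sync::{Arc, Mutex};
use std::task::{Context, Poll, Wake, Waker};

// Stand-in result type for the sketch; the real crate has its own.
type Result<T> = std::result::Result<T, String>;

// The trait from the docs, with Vec<u8> in place of bytes::Bytes.
trait StorageBackend: Send + Sync {
    async fn write(&self, path: &str, data: Vec<u8>) -> Result<()>;
    async fn read(&self, path: &str) -> Result<Vec<u8>>;
    async fn exists(&self, path: &str) -> Result<bool>;
    async fn delete(&self, path: &str) -> Result<()>;
}

// A toy in-memory backend: a mutex-guarded map from path to bytes.
#[derive(Default)]
struct MemoryStorage {
    files: Mutex<HashMap<String, Vec<u8>>>,
}

impl StorageBackend for MemoryStorage {
    async fn write(&self, path: &str, data: Vec<u8>) -> Result<()> {
        self.files.lock().unwrap().insert(path.to_string(), data);
        Ok(())
    }
    async fn read(&self, path: &str) -> Result<Vec<u8>> {
        self.files.lock().unwrap().get(path).cloned()
            .ok_or_else(|| format!("not found: {path}"))
    }
    async fn exists(&self, path: &str) -> Result<bool> {
        Ok(self.files.lock().unwrap().contains_key(path))
    }
    async fn delete(&self, path: &str) -> Result<()> {
        self.files.lock().unwrap().remove(path);
        Ok(())
    }
}

// Minimal executor so the example runs without tokio; these futures
// never yield, so polling in a loop with a no-op waker is enough.
struct NoopWaker;
impl Wake for NoopWaker {
    fn wake(self: Arc<Self>) {}
}

fn block_on<F: Future>(fut: F) -> F::Output {
    let mut fut = pin!(fut);
    let waker = Waker::from(Arc::new(NoopWaker));
    let mut cx = Context::from_waker(&waker);
    loop {
        if let Poll::Ready(out) = fut.as_mut().poll(&mut cx) {
            return out;
        }
    }
}

fn main() {
    let storage = MemoryStorage::default();
    block_on(async {
        storage.write("epoch_1/model.bin", vec![0u8; 16]).await.unwrap();
        assert!(storage.exists("epoch_1/model.bin").await.unwrap());
        storage.delete("epoch_1/model.bin").await.unwrap();
        assert!(!storage.exists("epoch_1/model.bin").await.unwrap());
    });
    println!("in-memory backend ok");
}
```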
### Initializing Local Storage
Local storage maps a root directory on the host machine to the storage interface.
```rust
use storage::LocalStorage;
use bytes::Bytes;

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    let storage = LocalStorage::new("/mnt/nvme/checkpoints");

    // Write a checkpoint
    storage.write("epoch_1/model.bin", Bytes::from(vec![0; 1024])).await?;
    Ok(())
}
```
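Under the hood, a local backend's `write` plausibly amounts to joining the key onto the root directory, creating any missing parent directories, and writing the file. A std-only sketch of that flow (the `local_write` helper is hypothetical, not the crate's internals):

```rust
use std::fs;
use std::path::{Path, PathBuf};

// Resolve the storage key against the root, create parent directories,
// and write the payload; returns the full path that was written.
fn local_write(root: &Path, key: &str, data: &[u8]) -> std::io::Result<PathBuf> {
    let full = root.join(key);
    if let Some(parent) = full.parent() {
        fs::create_dir_all(parent)?;
    }
    fs::write(&full, data)?;
    Ok(full)
}

fn main() -> std::io::Result<()> {
    let root = std::env::temp_dir().join("strata_local_demo");
    let path = local_write(&root, "epoch_1/model.bin", &[0u8; 1024])?;
    assert_eq!(fs::read(&path)?.len(), 1024);
    fs::remove_dir_all(&root)?; // clean up the demo directory
    println!("wrote and verified a local checkpoint");
    Ok(())
}
```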
### Initializing S3 Storage
The S3 backend uses the AWS SDK for Rust and supports multipart uploads for large checkpoint files.
```rust
use storage::S3Storage;

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Automatically loads credentials from environment or IAM roles
    let storage = S3Storage::new("my-bucket-name", "us-east-1").await?;

    let data = storage.read("checkpoints/latest.pt").await?;
    println!("Read {} bytes from S3", data.len());
    Ok(())
}
```
## Performance Considerations
Strata's storage layer is optimized for ML workloads:
- Non-blocking I/O: All operations use `tokio` to ensure the coordinator or worker remains responsive during heavy disk/network activity.
- Throughput: Local NVMe storage can reach up to 500 MB/s, while S3 throughput averages 200 MB/s depending on network conditions and instance types.
- Multipart Uploads: When using the S3 backend, large model state files (GBs) are automatically handled via multipart uploads to prevent timeout issues and improve reliability.
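The chunking behind a multipart upload can be sketched in a few lines. The 8 MiB part size and `split_into_parts` helper here are assumptions for illustration, not Strata's actual settings (S3 itself requires non-final parts to be at least 5 MiB):

```rust
// Assumed part size for this sketch; real backends tune this per workload.
const PART_SIZE: usize = 8 * 1024 * 1024;

// Split a payload into fixed-size parts; only the final part may be short.
fn split_into_parts(data: &[u8], part_size: usize) -> Vec<&[u8]> {
    data.chunks(part_size).collect()
}

fn main() {
    // A 20 MiB "checkpoint" splits into 8 + 8 + 4 MiB parts.
    let payload = vec![0u8; 20 * 1024 * 1024];
    let parts = split_into_parts(&payload, PART_SIZE);
    assert_eq!(parts.len(), 3);
    assert_eq!(parts.last().unwrap().len(), 4 * 1024 * 1024);
    println!("uploading {} parts", parts.len());
}
```

Each part can then be uploaded (and retried) independently, which is what makes multipart uploads more robust than a single large PUT.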
## Data Structure
When checkpoints are persisted, they are typically stored with the following structure regardless of the backend:
```
{root}/
├── {dataset_id}/
│   └── shards/
└── {job_id}/
    └── checkpoints/
        ├── step_1000/
        │   ├── metadata.json
        │   └── model.bin
        └── step_2000/
            ├── metadata.json
            └── model.bin
```
Metadata for these files is managed by the `CheckpointManager` and tracked in the Coordinator's state for fast recovery during worker failures.
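For illustration, a helper that builds object keys matching the layout above might look like this; `checkpoint_key` is a hypothetical name, not part of the crate:

```rust
// Build a storage key of the form {job_id}/checkpoints/step_{step}/{file},
// matching the documented directory layout.
fn checkpoint_key(job_id: &str, step: u64, file: &str) -> String {
    format!("{job_id}/checkpoints/step_{step}/{file}")
}

fn main() {
    let key = checkpoint_key("job-42", 1000, "model.bin");
    assert_eq!(key, "job-42/checkpoints/step_1000/model.bin");
    println!("{key}");
}
```

Centralizing key construction like this keeps the on-disk and in-bucket layouts identical, so checkpoints written by one backend can be restored through the other.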