# AWS & S3 Configuration
In large-scale distributed training, persistent and scalable storage is critical for model checkpoints and dataset sharding. Strata supports Amazon S3 (and S3-compatible APIs) as a high-performance backend via the `storage` crate.
## Prerequisites
Before configuring the runtime for AWS, ensure you have:
- An AWS account with permissions to manage S3 and IAM.
- The AWS CLI installed and configured (`aws configure`).
- The Strata `storage` crate compiled with the `s3` feature (enabled by default in production builds).
## Configuration Environment Variables
The runtime identifies and authenticates with S3 using the following environment variables. These should be defined in your `.env` file or exported in your shell.
| Variable | Description | Default | Required for S3 |
|----------|-------------|---------|-----------------|
| `STORAGE_BACKEND` | Set to `s3` to enable the S3 driver. | `local` | Yes |
| `CHECKPOINT_BUCKET` | The name of the S3 bucket for checkpoints. | - | Yes |
| `AWS_REGION` | The AWS region (e.g., `us-west-2`). | `us-east-1` | Yes |
| `AWS_ACCESS_KEY_ID` | Your AWS access key. | - | Yes |
| `AWS_SECRET_ACCESS_KEY` | Your AWS secret key. | - | Yes |
| `S3_ENDPOINT` | Custom endpoint for S3-compatible storage (e.g., MinIO). | - | No |
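Taken together, a minimal `.env` for an S3-backed deployment might look like the following sketch (the bucket name, region, and credential values are placeholders to substitute with your own):

```sh
# .env — S3 backend configuration (placeholder values)
STORAGE_BACKEND=s3
CHECKPOINT_BUCKET=my-checkpoint-bucket
AWS_REGION=us-west-2
AWS_ACCESS_KEY_ID=AKIAXXXXXXXXXXXXXXXX
AWS_SECRET_ACCESS_KEY=replace-with-your-secret-key
# S3_ENDPOINT=http://minio.internal:9000   # only for S3-compatible stores such as MinIO
```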
## Automated Infrastructure Setup
The repository includes a utility script to automate bucket creation and configuration. This script ensures that the bucket is optimized for high-throughput training workloads.
```sh
# Creates the bucket and applies lifecycle policies
./scripts/setup-aws.sh
```
## S3 Lifecycle Policies
Large-scale training can generate terabytes of checkpoint data daily. To manage costs and storage bloat, we recommend applying a lifecycle policy to your `CHECKPOINT_BUCKET`.
The `setup-aws.sh` script applies the following logic by default:
- In-Progress Multipart Uploads: Aborted after 7 days to prevent "ghost" storage costs.
- Old Checkpoints: Transitioned to infrequent access or deleted after a specified number of training epochs (configurable in the script).
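For reference, a lifecycle configuration along these lines might look like the sketch below. The `checkpoints/` prefix and the 30-day transition to `STANDARD_IA` are illustrative values, not guaranteed to match what `setup-aws.sh` applies; note also that S3 lifecycle rules are expressed in days since object creation, not training epochs, so the script's epoch setting must map onto an age threshold.

```json
{
  "Rules": [
    {
      "ID": "abort-stale-multipart",
      "Status": "Enabled",
      "Filter": { "Prefix": "" },
      "AbortIncompleteMultipartUpload": { "DaysAfterInitiation": 7 }
    },
    {
      "ID": "tier-old-checkpoints",
      "Status": "Enabled",
      "Filter": { "Prefix": "checkpoints/" },
      "Transitions": [
        { "Days": 30, "StorageClass": "STANDARD_IA" }
      ]
    }
  ]
}
```

A configuration like this can also be applied by hand with `aws s3api put-bucket-lifecycle-configuration --bucket "$CHECKPOINT_BUCKET" --lifecycle-configuration file://lifecycle.json`.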
## IAM Permissions
The IAM user or role used by the coordinator and workers requires the following minimal policy to function correctly:
```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "s3:PutObject",
        "s3:GetObject",
        "s3:ListBucket",
        "s3:DeleteObject",
        "s3:AbortMultipartUpload"
      ],
      "Resource": [
        "arn:aws:s3:::your-checkpoint-bucket-name",
        "arn:aws:s3:::your-checkpoint-bucket-name/*"
      ]
    }
  ]
}
```
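One way to attach this policy is as an inline role policy via the AWS CLI. The role and policy names below are placeholders; save the JSON above as `s3-policy.json` first:

```sh
aws iam put-role-policy \
  --role-name StrataTrainingRole \
  --policy-name strata-checkpoint-s3-access \
  --policy-document file://s3-policy.json
```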
## Usage in Rust
If you are using the `storage` crate directly in a custom implementation, the `S3Storage` backend provides an asynchronous interface for multipart uploads, which is essential for checkpoints exceeding 5 GB (the maximum size of a single-request upload to S3).
```rust
use storage::{S3Storage, StorageBackend};
use bytes::Bytes;

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    // S3Storage automatically reads from environment variables
    let storage = S3Storage::new("my-checkpoint-bucket").await?;

    // Perform high-speed checkpoint write
    let data = Bytes::from(vec![0u8; 1024]);
    storage.write("checkpoints/epoch_10/model.bin", data).await?;

    Ok(())
}
```
## Performance Tuning
To achieve the benchmarked 200 MB/s S3 throughput:
- Region Locality: Ensure your compute nodes (EC2/EKS) are in the same AWS region as your S3 bucket to minimize latency and avoid data egress costs.
- Multipart Chunk Size: The runtime defaults to 8 MB chunks for uploads. For multi-gigabyte models, you can increase this via `RuntimeConfig` to reduce the number of HTTP requests.
- VPC Endpoints: When running in a private VPC, use an S3 Gateway Endpoint to route traffic over the AWS network backbone rather than the public internet.
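The chunk size also interacts with a hard S3 limit: a multipart upload may have at most 10,000 parts, each at least 5 MiB (except the last), so at the 8 MB default an upload tops out around 78 GiB. A back-of-the-envelope sketch for picking a chunk size (`min_chunk_size` is illustrative, not part of the `storage` API):

```rust
// S3 multipart limits: at most 10,000 parts per upload; each part
// must be at least 5 MiB, except the last.
const MAX_PARTS: u64 = 10_000;
const MIN_PART: u64 = 5 * 1024 * 1024;

/// Smallest chunk size (in bytes) that fits `object_bytes` into
/// 10,000 parts while respecting the 5 MiB part-size floor.
fn min_chunk_size(object_bytes: u64) -> u64 {
    let needed = (object_bytes + MAX_PARTS - 1) / MAX_PARTS; // ceiling division
    needed.max(MIN_PART)
}

fn main() {
    // A 10 GiB checkpoint fits easily; the 5 MiB floor applies.
    assert_eq!(min_chunk_size(10 * 1024 * 1024 * 1024), MIN_PART);

    // A 200 GiB checkpoint needs parts of at least ~20.5 MiB,
    // so the 8 MB default must be raised.
    let chunk = min_chunk_size(200 * 1024 * 1024 * 1024);
    assert!(chunk > 8 * 1024 * 1024);
    println!("200 GiB checkpoint: chunk size >= {chunk} bytes");
}
```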