Testing Framework

Strata employs a multi-layered testing strategy to ensure the reliability of distributed training operations, covering everything from core logic in Rust to end-to-end training simulations in Python.

1. Rust Integration Tests

The core logic for shard distribution, consistent hashing, and state management is validated through Rust integration tests. These tests simulate worker failures and rebalancing events to ensure data consistency.

Key Test Suites

Shard Rebalancing: Validates that shards are redistributed correctly when workers join or leave the cluster.
Barrier Synchronization: Ensures that the coordinator correctly manages worker arrivals and releases barriers only when the quorum is met.
Storage Backends: Verifies the async I/O path for both local and S3 storage.
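
The barrier property listed above can be illustrated with a small Python sketch. This is not Strata's Rust coordinator: the `Barrier` class, its `arrive` method, and the `quorum` parameter are hypothetical names chosen for illustration, and the sketch assumes that a repeated arrival from the same worker must not count twice.

```python
class Barrier:
    """Hypothetical quorum barrier: releases once `quorum` distinct workers arrive."""

    def __init__(self, quorum: int) -> None:
        if quorum < 1:
            raise ValueError("quorum must be at least 1")
        self.quorum = quorum
        self.arrived = set()  # IDs of workers that have reached the barrier

    def arrive(self, worker_id: str) -> bool:
        """Record an arrival and return True once the quorum is met."""
        self.arrived.add(worker_id)  # a set counts duplicate arrivals once
        return len(self.arrived) >= self.quorum


def check_barrier_releases_only_at_quorum() -> None:
    barrier = Barrier(quorum=3)
    assert barrier.arrive("worker-01") is False
    assert barrier.arrive("worker-01") is False  # repeat arrival, still one worker
    assert barrier.arrive("worker-02") is False
    assert barrier.arrive("worker-03") is True   # third distinct worker meets quorum


check_barrier_releases_only_at_quorum()
```

A real integration test would likely also cover timeouts and concurrent arrivals, which this single-threaded sketch leaves out.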

To run the Rust test suite:

# Run all tests across all crates
cargo test

# Run tests for a specific crate (e.g., data-shard)
cargo test -p data-shard

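# Run tests with logging enabled for debugging
# (--nocapture also shows output from passing tests, which the test harness hides by default)
RUST_LOG=debug cargo test -- --nocapture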

2. Python End-to-End Simulations

Because the runtime is designed for ML practitioners, the framework includes Python-based simulations. These tests use the python-bindings to create a virtual training environment where multiple "workers" perform mock training loops.

Example: Simulating a Training Loop

You can use the TrainingOrchestrator to verify how the runtime handles epoch transitions and checkpointing:

import dtruntime
import time

def test_training_orchestration():
    # Initialize the orchestrator
    orchestrator = dtruntime.TrainingOrchestrator(
        coordinator_url="localhost:50051",
        worker_id="worker-01"
    )

    # Register for a dataset
    shard_info = orchestrator.get_shards("imagenet_total")
    
    # Simulate training steps
    for epoch in range(2):
        for step, batch in enumerate(shard_info):
            # Simulate work
            time.sleep(0.1)
            
            # Save a checkpoint every 100 steps (step 0 included, since 0 % 100 == 0)
            if step % 100 == 0:
                orchestrator.save_checkpoint(step=step, epoch=epoch)
        
        # Advance to next epoch via the coordinator
        orchestrator.advance_epoch()
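
The loop above needs a running coordinator and makes no assertions. One way to check the checkpoint cadence in isolation is to run the same loop against an in-memory stand-in. The sketch below assumes only the three method names used in the example; `FakeOrchestrator`, `run_training`, and the `batches_per_epoch` and `checkpoint_every` parameters are hypothetical and not part of `dtruntime`.

```python
class FakeOrchestrator:
    """Hypothetical in-memory stand-in exposing the three methods used in the example."""

    def __init__(self, batches_per_epoch: int) -> None:
        self._batches = list(range(batches_per_epoch))
        self.epoch = 0
        self.checkpoints = []  # (epoch, step) pairs in the order they were saved

    def get_shards(self, dataset: str):
        # The dataset name is ignored; every call yields the same mock batches.
        return self._batches

    def save_checkpoint(self, step: int, epoch: int) -> None:
        self.checkpoints.append((epoch, step))

    def advance_epoch(self) -> None:
        self.epoch += 1


def run_training(orchestrator, epochs: int, checkpoint_every: int) -> None:
    """Same control flow as the example loop, without the simulated sleep."""
    shard_info = orchestrator.get_shards("imagenet_total")
    for epoch in range(epochs):
        for step, _batch in enumerate(shard_info):
            if step % checkpoint_every == 0:
                orchestrator.save_checkpoint(step=step, epoch=epoch)
        orchestrator.advance_epoch()


def test_checkpoint_cadence():
    fake = FakeOrchestrator(batches_per_epoch=250)
    run_training(fake, epochs=2, checkpoint_every=100)
    assert fake.epoch == 2
    # Steps 0, 100, and 200 are checkpointed in each of the two epochs.
    assert fake.checkpoints == [(0, 0), (0, 100), (0, 200), (1, 0), (1, 100), (1, 200)]


test_checkpoint_cadence()
```

Because the stand-in records calls instead of contacting a coordinator, the test runs instantly and fails loudly if the cadence changes.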

To run the Python test suite:

pytest tests/python/

3. Distributed Fault Tolerance Tests

Strata includes specialized tests to verify system resilience. These are located in crates/data-shard/src/lib.rs and can be used as a reference for implementing custom rebalancing logic.

| Test Case | Description |
|-----------|-------------|
| test_full_workflow | Validates registration, shard assignment, and epoch advancement. |
| test_worker_failure_recovery | Simulates a worker dropping out and confirms that >80% of shards remain stable due to consistent hashing. |
| test_barrier_timeout | Ensures the coordinator handles stale or unresponsive workers during synchronization. |
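
The stability property behind test_worker_failure_recovery can be reproduced with a minimal consistent-hash ring. The sketch below is a Python illustration of the general technique, not the data-shard crate's implementation: the `HashRing` class, the use of SHA-256, the 100 virtual nodes per worker, and the worker and shard names are all assumptions.

```python
import bisect
import hashlib


def _hash(key: str) -> int:
    """Map a string to a point on the ring (stable across runs, unlike hash())."""
    return int.from_bytes(hashlib.sha256(key.encode()).digest()[:8], "big")


class HashRing:
    """Hypothetical consistent-hash ring with virtual nodes per worker."""

    def __init__(self, workers, vnodes=100):
        # Each worker contributes `vnodes` points, which evens out its share of the ring.
        self._points = sorted(
            (_hash(f"{w}#{i}"), w) for w in workers for i in range(vnodes)
        )
        self._keys = [point for point, _ in self._points]

    def owner(self, shard_id: str) -> str:
        """Return the worker at the first ring point at or after the shard's hash."""
        idx = bisect.bisect_left(self._keys, _hash(shard_id)) % len(self._keys)
        return self._points[idx][1]


def stable_fraction(workers, failed, shards):
    """Fraction of shards that keep their owner after `failed` leaves the ring."""
    before = HashRing(workers)
    after = HashRing([w for w in workers if w != failed])
    moved = [s for s in shards if before.owner(s) != after.owner(s)]
    # Only shards that belonged to the failed worker should be reassigned.
    assert all(before.owner(s) == failed for s in moved)
    return 1 - len(moved) / len(shards)


workers = [f"worker-{i:02d}" for i in range(10)]
shards = [f"shard-{i}" for i in range(1000)]
fraction = stable_fraction(workers, "worker-03", shards)
assert fraction > 0.8
print(f"stable fraction: {fraction:.3f}")
```

With ten workers, the failed worker owns roughly a tenth of the ring, so about nine tenths of the shards are expected to stay put. A naive `hash(shard) % num_workers` scheme would instead reshuffle most shards when the worker count changes.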

4. Performance Benchmarking

For performance-critical components like checkpoint I/O and shard assignment, we use Criterion.rs. These benchmarks measure throughput and p99 latency.

# Run all benchmarks
cargo bench

# Benchmark specific components (e.g., storage throughput)
cargo bench --bench storage_ops

Key Metrics Tracked:

Barrier Latency: The time taken to release 100+ workers.
Shard Assignment: Latency for assigning 10k shards across 1k workers.
Checkpoint Throughput: Sequential and parallel write speeds to S3 vs. local disk.
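
To make the p99 figure concrete, the sketch below shows one way to derive it from raw latency samples. It is independent of Criterion and of Strata's benchmark code: `assign_round_robin` is a placeholder workload (plain round-robin, not consistent hashing), and `measure` and `p99` are hypothetical helper names.

```python
import statistics
import time


def assign_round_robin(num_shards, num_workers):
    """Placeholder workload: map each shard index to a worker index."""
    return {shard: shard % num_workers for shard in range(num_shards)}


def p99(samples):
    """99th percentile: the last of the 99 cut points that split the data into 100 groups."""
    return statistics.quantiles(samples, n=100)[98]


def measure(fn, iterations=200):
    """Call `fn` repeatedly and return the per-call latencies in seconds."""
    latencies = []
    for _ in range(iterations):
        start = time.perf_counter()
        fn()
        latencies.append(time.perf_counter() - start)
    return latencies


# 10k shards across 1k workers, matching the sizes quoted in the metrics list.
latencies = measure(lambda: assign_round_robin(10_000, 1_000))
print(f"median={statistics.median(latencies):.6f}s p99={p99(latencies):.6f}s")
```

Reporting the median alongside p99 helps separate typical cost from tail latency. The absolute numbers depend on the machine and are not comparable to the Rust benchmarks.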

5. Dashboard Mock Mode

For UI and frontend development, the dashboard includes a Demo Mode. It serves simulated WebSocket and API data, so you can test dashboard components and visualizations without running the full gRPC coordinator.

To start the dashboard in demo mode:

cd dashboard
npm run dev -- --mode demo