Resilient Compute
Overview
Resilient Compute provides fault-tolerant ML execution for radiation environments. Space radiation causes Single Event Upsets (SEUs) that flip bits in memory and computation. Instead of ignoring this reality, we build detection and recovery directly into the inference pipeline.
The Problem
In low Earth orbit (LEO), a typical compute system experiences:
- ~10-100 SEUs per day in unshielded memory
- Silent data corruption in weights and activations
- Accumulated errors that compound through layers
- Unpredictable failures in traditional ML pipelines
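To make the effect of silent corruption concrete, the following sketch flips a single bit in the float32 encoding of a weight. The helper name `flip_bit` is hypothetical and is not part of the Resilient Compute API; it only illustrates why one upset in an exponent bit is far more damaging than one in a low mantissa bit.

```python
import struct


def flip_bit(value: float, bit: int) -> float:
    """Return `value` as a float32 with one bit inverted (0 = LSB, 31 = sign)."""
    if not 0 <= bit <= 31:
        raise ValueError("bit must be in 0..31")
    # Reinterpret the float32 bytes as an unsigned 32-bit integer.
    (raw,) = struct.unpack("<I", struct.pack("<f", value))
    # Invert the chosen bit and reinterpret the result as a float32 again.
    (flipped,) = struct.unpack("<f", struct.pack("<I", raw ^ (1 << bit)))
    return flipped


weight = 0.5
mantissa_upset = flip_bit(weight, 0)   # differs from 0.5 by about 6e-8
exponent_upset = flip_bit(weight, 30)  # becomes 2**127, about 1.7e38
sign_upset = flip_bit(weight, 31)      # becomes -0.5
```

A corrupted weight of this magnitude dominates every activation it touches, which is how a single upset can compound through later layers.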
Our Approach
Key Capabilities
- Error detection — Checksums and redundancy catch corruption
- Bounded propagation — Prevent errors from cascading
- Selective re-execution — Only re-run affected computation
- Graceful degradation — Maintain service despite faults
- Transparency — Report what happened and confidence level
API Preview
Enable Resilience Features
Resilience Report
Every response includes a resilience report.
Configure Resilience Globally
Detection Methods
Layer Checksums
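As an illustration of layer-level checking, the sketch below combines two simple tests: a CRC32 checksum over a layer's weights, compared against a reference recorded at load time, and a range check on the layer's outputs. The function names and the calibrated bounds are assumptions made for this example and do not describe the actual Resilient Compute implementation.

```python
import math
import struct
import zlib
from typing import Sequence, Tuple


def weights_checksum(weights: Sequence[float]) -> int:
    """CRC32 over the float32 encoding of a layer's weights."""
    return zlib.crc32(struct.pack(f"<{len(weights)}f", *weights))


def outputs_in_range(outputs: Sequence[float], bounds: Tuple[float, float]) -> bool:
    """True if every output is finite and lies inside the calibrated bounds."""
    low, high = bounds
    return all(math.isfinite(x) and low <= x <= high for x in outputs)


weights = [0.5, -0.25, 1.0]
reference = weights_checksum(weights)  # recorded once, when weights are loaded

# Before running the layer, recompute the checksum and compare.
weights_intact = weights_checksum(weights) == reference

# After running the layer, check outputs against bounds from calibration data.
outputs_plausible = outputs_in_range([0.1, 0.9], bounds=(0.0, 1.0))
```

The checksum catches corruption in stored weights, while the range check catches upsets that occur during computation and push an activation outside its calibrated range. An upset that leaves the output within bounds is not detected by this sketch.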
Verify that layer outputs match expected ranges.
Redundant Computation
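A minimal sketch of redundant execution, assuming the computation is deterministic so that two fault-free runs produce identical results. The name `run_twice` is hypothetical. A mismatch only shows that one of the runs was disturbed; it does not say which one, so the caller still has to decide how to recover.

```python
from typing import Callable, List, Tuple


def run_twice(fn: Callable[[List[float]], List[float]],
              x: List[float]) -> Tuple[List[float], bool]:
    """Run `fn` on the same input twice; return (first result, results agree)."""
    first = fn(x)
    second = fn(x)
    # Exact comparison: deterministic code should reproduce its result bit for bit.
    # A NaN in either result compares unequal and is therefore flagged as well.
    return first, first == second


def double(values: List[float]) -> List[float]:
    return [v * 2 for v in values]


result, agreed = run_twice(double, [1.0, 2.0])
```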
Run critical computations twice and compare the results.
ECC Memory
Use error-correcting memory when available.
Recovery Strategies
Selective Re-execution
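One way to picture selective re-execution is a layer-by-layer loop that keeps the last verified activation and re-runs only the layer whose output fails its check. The sketch below assumes each layer is a plain callable paired with a check function; `run_with_selective_retry` and its parameters are hypothetical names, not the product API. It also assumes the cached input activation itself stays intact.

```python
from typing import Callable, List, Sequence, Tuple

Layer = Callable[[List[float]], List[float]]
Check = Callable[[List[float]], bool]


def run_with_selective_retry(layers: Sequence[Layer], checks: Sequence[Check],
                             x: List[float],
                             max_retries: int = 2) -> Tuple[List[float], int]:
    """Run layers in order, re-running only a layer whose output fails its check.

    Returns (final output, number of layer re-runs). Raises RuntimeError if a
    layer still fails after `max_retries` re-runs.
    """
    reruns = 0
    activation = x  # last verified activation
    for index, (layer, check) in enumerate(zip(layers, checks)):
        out = layer(activation)
        attempts = 0
        while not check(out):
            if attempts == max_retries:
                raise RuntimeError(f"layer {index} failed after {max_retries} retries")
            out = layer(activation)  # re-run this layer only, from verified input
            attempts += 1
            reruns += 1
        activation = out
    return activation, reruns
```

Earlier layers are never recomputed, so the cost of recovery is proportional to the affected layer rather than to the whole model.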
Re-run only the affected computation.
Full Restart
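A full restart can be sketched as a loop around the whole inference. The example assumes a hypothetical `run_once` callable that returns the output together with the number of errors detected during that run; the threshold and restart limit are illustrative values, not documented defaults.

```python
from typing import Callable, Tuple, TypeVar

T = TypeVar("T")


def infer_with_restart(run_once: Callable[[], Tuple[T, int]],
                       error_threshold: int = 3,
                       max_restarts: int = 2) -> Tuple[T, int]:
    """Re-run the entire inference while detected errors exceed the threshold.

    Returns (output, number of restarts). Raises RuntimeError if every attempt
    exceeds the threshold.
    """
    for restarts in range(max_restarts + 1):
        output, errors = run_once()
        if errors <= error_threshold:
            return output, restarts
    raise RuntimeError(f"error threshold exceeded in {max_restarts + 1} attempts")
```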
Re-run the entire inference if errors exceed a threshold.
Resilience Modes
| Mode | Overhead | Protection | Use Case |
|---|---|---|---|
| minimal | 5% | Basic detection | Low-criticality |
| standard | 15% | Full detection + correction | Default |
| high | 30% | Redundant execution | Safety-critical |
| paranoid | 100% | Triple redundancy | Life-critical |
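The table can also be read as a lookup: given a compute budget, pick the strongest mode whose overhead fits. The sketch below encodes the table's overhead figures as fractions; the helper `strongest_mode_within` is a hypothetical name, and treating overhead as extra compute relative to unprotected inference is an assumption.

```python
from typing import Optional

# Overhead per mode, taken from the table above, ordered weakest to strongest.
MODES = [
    ("minimal", 0.05),
    ("standard", 0.15),
    ("high", 0.30),
    ("paranoid", 1.00),
]


def strongest_mode_within(budget: float) -> Optional[str]:
    """Return the strongest mode whose overhead fits the budget, or None."""
    chosen = None
    for name, overhead in MODES:
        if overhead <= budget:
            chosen = name
    return chosen


mode = strongest_mode_within(0.20)  # "standard": 15% fits, 30% does not
```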
Monitoring Radiation Effects
Track SEU rates and their impact.
Best Practices
Balance overhead vs protection
Use standard mode for most workloads. Reserve high and paranoid for safety-critical applications.
Monitor confidence scores
Track confidence over time. Degrading confidence may indicate
increasing radiation or hardware issues.
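Tracking confidence over time can be as simple as comparing a rolling mean against a baseline. The sketch below is one possible approach, with a hypothetical `ConfidenceMonitor` class and illustrative window and drop values; it assumes confidence scores are floats between 0 and 1.

```python
from collections import deque
from statistics import fmean
from typing import Deque, Optional


class ConfidenceMonitor:
    """Flag a sustained drop in confidence relative to an initial baseline."""

    def __init__(self, window: int = 5, max_drop: float = 0.05) -> None:
        self.window = window
        self.max_drop = max_drop
        self.baseline: Optional[float] = None  # mean of the first full window
        self.recent: Deque[float] = deque(maxlen=window)

    def record(self, confidence: float) -> bool:
        """Add a score; return True if the rolling mean fell too far below baseline."""
        self.recent.append(confidence)
        if len(self.recent) < self.window:
            return False  # not enough data yet
        current = fmean(self.recent)
        if self.baseline is None:
            self.baseline = current
            return False
        return self.baseline - current > self.max_drop


monitor = ConfidenceMonitor(window=3, max_drop=0.05)
flags = [monitor.record(c) for c in [0.99, 0.99, 0.99, 0.98, 0.80, 0.80]]
```

Averaging over a window keeps a single low score from raising an alert, while a sustained decline still shows up within a few inferences.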
Plan for worst case
During solar events, radiation spikes. Consider pausing
non-critical workloads during high-flux periods.
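A scheduling policy for high-flux periods might look like the sketch below. The `Workload` type, the flux reading, and the limit are all hypothetical; the flux value is assumed to come from an external space-weather feed or onboard sensor, in whatever unit that source reports.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass
class Workload:
    name: str
    critical: bool


def workloads_to_run(workloads: Sequence[Workload], flux: float,
                     flux_limit: float) -> List[Workload]:
    """Run everything under normal flux; keep only critical work above the limit."""
    if flux <= flux_limit:
        return list(workloads)
    return [w for w in workloads if w.critical]


queue = [Workload("attitude-control", critical=True),
         Workload("image-tagging", critical=False)]
during_storm = workloads_to_run(queue, flux=500.0, flux_limit=100.0)
```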