Resilient Compute

Coming Q2 2026 — This is a design preview. Request early access to be notified when available.

Overview

Resilient Compute provides fault-tolerant ML execution for radiation environments. Space radiation causes Single Event Upsets (SEUs) that flip bits in memory and computation. Instead of ignoring this reality, we build detection and recovery directly into the inference pipeline.

The Problem

In low Earth orbit (LEO), a typical compute system experiences:
  • ~10-100 SEUs per day in unshielded memory
  • Silent data corruption in weights and activations
  • Accumulated errors that compound through layers
  • Unpredictable failures in traditional ML pipelines
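To make the failure mode concrete, here is a small illustrative sketch (not part of the RotaStellar API) of what a single SEU can do to a model weight: flipping one high-order bit of an IEEE-754 float32 turns an ordinary value into an astronomically large one.

```python
# Illustrative only: flip a single bit in a float32 to simulate an SEU.
import struct

def flip_bit(value: float, bit: int) -> float:
    """Return `value` with one bit of its IEEE-754 float32 encoding flipped."""
    (bits,) = struct.unpack("<I", struct.pack("<f", value))
    bits ^= 1 << bit
    return struct.unpack("<f", struct.pack("<I", bits))[0]

w = 0.5
corrupted = flip_bit(w, 30)  # flip a high exponent bit
print(w, "->", corrupted)    # the corrupted weight is on the order of 1e38
```

A flip in the exponent field, as above, is catastrophic; a flip in a low mantissa bit may be invisible. That asymmetry is why detection has to be statistical rather than assuming every corruption is obvious.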

Our Approach

┌─────────────────────────────────────────────────────┐
│              Resilient Compute Pipeline             │
│                                                     │
│  ┌───────────┐   ┌───────────┐   ┌────────────┐     │
│  │   Input   │──▶│   Layer   │──▶│   Output   │     │
│  │ Checksums │   │ Execution │   │ Validation │     │
│  └───────────┘   └─────┬─────┘   └────────────┘     │
│                        │                            │
│                        ▼                            │
│                  ┌───────────┐                      │
│                  │   Error   │                      │
│                  │ Detection │                      │
│                  └─────┬─────┘                      │
│                        │                            │
│             ┌──────────┴──────────┐                 │
│             ▼                     ▼                 │
│       ┌───────────┐         ┌───────────┐           │
│       │  Bounded  │         │ Selective │           │
│       │ Error Prop│         │ Re-execute│           │
│       └───────────┘         └───────────┘           │
└─────────────────────────────────────────────────────┘

Key Capabilities

  • Error detection — Checksums and redundancy catch corruption
  • Bounded propagation — Prevent errors from cascading
  • Selective re-execution — Only re-run affected computation
  • Graceful degradation — Maintain service despite faults
  • Transparency — Report what happened and confidence level
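As a conceptual sketch of bounded propagation (the names and ranges here are illustrative, not the runtime's internals): clamping each activation to a calibrated range caps how far one corrupted value can push downstream layers.

```python
# Illustrative bounded-propagation guard: clamp activations to a calibrated
# range so a single corrupted value cannot cascade through later layers.
# The [-10, 10] bounds are an assumption for the example, not a real default.
def bound_activations(values, lo=-10.0, hi=10.0):
    return [min(max(v, lo), hi) for v in values]

print(bound_activations([0.5, 1e38, -1e9]))  # -> [0.5, 10.0, -10.0]
```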

API Preview

Enable Resilience Features

from rotastellar import RotaStellar

client = RotaStellar(api_key="rs_...")

result = client.runtime.generate(
    model="llama-70b",
    prompt="Critical analysis of...",
    resilience={
        "checksum_layers": True,        # Verify layer outputs
        "redundant_attention": True,    # Duplicate attention computation
        "max_reexecute": 3,             # Max re-executions per layer
        "error_threshold": 0.01         # Max acceptable error rate
    }
)

print(f"Response: {result.text}")
print(f"Errors detected: {result.resilience.errors_detected}")
print(f"Errors corrected: {result.resilience.errors_corrected}")
print(f"Confidence: {result.resilience.confidence}")

Resilience Report

Every response includes a resilience report:
{
  "text": "The analysis shows...",
  "resilience": {
    "errors_detected": 2,
    "errors_corrected": 2,
    "reexecutions": 1,
    "confidence": 0.998,
    "layers_validated": 80,
    "checksum_failures": 0,
    "details": [
      {
        "layer": 42,
        "type": "activation_corruption",
        "action": "reexecuted",
        "resolved": true
      },
      {
        "layer": 67,
        "type": "weight_bitflip",
        "action": "corrected_ecc",
        "resolved": true
      }
    ]
  }
}
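On the client side, the report can be inspected programmatically. This helper is illustrative, not part of the SDK; the field names follow the example payload above (abbreviated here for brevity).

```python
# Sketch of consuming the resilience report client-side. Field names mirror
# the example payload; `unresolved_faults` is a hypothetical helper.
import json

PAYLOAD = '''{
  "resilience": {
    "errors_detected": 2,
    "errors_corrected": 2,
    "confidence": 0.998,
    "details": [
      {"layer": 42, "type": "activation_corruption", "action": "reexecuted", "resolved": true},
      {"layer": 67, "type": "weight_bitflip", "action": "corrected_ecc", "resolved": true}
    ]
  }
}'''

def unresolved_faults(report: dict) -> list:
    """Return detail entries whose corruption was not resolved."""
    return [d for d in report["resilience"]["details"] if not d["resolved"]]

report = json.loads(PAYLOAD)
print(unresolved_faults(report))  # -> [] (both faults above were resolved)
```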

Configure Resilience Globally

client.runtime.configure(
    resilience={
        # Detection methods
        "checksum_layers": True,           # Checksum layer outputs
        "redundant_attention": True,       # Duplicate attention
        "weight_verification": "periodic", # or "continuous", "none"

        # Recovery behavior
        "max_reexecute": 3,               # Max retries per layer
        "reexecute_strategy": "selective", # or "full_layer"
        "fallback_precision": "fp32",      # Higher precision for re-exec

        # Error bounds
        "error_threshold": 0.01,           # Max error rate before fail
        "confidence_floor": 0.95           # Min acceptable confidence
    }
)

Detection Methods

Layer Checksums

Verify layer outputs against lightweight checksums:
result = client.runtime.generate(
    model="llama-70b",
    prompt="...",
    resilience={
        "checksum_layers": True,
        "checksum_granularity": "per_head"  # or "per_layer"
    }
)
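Conceptually, a layer checksum compares the output against a cheap invariant and re-executes on mismatch. The sketch below is not the service's actual algorithm; it uses a Freivalds-style identity for a matrix-vector product: the sum of `W @ x` must equal the column sums of `W` dotted with `x`, which costs far less than redoing the multiply.

```python
# Conceptual sketch only, assuming a simple matrix-vector "layer".
def matvec(W, x):
    return [sum(w * xj for w, xj in zip(row, x)) for row in W]

def checked_matvec(W, x, max_reexecute=3, tol=1e-9):
    # Invariant: sum(W @ x) == (column sums of W) . x
    expected = sum(sum(col) * xj for col, xj in zip(zip(*W), x))
    for attempt in range(max_reexecute + 1):
        y = matvec(W, x)
        if abs(sum(y) - expected) <= tol:
            return y, attempt  # output plus how many re-executions it took
    raise RuntimeError("checksum failure persisted after re-execution")

y, attempts = checked_matvec([[1, 2], [3, 4]], [1, 1])
print(y, attempts)  # -> [3, 7] 0
```

A transient flip during the multiply changes `sum(y)` but not the precomputed invariant, so recomputation catches it; the check detects corruption without a full redundant matmul.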

Redundant Computation

Run critical computations twice:
result = client.runtime.generate(
    model="llama-70b",
    prompt="...",
    resilience={
        "redundant_attention": True,  # Duplicate attention
        "redundant_ffn": False        # Don't duplicate FFN (too expensive)
    }
)
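The assumed mechanism behind redundant_attention is dual-modular redundancy: run the same computation twice and treat any disagreement as a probable transient fault. A minimal sketch (illustrative names, not the runtime's internals):

```python
# Dual-modular redundancy sketch: two executions must agree within `tol`.
def redundant_exec(fn, *args, tol=0.0, retries=3):
    for _ in range(retries):
        a, b = fn(*args), fn(*args)
        if abs(a - b) <= tol:
            return a  # agreement: accept the result
    raise RuntimeError("redundant executions kept diverging")

print(redundant_exec(lambda x: x * 2, 3))  # -> 6
```

Doubling attention roughly doubles its cost, which is why the example above leaves the much larger FFN unduplicated.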

ECC Memory

Use error-correcting memory when available:
result = client.runtime.generate(
    model="llama-70b",
    prompt="...",
    resilience={
        "require_ecc": True  # Fail if ECC not available
    }
)

Recovery Strategies

Selective Re-execution

Only re-run affected computation:
# If layer 42 has a checksum failure, only re-run layer 42
result = client.runtime.generate(
    model="llama-70b",
    prompt="...",
    resilience={
        "reexecute_strategy": "selective",
        "max_reexecute": 3
    }
)
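The selective strategy can be sketched as a loop over layers in which only the failing layer is re-run and upstream results are reused. The names below are illustrative, not the runtime's internals:

```python
# Selective re-execution sketch: re-run only the layer whose check fails.
def run_pipeline(layers, x, check, max_reexecute=3):
    for i, layer in enumerate(layers):
        for _ in range(max_reexecute + 1):
            y = layer(x)
            if check(i, x, y):
                break  # layer output validated; move on
        else:
            raise RuntimeError(f"layer {i} still failing after {max_reexecute} re-executions")
        x = y
    return x

layers = [lambda v: v + 1, lambda v: v * 2]
print(run_pipeline(layers, 1, lambda i, x, y: True))  # -> 4
```

Because inputs to earlier layers are never discarded mid-pass, a fault at layer 42 costs one extra layer execution rather than a full 80-layer forward pass.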

Full Restart

Re-run entire inference if errors exceed threshold:
result = client.runtime.generate(
    model="llama-70b",
    prompt="...",
    resilience={
        "reexecute_strategy": "full",
        "error_threshold": 0.05  # Restart if >5% layers affected
    }
)

Resilience Modes

Mode       Overhead   Protection                    Use Case
minimal    5%         Basic detection               Low-criticality
standard   15%        Full detection + correction   Default
high       30%        Redundant execution           Safety-critical
paranoid   100%       Triple redundancy             Life-critical
result = client.runtime.generate(
    model="llama-70b",
    prompt="...",
    resilience={"mode": "high"}
)

Monitoring Radiation Effects

Track SEU rates and their impact:
# Get radiation statistics
stats = client.runtime.radiation_stats(
    node="orbital-leo-1",
    period="24h"
)

print(f"SEUs detected: {stats.seus_detected}")
print(f"SEUs corrected: {stats.seus_corrected}")
print(f"Inference impact: {stats.inference_impact_percent}%")
print(f"Current flux: {stats.current_flux}")

Best Practices

  • Use standard mode for most workloads; reserve high and paranoid for safety-critical applications.
  • Track confidence over time: degrading confidence can indicate rising radiation levels or hardware problems.
  • Radiation flux spikes during solar events; consider pausing non-critical workloads during high-flux periods.
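Tracking confidence over time can be as simple as a rolling window over result.resilience.confidence values. This helper is hypothetical client-side code, not part of the SDK:

```python
# Hypothetical rolling-window monitor for per-request confidence values.
from collections import deque

class ConfidenceTracker:
    def __init__(self, window=100, floor=0.95):
        self.values = deque(maxlen=window)  # keep only the last `window` values
        self.floor = floor

    def record(self, confidence):
        self.values.append(confidence)

    @property
    def degraded(self):
        # Degraded once the rolling mean drops below the floor.
        return bool(self.values) and sum(self.values) / len(self.values) < self.floor
```

Feeding each result.resilience.confidence into record() and alerting on degraded gives an early signal of rising flux or failing hardware before individual requests start breaching confidence_floor.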

Next Steps