Model Partitioning

Coming Q1 2026 — This feature is in development. Request early access to be notified when available.

Overview

Large neural networks can be split across ground and orbital nodes to optimize for latency, bandwidth, or energy efficiency. The Model Partitioning system finds optimal cut points based on your infrastructure topology.

Key Components

PartitionOptimizer: Finds optimal model split points
ModelProfile: Analyzes model layer characteristics
LayerPlacement: Specifies ground vs. orbital assignment
LatencyEstimator: Predicts end-to-end inference latency

Why Partition Models?

| Scenario | Benefit |
| --- | --- |
| Large models, limited orbital memory | Run embedding layers on ground, attention in orbit |
| Latency-sensitive inference | Place early layers close to the data source |
| Energy optimization | Compute-heavy layers on solar-powered orbital nodes |
| Bandwidth constraints | Minimize activation transfer between nodes |

How It Works

1. Input Processing (Ground): Input tokens are received at a ground node, where the embedding layer (150M params) and layers 0-10 (2.8B params) compute the initial representation.
2. Activation Transfer (Uplink): Compressed activations (12 MB) are transmitted to the orbital node over the ground-to-space link.
3. Core Computation (Orbital): Layers 11-60 (35B params), the most compute-intensive portion of the model, run on the solar-powered orbital node.
4. Activation Transfer (Downlink): Output activations (12 MB) are transmitted back to a ground node.
5. Output Generation (Ground): Layers 61-80 (14B params) and the output head generate the final output tokens.

The partition optimizer automatically finds cut points that minimize total latency while respecting the memory constraints of each node type.
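
To build intuition for what the optimizer trades off, here is a deliberately simplified cost model for a single cut point. This is a sketch, not the optimizer's actual algorithm; it assumes latency decomposes into compute, transfer, and propagation terms drawn from the topology dict described below:

def cut_latency_ms(ground_flops_used, orbital_flops_used,
                   activation_bytes, topo):
    """Simplified cost of one cut: compute on each side, one uplink
    and one downlink transfer, plus round-trip propagation delay."""
    ground_ms = ground_flops_used / topo["ground_flops"] * 1000
    orbital_ms = orbital_flops_used / topo["orbital_flops"] * 1000
    uplink_ms = activation_bytes * 8 / topo["uplink_bandwidth"] * 1000
    downlink_ms = activation_bytes * 8 / topo["downlink_bandwidth"] * 1000
    propagation_ms = 2 * topo["ground_orbit_latency_ms"]
    return ground_ms + orbital_ms + uplink_ms + downlink_ms + propagation_ms

In practice the optimizer searches over cut points to minimize a cost of roughly this shape, subject to the per-node memory constraints mentioned above.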

Model Profile

First, analyze your model to understand layer characteristics:
from rotastellar_distributed import ModelProfile

# From PyTorch model
profile = ModelProfile.from_pytorch(model)

# From TensorFlow/Keras
profile = ModelProfile.from_tensorflow(model)

# From ONNX file
profile = ModelProfile.from_onnx("model.onnx")

# Inspect profile
print(f"Total parameters: {profile.total_params:,}")
print(f"Total layers: {profile.num_layers}")
print(f"Memory footprint: {profile.memory_mb:.1f} MB")

# Per-layer analysis
for layer in profile.layers:
    print(f"{layer.name}: {layer.params:,} params, "
          f"{layer.flops:,} FLOPs, "
          f"{layer.activation_size_mb:.1f} MB activations")

Partition Optimizer

Find optimal cut points based on your topology:
from rotastellar_distributed import PartitionOptimizer, ModelProfile

# Define your infrastructure
topology = {
    "ground_nodes": 2,
    "orbital_nodes": 4,
    "ground_flops": 100e12,       # 100 TFLOPS per ground node
    "orbital_flops": 20e12,       # 20 TFLOPS per orbital node
    "uplink_bandwidth": 100e6,    # 100 Mbps ground→orbit
    "downlink_bandwidth": 500e6,  # 500 Mbps orbit→ground
    "isl_bandwidth": 10e9,        # 10 Gbps inter-satellite
    "ground_orbit_latency_ms": 25 # LEO latency
}

# Profile your model
profile = ModelProfile.from_pytorch(model)

# Find optimal partition
optimizer = PartitionOptimizer(api_key="rs_...")
partition = optimizer.optimize(
    model=profile,
    topology=topology,
    objective="minimize_latency"  # or "minimize_bandwidth", "balance"
)

# View results
print(f"Optimal cut points: {partition.cut_points}")
print(f"Ground layers: {partition.ground_layers}")
print(f"Orbital layers: {partition.orbital_layers}")
print(f"Estimated latency: {partition.estimated_latency_ms:.1f} ms")
print(f"Activation transfer: {partition.transfer_size_mb:.1f} MB")

Optimization Objectives

| Objective | Optimizes For | Best When |
| --- | --- | --- |
| minimize_latency | End-to-end inference time | Real-time applications |
| minimize_bandwidth | Data transfer between nodes | Limited connectivity |
| minimize_energy | Total energy consumption | Battery/solar constraints |
| balance | Weighted combination | General purpose |
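
How the balance weighting is configured has not been finalized. The sketch below assumes a hypothetical weights argument purely for illustration; do not treat it as confirmed API:

# HYPOTHETICAL: the `weights` argument is an assumption, not confirmed API
partition = optimizer.optimize(
    model=profile,
    topology=topology,
    objective="balance",
    weights={"latency": 0.5, "bandwidth": 0.3, "energy": 0.2},
)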

Layer Placement

Manually specify or adjust layer placement:
from rotastellar_distributed import LayerPlacement

# Manual placement
placement = LayerPlacement()
placement.assign_ground(layers=[0, 1, 2, 3, 4])      # First 5 layers
placement.assign_orbital(layers=range(5, 75))         # Middle layers
placement.assign_ground(layers=[75, 76, 77, 78, 79])  # Last 5 layers

# Validate placement
validation = placement.validate(profile, topology)
if not validation.is_valid:
    print(f"Issues: {validation.issues}")

# Or refine optimizer result
partition = optimizer.optimize(model=profile, topology=topology)
partition.move_layer(15, to="ground")  # Manual adjustment
partition.recalculate()

Latency Estimation

Predict inference latency for a given partition:
from rotastellar_distributed import LatencyEstimator

estimator = LatencyEstimator(topology=topology)

# Estimate for a partition
estimate = estimator.estimate(partition)

print(f"Total latency: {estimate.total_ms:.1f} ms")
print(f"  Ground compute: {estimate.ground_compute_ms:.1f} ms")
print(f"  Orbital compute: {estimate.orbital_compute_ms:.1f} ms")
print(f"  Uplink transfer: {estimate.uplink_ms:.1f} ms")
print(f"  Downlink transfer: {estimate.downlink_ms:.1f} ms")
print(f"  Propagation: {estimate.propagation_ms:.1f} ms")

# Breakdown by layer
for layer_est in estimate.by_layer:
    print(f"  {layer_est.name}: {layer_est.total_ms:.1f} ms on {layer_est.node}")

Example: LLaMA-70B Partitioning

from rotastellar_distributed import PartitionOptimizer, ModelProfile

# LLaMA-70B architecture
profile = ModelProfile.from_pytorch(llama_70b)
# 80 transformer layers, ~70B parameters

topology = {
    "ground_nodes": 3,
    "orbital_nodes": 5,
    "ground_flops": 200e12,      # A100 equivalent
    "orbital_flops": 50e12,      # Space-qualified GPU
    "uplink_bandwidth": 200e6,
    "downlink_bandwidth": 1e9,
    "isl_bandwidth": 25e9
}

optimizer = PartitionOptimizer(api_key="rs_...")
partition = optimizer.optimize(
    model=profile,
    topology=topology,
    objective="minimize_latency"
)

# Result for LLaMA-70B:
# - Layers 0-8: Ground (embeddings + early attention)
# - Layers 9-72: Orbital (bulk computation)
# - Layers 73-79 + head: Ground (final layers)
# - Activation transfer: 24 MB per inference
# - Estimated latency: 180 ms (vs 400 ms all-ground)

Next Steps