VLA Documentation

Complete API reference for SimGen VLA

Installation

pip install simgen-vla

Requirements: Python 3.10+, PyTorch 2.0+ with CUDA, NVIDIA GPU (Pascal through Hopper: sm_60 to sm_90)

VLADecimal - GPU-Native Extended Precision

NEW in v3.5.0

VLADecimal is a GPU-native extended precision type (106+ bit mantissa) that keeps ALL operations on GPU. No CPU conversions until you explicitly request a Python Decimal.

import torch
from simgen import vla

# Create GPU-native extended precision tensors
x = vla.Decimal(torch.randn(1000, device='cuda'))
y = vla.Decimal(torch.randn(1000, device='cuda'))

# All operations stay on GPU with full precision
result = (x + y * 2).sum()

# Display exact value (converts to Python Decimal only for display)
print(result)  # VLADecimal(-12.34567890123456789...)

# Explicit conversion when you need Python Decimal
exact_value = result.to_decimal()  # decimal.Decimal object

# Convert back to torch.Tensor when done
tensor = result.to_torch()  # float64 tensor

Features

  • 82 methods - Full arithmetic, reductions, linear algebra, trig, and more
  • GPU-native - All operations stay on GPU until you explicitly convert
  • Chainable - (x + y).sum().sqrt() preserves precision throughout
  • Indexing - x[0], x[1:10] all return VLADecimal
  • Shape ops - reshape, view, squeeze, transpose, flatten, etc.

Factory Functions

# Create from exact fractions (TRUE zero representation error)
x = vla.Decimal_frac(1, 3)  # Exact 1/3

# Create zeros/ones/random
zeros = vla.Decimal_zeros((100, 100))
ones = vla.Decimal_ones((100, 100))
randn = vla.Decimal_randn((100, 100))

# Concatenate/stack VLADecimal tensors
combined = vla.Decimal_cat([x, y], dim=0)
stacked = vla.Decimal_stack([x, y], dim=0)

TRUE ZERO Exact Arithmetic

NEW in v3.5.3 - GPU ModularTensor

TRUE ZERO error arithmetic on GPU using proprietary exact computation. Not a ~1e-15 approximation - mathematically ZERO error for all operations.

GPU: ModularTensor (CUDA accelerated)

from simgen import vla

# GPU exact arithmetic - TRUE ZERO error
a = vla.ModularTensor.from_fraction(1, 3, shape=(10000,), device='cuda')
b = vla.ModularTensor.from_fraction(1, 6, shape=(10000,), device='cuda')
c = a + b  # Exactly 1/2 for all 10,000 elements

# Exact equality check
expected = vla.ModularTensor.from_fraction(1, 2, shape=(10000,), device='cuda')
print((c == expected).all())  # True - TRUE ZERO!

# 100K iterations - still TRUE ZERO
acc = vla.ModularTensor.from_int(0, shape=(1,), device='cuda')
delta = vla.ModularTensor.from_fraction(1, 100000, shape=(1,), device='cuda')
for _ in range(100000):
    acc = acc + delta
print((acc == vla.ModularTensor.from_int(1, shape=(1,), device='cuda')).all())  # True!

CPU: ModularRational

from simgen import vla

# CPU exact arithmetic
a = vla.ModularRational.from_fraction(1, 3)  # Exactly 1/3
b = vla.ModularRational.from_fraction(1, 6)  # Exactly 1/6
c = a + b  # Exactly 1/2

print(c == vla.ModularRational.from_fraction(1, 2))  # True - EXACT!
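The "Modular" in these type names suggests residue arithmetic. As a purely illustrative CPU sketch (the classical number-theoretic idea, not SimGen's actual implementation), a fraction can be carried exactly as a residue modulo a large prime, so 1/3 + 1/6 lands on exactly 1/2:

```python
# Illustrative sketch only: exact fraction arithmetic via residues mod a prime.
P = 2**61 - 1  # a Mersenne prime

def frac_mod(num: int, den: int) -> int:
    # Encode num/den as num * den^(-1) mod P, using Fermat's little
    # theorem for the modular inverse.
    return num * pow(den, P - 2, P) % P

a = frac_mod(1, 3)
b = frac_mod(1, 6)
print((a + b) % P == frac_mod(1, 2))  # True - exactly 1/2, zero error
```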

Why TRUE ZERO?

  • TRUE ZERO error - Not ~1e-15, but mathematically ZERO
  • GPU accelerated - 444M ops/sec on RTX 4070
  • Constant memory - Fixed memory per value regardless of operation count
  • Chaotic systems - Lorenz 50,000 steps with exact reversibility
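The 100K-iteration accumulation test above can be reproduced on CPU with the stdlib fractions module, which is also exact (though, unlike ModularTensor's fixed-size representation, a Fraction's numerator and denominator can grow with operation count):

```python
from fractions import Fraction

# CPU analogue of the GPU accumulation test: 100,000 additions of
# exactly 1/100000 must sum to exactly 1, with zero drift.
acc = Fraction(0)
delta = Fraction(1, 100_000)
for _ in range(100_000):
    acc += delta
print(acc == 1)  # True
```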

Vectors and Matrices

# Exact vector operations
vec_a = vla.ModularVector.from_fractions([(1,2), (1,3), (1,4)])
vec_b = vla.ModularVector.from_fractions([(2,1), (3,1), (4,1)])
dot = vec_a.dot(vec_b)  # Exactly 3

# Exact matrix operations
A = vla.ModularMatrix.from_ints([[1,1],[1,0]])  # Fibonacci matrix
F10 = A.power(10)  # A^10[0,0] = F(11) = 89 - EXACT!

# Matrix multiply
I = vla.ModularMatrix.identity(2)
B = A * I  # A @ I = A exactly
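The Fibonacci identity used above can be verified in pure Python, where integer arithmetic is already exact: [[1,1],[1,0]]^n = [[F(n+1), F(n)], [F(n), F(n-1)]].

```python
# Pure-Python check of the Fibonacci matrix identity.
def matmul2(X, Y):
    # 2x2 integer matrix product.
    return [[sum(X[i][k] * Y[k][j] for k in range(2)) for j in range(2)]
            for i in range(2)]

A = [[1, 1], [1, 0]]
M = [[1, 0], [0, 1]]  # identity
for _ in range(10):
    M = matmul2(M, A)
print(M)  # [[89, 55], [55, 34]] - F(11), F(10), F(9)
```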

Chaotic Systems (Lorenz)

# Lorenz attractor with EXACT time reversibility
from simgen import vla

def lorenz_derivs(x, y, z):
    sigma = vla.ModularRational.from_int(10)
    rho = vla.ModularRational.from_int(28)
    beta = vla.ModularRational.from_fraction(8, 3)
    return sigma*(y-x), x*(rho-z)-y, x*y - beta*z

dt = vla.ModularRational.from_fraction(1, 1000)
x, y, z = [vla.ModularRational.from_int(1)]*3
stored = []

# Forward 5000 steps
for _ in range(5000):
    dx, dy, dz = lorenz_derivs(x, y, z)
    stored.append((dx, dy, dz))
    x, y, z = x + dt*dx, y + dt*dy, z + dt*dz

# Backward 5000 steps (using stored derivatives)
for dx, dy, dz in reversed(stored):
    x, y, z = x - dt*dx, y - dt*dy, z - dt*dz

# TRUE ZERO - back to start!
print(x == vla.ModularRational.from_int(1))  # True

When to use each precision level

  • VLADecimal (GPU): Production simulations where ~1e-15 error is acceptable
  • ModularTensor (GPU): TRUE ZERO on GPU - financial, cryptographic, verification
  • ModularRational (CPU): TRUE ZERO scalar arithmetic

Quick Start

import torch
from simgen import vla

# Create test data
x = torch.randn(10000, device='cuda')

# Exact sum with zero accumulation error
result = vla.sum(x)

# Get cross-GPU checksum (SAME on any GPU!)
checksum = vla.checksum(result)
print(f"Checksum: {checksum}")  # e.g., "6ece6956f187064f"

The Killer Feature

VLA checksums are BIT-IDENTICAL across all NVIDIA GPUs. Run the same computation on RTX 4070, Tesla T4, or A100 - get the exact same checksum.

Understanding VLA's Guarantee

What VLA Guarantees: ZERO ACCUMULATION Error

VLA eliminates accumulation error - the errors that compound when performing arithmetic operations. Every +, -, *, / is mathematically exact:

  • Order independence: (a + b) + c = a + (b + c) always (impossible with IEEE 754)
  • Cross-GPU reproducibility: Same computation = identical result on any GPU
  • No error growth: Million-step simulations don't accumulate drift
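The order-independence point is easy to see in plain float64, where associativity fails for mixed magnitudes:

```python
# IEEE 754 addition is not associative: grouping changes the answer.
a, b, c = 1e20, -1e20, 1.0
left = (a + b) + c   # 0.0 + 1.0 = 1.0
right = a + (b + c)  # -1e20 + 1.0 rounds to -1e20, so a + that = 0.0
print(left, right)   # 1.0 0.0
```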

What VLA Cannot Fix: INPUT REPRESENTATION Error

VLA cannot fix errors that exist before it sees your data. When you write 0.001 in Python, it's already corrupted:

# 0.001 is NOT exactly representable in binary
# Python stores it as 0.001000000000000000020816681711721685228...
x = torch.tensor([0.001], device='cuda')
# The error already exists BEFORE VLA sees this tensor
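You can inspect the exact stored value with the stdlib decimal module, which converts a float to the precise binary value it holds:

```python
from decimal import Decimal

# Decimal(float) reveals the exact binary value behind the literal.
print(Decimal(0.001))  # 0.001000000000000000020816... (not 0.001)
print(Decimal(0.125))  # 0.125 exactly (1/8 = 2^-3)
```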

Binary-Exact vs Non-Binary-Exact Values

import torch
from simgen import vla

# TEST 1: Non-binary-exact input (0.001)
# 0.001 requires infinite bits - stored as approximation
increment = 0.001
x_vla = torch.tensor([0.0], device='cuda')
for _ in range(100_000):
    x_vla = vla.add(x_vla, torch.tensor([increment], device='cuda'))
# Result: 100.00000000133288 (input representation error)

# TEST 2: Binary-exact input (0.125 = 1/8 = 2^-3)
# This is EXACTLY representable in binary!
increment = 0.125
x_vla = torch.tensor([0.0], device='cuda')
for _ in range(100_000):
    x_vla = vla.add(x_vla, torch.tensor([increment], device='cuda'))
# Result: 12500.0 EXACTLY (TRUE ZERO error)

Binary-Exact Values (TRUE Zero Error)

| Value  | Binary             | Exact? |
|--------|--------------------|--------|
| 0.5    | 2^-1               | YES    |
| 0.25   | 2^-2               | YES    |
| 0.125  | 2^-3               | YES    |
| 0.0625 | 2^-4               | YES    |
| 0.001  | Infinite expansion | NO     |
| 0.1    | Infinite expansion | NO     |
| 0.3    | Infinite expansion | NO     |
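A stdlib sketch of this check (is_exact_literal is a hypothetical name, not SimGen's API): a decimal literal is binary-exact precisely when parsing it as a Decimal gives the same value the float actually stores.

```python
from decimal import Decimal

def is_exact_literal(x: float) -> bool:
    # True when the decimal literal repr(x) denotes exactly the binary
    # value stored in the float.
    return Decimal(repr(x)) == Decimal(x)

print(is_exact_literal(0.125))  # True
print(is_exact_literal(0.001))  # False
```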

The Key Insight

IEEE 754 with 0.001:  Input error + Accumulation error = Large error
VLA with 0.001:       Input error + ZERO              = Input error only
VLA with 0.125:       ZERO        + ZERO              = TRUE ZERO

VLA guarantees your arithmetic is perfect. If you want perfect results, also ensure your inputs are perfectly representable.

Input Precision Utilities

VLA provides utilities to help you work with binary-exact inputs for TRUE zero error:

| Function | Description | Example |
|----------|-------------|---------|
| vla.is_exact(x) | Check if float is binary-exact | vla.is_exact(0.125) → True |
| vla.to_exact(x) | Snap to nearest binary-exact value | vla.to_exact(0.001) → 0.0009765625 |
| vla.frac(n, d) | Create exact fraction tensor | vla.frac(1, 1024) |
| vla.dyadic(x) | Find closest p/2^q rational | vla.dyadic(0.001) |
| x.to_decimal() | Convert VLADecimal to exact Python Decimal | vla.Decimal(t).to_decimal() |

Example: Using Exact Fractions

import torch
from simgen import vla

# Instead of 0.001 (NOT exact), use vla.frac():
dt = vla.frac(1, 1024)  # Exact 0.0009765625

# Check if a value is exact
vla.is_exact(0.125)  # True (1/8 = 2^-3)
vla.is_exact(0.001)  # False (infinite binary expansion)

# See exact stored value with VLADecimal
x = vla.Decimal(torch.tensor([0.001], device='cuda'))
print(x.to_decimal())  # Shows exact stored representation

# Find the best dyadic approximation
num, denom, exact, error = vla.dyadic(0.001)
# (1, 1024, 0.0009765625, 0.0000234375)
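The dyadic approximation above can be sketched in a few lines of pure Python (dyadic_approx is an illustrative name, not SimGen's implementation): round to the nearest multiple of 2^-q.

```python
# Hypothetical sketch of a p/2^q dyadic approximation.
def dyadic_approx(x: float, q: int = 10):
    num = round(x * 2**q)       # nearest numerator for denominator 2^q
    exact = num / 2**q          # binary-exact dyadic value
    return num, 2**q, exact, abs(x - exact)

print(dyadic_approx(0.001))  # numerator 1, denominator 1024, 0.0009765625
```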

Pro Tip: Powers of 2

Fractions with power-of-2 denominators (2, 4, 8, 16, 32, 64, 128, 256, 512, 1024, ...) are always exactly representable. Use vla.frac(1, 1024) instead of 0.001 for TRUE zero error.
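As a sanity check that is independent of VLA, even plain float64 accumulates with zero error when the increment is dyadic, because every partial sum is itself representable; a non-dyadic increment drifts immediately:

```python
# Dyadic increment (2^-10): 100,000 additions land exactly on 97.65625.
acc_dyadic = 0.0
for _ in range(100_000):
    acc_dyadic += 1.0 / 1024.0

# Non-dyadic increment (0.1): ten additions already miss 1.0.
acc_tenth = 0.0
for _ in range(10):
    acc_tenth += 0.1

print(acc_dyadic)  # 97.65625 exactly
print(acc_tenth)   # 0.9999999999999999
```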

Global Enable Mode

Patch ALL PyTorch operations with one line:

import torch
from simgen import vla

vla.enable()  # Now ALL torch ops use VLA!

# These now use exact arithmetic automatically:
torch.sum(x)       # Uses VLA internally
torch.matmul(A, B) # Uses VLA internally
model(input)       # Entire model uses VLA!

vla.disable()  # Restore standard PyTorch ops

Context Manager

with vla.mode():
    # All operations in this block use VLA
    result = torch.sum(x)
    output = model(input)
# Back to standard PyTorch outside the block
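Enable/disable and mode() style switching is commonly implemented by swapping function references on the target module and restoring them afterwards. A generic sketch (illustrative only, not SimGen's actual code), demonstrated on the stdlib math module:

```python
import math
from contextlib import contextmanager

@contextmanager
def patched(module, name, replacement):
    # Swap module.name for the duration of the block, then restore.
    original = getattr(module, name)
    setattr(module, name, replacement)
    try:
        yield
    finally:
        setattr(module, name, original)  # always restore, even on error

with patched(math, "sqrt", lambda x: -1.0):
    print(math.sqrt(4.0))  # -1.0 - patched inside the block
print(math.sqrt(4.0))      # 2.0 - restored outside
```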

Core Reductions

| Function | Description | Example |
|----------|-------------|---------|
| vla.sum(x) | Exact sum with zero accumulation error | vla.sum(tensor) |
| vla.mean(x) | Exact mean | vla.mean(tensor) |
| vla.var(x) | Exact variance | vla.var(tensor) |
| vla.std(x) | Exact standard deviation | vla.std(tensor) |
| vla.norm(x, p=2) | Exact Lp norm | vla.norm(tensor) |
| vla.prod(x) | Exact product | vla.prod(tensor) |
| vla.cumsum(x) | Exact cumulative sum | vla.cumsum(tensor) |
| vla.logsumexp(x) | Numerically stable log-sum-exp | vla.logsumexp(tensor) |
| vla.min(x) | Minimum value | vla.min(tensor) |
| vla.max(x) | Maximum value | vla.max(tensor) |

Matrix Operations

| Function | Description | Example |
|----------|-------------|---------|
| vla.dot(a, b) | Exact dot product | vla.dot(x, y) |
| vla.matmul(a, b) | Exact matrix multiplication | vla.matmul(A, B) |
| vla.mm(a, b) | Alias for matmul | vla.mm(A, B) |
| vla.bmm(a, b) | Batched matrix multiplication | vla.bmm(batch_A, batch_B) |
| vla.linear(x, w, b) | Linear layer: y = xW^T + b | vla.linear(x, weight, bias) |
| vla.einsum(eq, *ops) | Einstein summation | vla.einsum('ij,jk->ik', A, B) |

Element-wise Arithmetic

| Function | Description | Example |
|----------|-------------|---------|
| vla.add(a, b) | Addition with error capture | vla.add(x, y) |
| vla.sub(a, b) | Subtraction | vla.sub(x, y) |
| vla.mul(a, b) | Multiplication with error capture | vla.mul(x, y) |
| vla.div(a, b) | Division | vla.div(x, y) |
| vla.neg(x) | Negation | vla.neg(x) |
| vla.abs(x) | Absolute value | vla.abs(x) |
| vla.pow(x, n) | Power | vla.pow(x, 2) |
| vla.clamp(x, min, max) | Clamp values to range | vla.clamp(x, 0, 1) |
| vla.fmod(x, y) | Floating-point modulo | vla.fmod(x, y) |

Transcendental Functions

| Function | Description | Example |
|----------|-------------|---------|
| vla.exp(x) | Exponential function | vla.exp(x) |
| vla.log(x) | Natural logarithm | vla.log(x) |
| vla.sqrt(x) | Square root | vla.sqrt(x) |
| vla.rsqrt(x) | Reciprocal square root | vla.rsqrt(x) |

Trigonometric Functions

| Function | Description | Example |
|----------|-------------|---------|
| vla.sin(x) | Sine | vla.sin(x) |
| vla.cos(x) | Cosine | vla.cos(x) |
| vla.tan(x) | Tangent | vla.tan(x) |
| vla.asin(x) | Inverse sine (arcsin) | vla.asin(x) |
| vla.acos(x) | Inverse cosine (arccos) | vla.acos(x) |
| vla.atan(x) | Inverse tangent (arctan) | vla.atan(x) |
| vla.atan2(y, x) | Two-argument inverse tangent | vla.atan2(y, x) |

Hyperbolic Functions

| Function | Description | Example |
|----------|-------------|---------|
| vla.sinh(x) | Hyperbolic sine | vla.sinh(x) |
| vla.cosh(x) | Hyperbolic cosine | vla.cosh(x) |
| vla.tanh(x) | Hyperbolic tangent | vla.tanh(x) |

Rounding Functions

| Function | Description | Example |
|----------|-------------|---------|
| vla.floor(x) | Floor (round down) | vla.floor(x) |
| vla.ceil(x) | Ceiling (round up) | vla.ceil(x) |
| vla.round(x) | Round to nearest integer | vla.round(x) |
| vla.trunc(x) | Truncate toward zero | vla.trunc(x) |

Comparison Functions

| Function | Description | Example |
|----------|-------------|---------|
| vla.sign(x) | Sign function (-1, 0, or 1) | vla.sign(x) |
| vla.eq(x, y) | Element-wise equality | vla.eq(x, y) |
| vla.ne(x, y) | Element-wise not equal | vla.ne(x, y) |
| vla.lt(x, y) | Element-wise less than | vla.lt(x, y) |
| vla.le(x, y) | Element-wise less or equal | vla.le(x, y) |
| vla.gt(x, y) | Element-wise greater than | vla.gt(x, y) |
| vla.ge(x, y) | Element-wise greater or equal | vla.ge(x, y) |
| vla.where(c, x, y) | Conditional selection | vla.where(cond, x, y) |

Activation Functions

| Function | Description | Example |
|----------|-------------|---------|
| vla.relu(x) | ReLU activation | vla.relu(x) |
| vla.sigmoid(x) | Sigmoid activation | vla.sigmoid(x) |
| vla.leaky_relu(x, slope) | Leaky ReLU | vla.leaky_relu(x, 0.01) |

Signal Processing

| Function | Description | Example |
|----------|-------------|---------|
| vla.fft(x) | 1D Fast Fourier Transform | vla.fft(signal) |
| vla.ifft(x) | 1D Inverse FFT | vla.ifft(spectrum) |
| vla.rfft(x) | 1D Real FFT | vla.rfft(signal) |
| vla.irfft(x) | 1D Inverse Real FFT | vla.irfft(spectrum) |
| vla.conv2d(x, w) | 2D Convolution | vla.conv2d(image, kernel) |

Linear Algebra

| Function | Description | Example |
|----------|-------------|---------|
| vla.trace(A) | Matrix trace (sum of diagonal) | vla.trace(matrix) |
| vla.det(A) | Matrix determinant | vla.det(matrix) |
| vla.inv(A) | Matrix inverse | vla.inv(matrix) |
| vla.solve(A, B) | Solve linear system Ax = B | vla.solve(A, b) |
| vla.mse_loss(pred, target) | Mean squared error loss | vla.mse_loss(pred, y) |

Precision Chaining with return_vla

For maximum precision in chained operations, use return_vla=True:

# Standard: precision lost at each step
r = vla.sqrt(vla.add(vla.mul(x, x), vla.mul(y, y)))

# Chained: full precision preserved through entire computation
x2 = vla.mul(x, x, return_vla=True)
y2 = vla.mul(y, y, return_vla=True)
r2 = vla.add(x2, y2, return_vla=True)
r = vla.sqrt(r2)  # Final collapse to tensor

Critical for: Orbital mechanics, long-running integrations, financial calculations, any computation with many sequential operations.

Checksums & Verification

VLA produces bit-identical results across all GPU architectures. Use checksums to verify reproducibility:

# Compute deterministic checksum
result = vla.matmul(A, B)
cs = vla.checksum(result)  # "6ece6956f187064f"

# Full 64-char SHA256
full_cs = vla.checksum_hex(result)

# Verify result matches expected checksum
vla.verify(result, "6ece6956f187064f")  # Raises ValueError if mismatch

# Non-raising version
is_valid = vla.verify(result, "6ece6956f187064f", raise_on_mismatch=False)
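One plausible way such a deterministic 16-hex-digit checksum could be derived (illustrative only; VLA's actual scheme is not documented here) is to hash a canonical byte encoding of the result with SHA-256 and keep the first 16 hex characters. Bit-identical results then always yield identical checksums:

```python
import hashlib
import struct

def checksum16(values):
    # Canonical little-endian float64 byte encoding, then SHA-256.
    raw = b"".join(struct.pack("<d", v) for v in values)
    return hashlib.sha256(raw).hexdigest()[:16]

cs = checksum16([1.0, 2.0, 3.0])
print(cs, len(cs))  # same 16-hex string on every run and every machine
```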

Examples

Example 1: The Kahan Sum Test

Standard floating-point fails this classic test:

import torch
from simgen import vla

# 1e20 + 10000 ones - 1e20 = should be 10000
data = torch.tensor([1e20] + [1.0]*10000 + [-1e20], device='cuda')

print(f"FP32: {data.sum().item()}")           # 0.0 (WRONG!)
print(f"FP64: {data.double().sum().item()}")  # 0.0 (WRONG!)
print(f"VLA:  {vla.sum(data).item()}")        # 10000.0 (CORRECT!)
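The same test can be reproduced on CPU with the stdlib math.fsum, which performs exactly rounded summation via error-free partial sums, while the naive built-in sum loses all 10,000 ones:

```python
import math

data = [1e20] + [1.0] * 10000 + [-1e20]
print(sum(data))       # 0.0 - naive float64 sum absorbs every 1.0
print(math.fsum(data)) # 10000.0 - exactly rounded summation
```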

Example 2: Cross-GPU Verification

import torch
from simgen import vla

torch.manual_seed(42)
A = torch.randn(1024, 1024, device='cuda')
B = torch.randn(1024, 1024, device='cuda')

result = vla.matmul(A, B)
checksum = vla.checksum(result)

print(f"Checksum: {checksum}")
# This EXACT checksum will be produced on ANY NVIDIA GPU:
# RTX 4070, Tesla T4, A100, H100, etc.

Example 3: Orbital Mechanics

import torch
from simgen import vla

# Satellite orbital parameters (ISS altitude)
r0, v0 = 6779.0, 7.66  # km, km/s
GM = 398600.4418       # km^3/s^2

x = torch.tensor([r0], device='cuda')
y = torch.tensor([0.0], device='cuda')
vx = torch.tensor([0.0], device='cuda')
vy = torch.tensor([v0], device='cuda')
dt = torch.tensor([1.0], device='cuda')

# Propagate orbit with chained precision
for _ in range(55000):  # ~10 orbits
    # Compute radius with full precision chain
    x2 = vla.mul(x, x, return_vla=True)
    y2 = vla.mul(y, y, return_vla=True)
    r2 = vla.add(x2, y2, return_vla=True)
    r = vla.sqrt(r2)

    # Update position and velocity...

final_r = vla.sqrt(vla.add(vla.mul(x, x), vla.mul(y, y)))
print(f"Final orbital radius: {final_r.item():.4f} km")

Example 4: Financial Calculations

import torch
from simgen import vla

# Portfolio values (mixed magnitudes)
positions = torch.tensor([
    1_000_000_000.00,  # $1B position
    0.01,              # 1 cent
    -999_999_999.99,   # Large short
    50_000.50,         # Medium position
], device='cuda')

# Standard sum loses the penny
fp32_sum = positions.sum().item()
print(f"FP32 sum: {fp32_sum}")  # May lose precision

# VLA preserves every cent
vla_sum = vla.sum(positions).item()
print(f"VLA sum:  {vla_sum}")  # Exact to the penny

# For audit trails, use exact
exact_total = vla.sum_exact(positions)
print(f"Exact:    {exact_total}")
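For a CPU-side audit-trail analogue, the stdlib decimal module gives the same exact-to-the-penny behavior when positions are constructed from strings:

```python
from decimal import Decimal

# Decimal strings carry every cent exactly, regardless of magnitude mix.
positions = [Decimal("1000000000.00"), Decimal("0.01"),
             Decimal("-999999999.99"), Decimal("50000.50")]
total = sum(positions)
print(total)  # 50000.52 - exact to the penny
```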

Supported GPU Architectures

| Architecture | GPUs | Compute Capability |
|--------------|------|--------------------|
| Pascal | GTX 1080, P100 | sm_60, sm_61 |
| Volta | V100 | sm_70 |
| Turing | RTX 2080, T4 | sm_75 |
| Ampere | RTX 3090, A100 | sm_80, sm_86 |
| Ada Lovelace | RTX 4090, 4080, 4070 | sm_89 |
| Hopper | H100 | sm_90 |

Performance

| Matrix Size | CPU Decimal | VLA GPU | Speedup |
|-------------|-------------|---------|---------|
| 1024x1024 | 37 min | 0.2s | 12,922x |
| 4096x4096 | 1.6 days | 10s | 13,934x |
| 10240x10240 | 25.5 days | 2.7 min | 13,848x |
| 20480x20480 | 204 days | ~22 min | ~13,000x |

SimGen VLA v3.5.0 | Clouthier Simulation Labs | kyle@simgen.dev