VLA Documentation
Complete API reference for SimGen VLA
Installation
pip install simgen-vla
Requirements: Python 3.10+, PyTorch 2.0+ with CUDA, NVIDIA GPU (Pascal through Hopper: sm_60 to sm_90)
VLADecimal - GPU-Native Extended Precision
NEW in v3.5.0
VLADecimal is a GPU-native extended precision type (106+ bit mantissa) that keeps ALL operations on GPU. No CPU conversions until you explicitly request a Python Decimal.
import torch
from simgen import vla
# Create GPU-native extended precision tensors
x = vla.Decimal(torch.randn(1000, device='cuda'))
y = vla.Decimal(torch.randn(1000, device='cuda'))
# All operations stay on GPU with full precision
result = (x + y * 2).sum()
# Display exact value (converts to Python Decimal only for display)
print(result) # VLADecimal(-12.34567890123456789...)
# Explicit conversion when you need Python Decimal
exact_value = result.to_decimal() # decimal.Decimal object
# Convert back to torch.Tensor when done
tensor = result.to_torch() # float64 tensor
Features
- 82 methods - Full arithmetic, reductions, linear algebra, trig, and more
- GPU-native - All operations stay on GPU until you explicitly convert
- Chainable - (x + y).sum().sqrt() preserves precision throughout
- Indexing - x[0], x[1:10] all return VLADecimal
- Shape ops - reshape, view, squeeze, transpose, flatten, etc.
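A 106-bit mantissa is what two chained IEEE doubles (a "double-double") provide, which is one plausible way such a type stays GPU-native; that is an assumption about the internals, not a documented fact. The building block of any such scheme is an error-free transformation that captures the exact rounding error of an addition, sketched here in plain Python:

```python
def two_sum(a: float, b: float) -> tuple[float, float]:
    """Knuth's error-free transformation: s + err == a + b exactly.

    s is the rounded float sum; err is the rounding error that a
    double-double representation would carry in its low word.
    """
    s = a + b
    bb = s - a
    err = (a - (s - bb)) + (b - bb)
    return s, err

# 1.0 is far below the ulp of 1e16, so a plain float add loses it;
# two_sum recovers it exactly in the error term.
s, err = two_sum(1e16, 1.0)
print(s, err)  # 1e+16 1.0
```

The pair (s, err) represents the sum with no information loss, which is why chained operations on such a type can avoid accumulation error.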
Factory Functions
# Create from exact fractions (TRUE zero representation error)
x = vla.Decimal_frac(1, 3) # Exact 1/3
# Create zeros/ones/random
zeros = vla.Decimal_zeros((100, 100))
ones = vla.Decimal_ones((100, 100))
randn = vla.Decimal_randn((100, 100))
# Concatenate/stack VLADecimal tensors
combined = vla.Decimal_cat([x, y], dim=0)
stacked = vla.Decimal_stack([x, y], dim=0)
TRUE ZERO Exact Arithmetic
NEW in v3.5.3 - GPU ModularTensor
TRUE ZERO error arithmetic on GPU using proprietary exact computation. Not ~1e-15 approximation - mathematically ZERO error for all operations.
GPU: ModularTensor (CUDA accelerated)
from simgen import vla
# GPU exact arithmetic - TRUE ZERO error
a = vla.ModularTensor.from_fraction(1, 3, shape=(10000,), device='cuda')
b = vla.ModularTensor.from_fraction(1, 6, shape=(10000,), device='cuda')
c = a + b # Exactly 1/2 for all 10,000 elements
# Exact equality check
expected = vla.ModularTensor.from_fraction(1, 2, shape=(10000,), device='cuda')
print((c == expected).all()) # True - TRUE ZERO!
# 100K iterations - still TRUE ZERO
acc = vla.ModularTensor.from_int(0, shape=(1,), device='cuda')
delta = vla.ModularTensor.from_fraction(1, 100000, shape=(1,), device='cuda')
for _ in range(100000):
    acc = acc + delta
print((acc == vla.ModularTensor.from_int(1, shape=(1,), device='cuda')).all()) # True!
CPU: ModularRational
from simgen import vla
# CPU exact arithmetic
a = vla.ModularRational.from_fraction(1, 3) # Exactly 1/3
b = vla.ModularRational.from_fraction(1, 6) # Exactly 1/6
c = a + b # Exactly 1/2
print(c == vla.ModularRational.from_fraction(1, 2)) # True - EXACT!
Why TRUE ZERO?
- TRUE ZERO error - Not ~1e-15, but mathematically ZERO
- GPU accelerated - 444M ops/sec on RTX 4070
- Constant memory - Fixed memory per value regardless of operation count
- Chaotic systems - Lorenz 50,000 steps with exact reversibility
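The exact-reversibility claim is easy to sanity-check on CPU with the stdlib's fractions.Fraction standing in for ModularRational (a sketch of the idea, not the library's implementation):

```python
from fractions import Fraction as F

# Exact Euler integration of dx/dt = -x, then exact replay backwards.
dt = F(1, 1000)
x = F(1)
stored = []
for _ in range(1000):
    dx = -x             # derivative at the current state
    stored.append(dx)
    x = x + dt * dx     # every step is exact rational arithmetic
# Walk back using the stored derivatives: no rounding, no drift.
for dx in reversed(stored):
    x = x - dt * dx
print(x == F(1))  # True: bit-for-bit back at the start
```

With floats the backward pass would land near, but not exactly on, the initial state; with exact rationals the round trip is an identity.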
Vectors and Matrices
# Exact vector operations
vec_a = vla.ModularVector.from_fractions([(1,2), (1,3), (1,4)])
vec_b = vla.ModularVector.from_fractions([(2,1), (3,1), (4,1)])
dot = vec_a.dot(vec_b) # Exactly 3
# Exact matrix operations
A = vla.ModularMatrix.from_ints([[1,1],[1,0]]) # Fibonacci matrix
F10 = A.power(10) # F10[0,0] = 89 - EXACT!
# Matrix multiply
I = vla.ModularMatrix.identity(2)
B = A * I # A @ I = A exactly
Chaotic Systems (Lorenz)
# Lorenz attractor with EXACT time reversibility
from simgen import vla
def lorenz_derivs(x, y, z):
    sigma = vla.ModularRational.from_int(10)
    rho = vla.ModularRational.from_int(28)
    beta = vla.ModularRational.from_fraction(8, 3)
    return sigma*(y-x), x*(rho-z)-y, x*y - beta*z
dt = vla.ModularRational.from_fraction(1, 1000)
x, y, z = [vla.ModularRational.from_int(1)]*3
stored = []
# Forward 5000 steps
for _ in range(5000):
    dx, dy, dz = lorenz_derivs(x, y, z)
    stored.append((dx, dy, dz))
    x, y, z = x + dt*dx, y + dt*dy, z + dt*dz
# Backward 5000 steps (using stored derivatives)
for dx, dy, dz in reversed(stored):
    x, y, z = x - dt*dx, y - dt*dy, z - dt*dz
# TRUE ZERO - back to start!
print(x == vla.ModularRational.from_int(1)) # True
When to use each precision level
- VLADecimal (GPU): Production simulations where ~1e-15 error is acceptable
- ModularTensor (GPU): TRUE ZERO on GPU - financial, cryptographic, verification
- ModularRational (CPU): TRUE ZERO scalar arithmetic
Quick Start
import torch
from simgen import vla
# Create test data
x = torch.randn(10000, device='cuda')
# Exact sum with zero accumulation error
result = vla.sum(x)
# Get cross-GPU checksum (SAME on any GPU!)
checksum = vla.checksum(result)
print(f"Checksum: {checksum}") # e.g., "6ece6956f187064f"
The Killer Feature
VLA checksums are BIT-IDENTICAL across all NVIDIA GPUs. Run the same computation on RTX 4070, Tesla T4, or A100 - get the exact same checksum.
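VLA's checksum algorithm is not documented here, but the underlying idea - hash the exact bytes of a deterministic result and compare the digest across machines - can be emulated with hashlib. The 16-hex-char truncation mirrors the short form shown above; this is an illustration, not VLA's actual scheme:

```python
import hashlib
import struct

def float_checksum(values: list[float]) -> str:
    """Deterministic 16-hex-char digest of the exact float64 bytes.

    Two machines that produce bit-identical floats always agree on
    this string; a single differing ulp changes it.
    """
    raw = b"".join(struct.pack("<d", v) for v in values)
    return hashlib.sha256(raw).hexdigest()[:16]

cs = float_checksum([1.0, 2.0, 3.0])
print(cs)  # same 16-character string on every machine
```

The crucial precondition is bit-identical inputs to the hash, which is exactly the reproducibility guarantee the checksum is meant to verify.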
Understanding VLA's Guarantee
What VLA Guarantees: ZERO ACCUMULATION Error
VLA eliminates accumulation error - the errors that compound when performing arithmetic operations. Every +, -, *, / is mathematically exact:
- Order independence: (a + b) + c = a + (b + c) always (not guaranteed by IEEE 754)
- Cross-GPU reproducibility: Same computation = identical result on any GPU
- No error growth: Million-step simulations don't accumulate drift
What VLA Cannot Fix: INPUT REPRESENTATION Error
VLA cannot fix errors that exist before it sees your data. When you write 0.001 in Python, it's already corrupted:
# 0.001 is NOT exactly representable in binary
# Python stores it as 0.001000000000000000020816681711721685228...
x = torch.tensor([0.001], device='cuda')
# The error already exists BEFORE VLA sees this tensor
Binary-Exact vs Non-Binary-Exact Values
import torch
from simgen import vla
# TEST 1: Non-binary-exact input (0.001)
# 0.001 requires infinite bits - stored as approximation
increment = 0.001
x_vla = torch.tensor([0.0], device='cuda')
for _ in range(100_000):
    x_vla = vla.add(x_vla, torch.tensor([increment], device='cuda'))
# Result: 100.00000000133288 (input representation error)
# TEST 2: Binary-exact input (0.125 = 1/8 = 2^-3)
# This is EXACTLY representable in binary!
increment = 0.125
x_vla = torch.tensor([0.0], device='cuda')
for _ in range(100_000):
    x_vla = vla.add(x_vla, torch.tensor([increment], device='cuda'))
# Result: 12500.0 EXACTLY (TRUE ZERO error)
Binary-Exact Values (TRUE Zero Error)
| Value | Binary | Exact? |
|---|---|---|
| 0.5 | 2^-1 | YES |
| 0.25 | 2^-2 | YES |
| 0.125 | 2^-3 | YES |
| 0.0625 | 2^-4 | YES |
| 0.001 | Infinite expansion | NO |
| 0.1 | Infinite expansion | NO |
| 0.3 | Infinite expansion | NO |
The Key Insight
- IEEE 754 with 0.001: Input error + Accumulation error = Large error
- VLA with 0.001: Input error + ZERO = Input error only
- VLA with 0.125: ZERO + ZERO = TRUE ZERO
VLA guarantees your arithmetic is perfect. If you want perfect results, also ensure your inputs are perfectly representable.
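The distinction is easy to reproduce with plain Python floats - no library required. This repeats the two accumulation tests above at stdlib level:

```python
# Non-binary-exact increment: representation error compounds under
# naive accumulation, so the total misses 100.0.
total = 0.0
for _ in range(100_000):
    total += 0.001
print(total == 100.0)    # False

# Binary-exact increment (0.125 = 2**-3): every partial sum k/8 is
# itself exactly representable, so each addition is exact.
total = 0.0
for _ in range(100_000):
    total += 0.125
print(total == 12500.0)  # True
```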
Input Precision Utilities
VLA provides utilities to help you work with binary-exact inputs for TRUE zero error:
| Function | Description | Example |
|---|---|---|
| vla.is_exact(x) | Check if float is binary-exact | vla.is_exact(0.125) → True |
| vla.to_exact(x) | Snap to nearest binary-exact value | vla.to_exact(0.001) → 0.0009765625 |
| vla.frac(n, d) | Create exact fraction tensor | vla.frac(1, 1024) |
| vla.dyadic(x) | Find closest p/2^q rational | vla.dyadic(0.001) |
| x.to_decimal() | Convert VLADecimal to exact Python Decimal | vla.Decimal(t).to_decimal() |
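How a dyadic search might work can be sketched with the stdlib alone: round to the nearest p/2^q for a chosen exponent q. Note the explicit q parameter is an illustration - vla.dyadic takes only x, so how it picks the denominator is an assumption here:

```python
from fractions import Fraction

def nearest_dyadic(x: float, q: int = 10):
    """Nearest p / 2**q to x, with its exact float value and the error."""
    p = round(x * 2**q)       # nearest numerator at scale 2**q
    exact = p / 2**q          # small dyadic rationals are float-exact
    error = float(Fraction(x) - Fraction(p, 2**q))
    return p, 2**q, exact, error

print(nearest_dyadic(0.001))  # (1, 1024, 0.0009765625, ~2.34e-05)
```

With q = 10 this reproduces the (1, 1024, 0.0009765625, 0.0000234375) tuple shown below for vla.dyadic(0.001).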
Example: Using Exact Fractions
import torch
from simgen import vla
# Instead of 0.001 (NOT exact), use vla.frac():
dt = vla.frac(1, 1024) # Exact 0.0009765625
# Check if a value is exact
vla.is_exact(0.125) # True (1/8 = 2^-3)
vla.is_exact(0.001) # False (infinite binary expansion)
# See exact stored value with VLADecimal
x = vla.Decimal(torch.tensor([0.001], device='cuda'))
print(x.to_decimal()) # Shows exact stored representation
# Find the best dyadic approximation
num, denom, exact, error = vla.dyadic(0.001)
# (1, 1024, 0.0009765625, 0.0000234375)
Pro Tip: Powers of 2
Fractions with power-of-2 denominators (2, 4, 8, 16, 32, 64, 128, 256, 512, 1024, ...) are always exactly representable. Use vla.frac(1, 1024) instead of 0.001 for TRUE zero error.
Global Enable Mode
Patch ALL PyTorch operations with one line:
import torch
from simgen import vla
vla.enable() # Now ALL torch ops use VLA!
# These now use exact arithmetic automatically:
torch.sum(x) # Uses VLA internally
torch.matmul(A, B) # Uses VLA internally
model(input) # Entire model uses VLA!
vla.disable() # Restore standard PyTorch ops
Context Manager
with vla.mode():
    # All operations in this block use VLA
    result = torch.sum(x)
    output = model(input)
# Back to standard PyTorch outside the block
Core Reductions
| Function | Description | Example |
|---|---|---|
| vla.sum(x) | Exact sum with zero accumulation error | vla.sum(tensor) |
| vla.mean(x) | Exact mean | vla.mean(tensor) |
| vla.var(x) | Exact variance | vla.var(tensor) |
| vla.std(x) | Exact standard deviation | vla.std(tensor) |
| vla.norm(x, p=2) | Exact Lp norm | vla.norm(tensor) |
| vla.prod(x) | Exact product | vla.prod(tensor) |
| vla.cumsum(x) | Exact cumulative sum | vla.cumsum(tensor) |
| vla.logsumexp(x) | Numerically stable log-sum-exp | vla.logsumexp(tensor) |
| vla.min(x) | Minimum value | vla.min(tensor) |
| vla.max(x) | Maximum value | vla.max(tensor) |
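For contrast with vla.sum's exact accumulation: the classic software workaround is compensated summation, which carries a running correction term but is still not exact in general. A Neumaier-style sketch in plain Python:

```python
def neumaier_sum(values):
    """Compensated summation (Neumaier's variant of Kahan's algorithm).

    Tracks the rounding error of each addition in a correction term;
    far more accurate than naive += but, unlike an exact reduction,
    not guaranteed error-free for all inputs.
    """
    total = 0.0
    comp = 0.0                       # compensation for lost low-order bits
    for v in values:
        t = total + v
        if abs(total) >= abs(v):
            comp += (total - t) + v  # low bits of v were lost
        else:
            comp += (v - t) + total  # low bits of total were lost
        total = t
    return total + comp

data = [1e20] + [1.0] * 10000 + [-1e20]
print(sum(data), neumaier_sum(data))  # 0.0 10000.0
```

This is the same catastrophic-cancellation case as the Kahan Sum Test example below; naive summation drops all 10,000 ones, while the compensated version recovers them.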
Matrix Operations
| Function | Description | Example |
|---|---|---|
| vla.dot(a, b) | Exact dot product | vla.dot(x, y) |
| vla.matmul(a, b) | Exact matrix multiplication | vla.matmul(A, B) |
| vla.mm(a, b) | Alias for matmul | vla.mm(A, B) |
| vla.bmm(a, b) | Batched matrix multiplication | vla.bmm(batch_A, batch_B) |
| vla.linear(x, w, b) | Linear layer: y = xW^T + b | vla.linear(x, weight, bias) |
| vla.einsum(eq, *ops) | Einstein summation | vla.einsum('ij,jk->ik', A, B) |
Element-wise Arithmetic
| Function | Description | Example |
|---|---|---|
| vla.add(a, b) | Addition with error capture | vla.add(x, y) |
| vla.sub(a, b) | Subtraction | vla.sub(x, y) |
| vla.mul(a, b) | Multiplication with error capture | vla.mul(x, y) |
| vla.div(a, b) | Division | vla.div(x, y) |
| vla.neg(x) | Negation | vla.neg(x) |
| vla.abs(x) | Absolute value | vla.abs(x) |
| vla.pow(x, n) | Power | vla.pow(x, 2) |
| vla.clamp(x, min, max) | Clamp values to range | vla.clamp(x, 0, 1) |
| vla.fmod(x, y) | Floating-point modulo | vla.fmod(x, y) |
Transcendental Functions
| Function | Description | Example |
|---|---|---|
| vla.exp(x) | Exponential function | vla.exp(x) |
| vla.log(x) | Natural logarithm | vla.log(x) |
| vla.sqrt(x) | Square root | vla.sqrt(x) |
| vla.rsqrt(x) | Reciprocal square root | vla.rsqrt(x) |
Trigonometric Functions
| Function | Description | Example |
|---|---|---|
| vla.sin(x) | Sine | vla.sin(x) |
| vla.cos(x) | Cosine | vla.cos(x) |
| vla.tan(x) | Tangent | vla.tan(x) |
| vla.asin(x) | Inverse sine (arcsin) | vla.asin(x) |
| vla.acos(x) | Inverse cosine (arccos) | vla.acos(x) |
| vla.atan(x) | Inverse tangent (arctan) | vla.atan(x) |
| vla.atan2(y, x) | Two-argument inverse tangent | vla.atan2(y, x) |
Hyperbolic Functions
| Function | Description | Example |
|---|---|---|
| vla.sinh(x) | Hyperbolic sine | vla.sinh(x) |
| vla.cosh(x) | Hyperbolic cosine | vla.cosh(x) |
| vla.tanh(x) | Hyperbolic tangent | vla.tanh(x) |
Rounding Functions
| Function | Description | Example |
|---|---|---|
| vla.floor(x) | Floor (round down) | vla.floor(x) |
| vla.ceil(x) | Ceiling (round up) | vla.ceil(x) |
| vla.round(x) | Round to nearest integer | vla.round(x) |
| vla.trunc(x) | Truncate toward zero | vla.trunc(x) |
Comparison Functions
| Function | Description | Example |
|---|---|---|
| vla.sign(x) | Sign function (-1, 0, or 1) | vla.sign(x) |
| vla.eq(x, y) | Element-wise equality | vla.eq(x, y) |
| vla.ne(x, y) | Element-wise not equal | vla.ne(x, y) |
| vla.lt(x, y) | Element-wise less than | vla.lt(x, y) |
| vla.le(x, y) | Element-wise less or equal | vla.le(x, y) |
| vla.gt(x, y) | Element-wise greater than | vla.gt(x, y) |
| vla.ge(x, y) | Element-wise greater or equal | vla.ge(x, y) |
| vla.where(c, x, y) | Conditional selection | vla.where(cond, x, y) |
Activation Functions
| Function | Description | Example |
|---|---|---|
| vla.relu(x) | ReLU activation | vla.relu(x) |
| vla.sigmoid(x) | Sigmoid activation | vla.sigmoid(x) |
| vla.leaky_relu(x, slope) | Leaky ReLU | vla.leaky_relu(x, 0.01) |
Signal Processing
| Function | Description | Example |
|---|---|---|
| vla.fft(x) | 1D Fast Fourier Transform | vla.fft(signal) |
| vla.ifft(x) | 1D Inverse FFT | vla.ifft(spectrum) |
| vla.rfft(x) | 1D Real FFT | vla.rfft(signal) |
| vla.irfft(x) | 1D Inverse Real FFT | vla.irfft(spectrum) |
| vla.conv2d(x, w) | 2D Convolution | vla.conv2d(image, kernel) |
Linear Algebra
| Function | Description | Example |
|---|---|---|
| vla.trace(A) | Matrix trace (sum of diagonal) | vla.trace(matrix) |
| vla.det(A) | Matrix determinant | vla.det(matrix) |
| vla.inv(A) | Matrix inverse | vla.inv(matrix) |
| vla.solve(A, B) | Solve linear system Ax = B | vla.solve(A, b) |
| vla.mse_loss(pred, target) | Mean squared error loss | vla.mse_loss(pred, y) |
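To see what an exact solve means in practice, here is a minimal Gaussian elimination over the stdlib's fractions.Fraction - a CPU sketch of the exactness vla.solve targets, not its GPU implementation:

```python
from fractions import Fraction

def exact_solve(A, b):
    """Solve Ax = b over exact rationals via Gauss-Jordan elimination.

    A is a list of rows, b a list of right-hand-side values.
    Raises StopIteration if the matrix is singular.
    """
    n = len(A)
    # Augmented matrix [A | b] promoted to exact rationals.
    M = [[Fraction(v) for v in row] + [Fraction(bi)]
         for row, bi in zip(A, b)]
    for col in range(n):
        # Bring a row with a nonzero pivot into position.
        piv = next(r for r in range(col, n) if M[r][col] != 0)
        M[col], M[piv] = M[piv], M[col]
        # Eliminate this column from every other row - all exact.
        for r in range(n):
            if r != col and M[r][col] != 0:
                f = M[r][col] / M[col][col]
                M[r] = [a - f * c for a, c in zip(M[r], M[col])]
    return [M[i][n] / M[i][i] for i in range(n)]

# 2x + y = 3, x + 3y = 4  ->  x = 1, y = 1, with zero rounding error
print(exact_solve([[2, 1], [1, 3]], [3, 4]))  # [Fraction(1, 1), Fraction(1, 1)]
```

A float-based solver would return values within a few ulps of the answer; the rational version returns it exactly, which is the property that matters for verification workloads.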
Precision Chaining with return_vla
For maximum precision in chained operations, use return_vla=True:
# Standard: precision lost at each step
r = vla.sqrt(vla.add(vla.mul(x, x), vla.mul(y, y)))
# Chained: full precision preserved through entire computation
x2 = vla.mul(x, x, return_vla=True)
y2 = vla.mul(y, y, return_vla=True)
r2 = vla.add(x2, y2, return_vla=True)
r = vla.sqrt(r2) # Final collapse to tensor
Critical for: Orbital mechanics, long-running integrations, financial calculations, any computation with many sequential operations.
Checksums & Verification
VLA produces bit-identical results across all GPU architectures. Use checksums to verify reproducibility:
# Compute deterministic checksum
result = vla.matmul(A, B)
cs = vla.checksum(result) # "6ece6956f187064f"
# Full 64-char SHA256
full_cs = vla.checksum_hex(result)
# Verify result matches expected checksum
vla.verify(result, "6ece6956f187064f") # Raises ValueError if mismatch
# Non-raising version
is_valid = vla.verify(result, "6ece6956f187064f", raise_on_mismatch=False)
Examples
Example 1: The Kahan Sum Test
Standard floating-point fails this classic test:
import torch
from simgen import vla
# 1e20 + 10000 ones - 1e20 = should be 10000
data = torch.tensor([1e20] + [1.0]*10000 + [-1e20], device='cuda')
print(f"FP32: {data.sum().item()}") # 0.0 (WRONG!)
print(f"FP64: {data.double().sum().item()}") # 0.0 (WRONG!)
print(f"VLA: {vla.sum(data).item()}") # 10000.0 (CORRECT!)
Example 2: Cross-GPU Verification
import torch
from simgen import vla
torch.manual_seed(42)
A = torch.randn(1024, 1024, device='cuda')
B = torch.randn(1024, 1024, device='cuda')
result = vla.matmul(A, B)
checksum = vla.checksum(result)
print(f"Checksum: {checksum}")
# This EXACT checksum will be produced on ANY NVIDIA GPU:
# RTX 4070, Tesla T4, A100, H100, etc.
Example 3: Orbital Mechanics
import torch
from simgen import vla
# Satellite orbital parameters (ISS altitude)
r0, v0 = 6779.0, 7.66 # km, km/s
GM = 398600.4418 # km^3/s^2
x = torch.tensor([r0], device='cuda')
y = torch.tensor([0.0], device='cuda')
vx = torch.tensor([0.0], device='cuda')
vy = torch.tensor([v0], device='cuda')
dt = torch.tensor([1.0], device='cuda')
# Propagate orbit with chained precision
for _ in range(55000): # ~10 orbits
    # Compute radius with full precision chain
    x2 = vla.mul(x, x, return_vla=True)
    y2 = vla.mul(y, y, return_vla=True)
    r2 = vla.add(x2, y2, return_vla=True)
    r = vla.sqrt(r2)
    # Update position and velocity...
final_r = vla.sqrt(vla.add(vla.mul(x, x), vla.mul(y, y)))
print(f"Final orbital radius: {final_r.item():.4f} km")
Example 4: Financial Calculations
import torch
from simgen import vla
# Portfolio values (mixed magnitudes)
positions = torch.tensor([
    1_000_000_000.00, # $1B position
    0.01,             # 1 cent
    -999_999_999.99,  # Large short
    50_000.50,        # Medium position
], device='cuda')
# Standard sum loses the penny
fp32_sum = positions.sum().item()
print(f"FP32 sum: {fp32_sum}") # May lose precision
# VLA preserves every cent
vla_sum = vla.sum(positions).item()
print(f"VLA sum: {vla_sum}") # Exact to the penny
# For audit trails, use exact
exact_total = vla.sum_exact(positions)
print(f"Exact: {exact_total}")
Supported GPU Architectures
| Architecture | GPUs | Compute Capability |
|---|---|---|
| Pascal | GTX 1080, P100 | sm_60, sm_61 |
| Volta | V100 | sm_70 |
| Turing | RTX 2080, T4 | sm_75 |
| Ampere | RTX 3090, A100 | sm_80, sm_86 |
| Ada Lovelace | RTX 4090, 4080, 4070 | sm_89 |
| Hopper | H100 | sm_90 |
Performance
| Matrix Size | CPU Decimal | VLA GPU | Speedup |
|---|---|---|---|
| 1024x1024 | 37 min | 0.2s | 12,922x |
| 4096x4096 | 1.6 days | 10s | 13,934x |
| 10240x10240 | 25.5 days | 2.7 min | 13,848x |
| 20480x20480 | 204 days | ~22 min | ~13,000x |
SimGen VLA v3.5.0 | Clouthier Simulation Labs | kyle@simgen.dev