25.5 DAYS to 2.7 MIN
13,848x faster. Both EXACT.
13,848x
Speedup
104M
Elements Exact
0
Precision Loss
Free
pip install
Execution Time Comparison
CPU Decimal takes DAYS. VLA takes SECONDS. Both give EXACT results.
| Matrix | Elements | CPU Decimal | VLA GPU | Speedup |
|---|---|---|---|---|
| 512x512 | 262K | 4.6 min | 0.05s | 6,071x |
| 1024x1024 | 1M | 36.7 min | 0.2s | 12,922x |
| 2048x2048 | 4M | 4.9 hrs | 1.3s | 13,814x |
| 4096x4096 | 17M | 1.6 DAYS | 10.1s | 13,934x |
| 6144x6144 | 38M | 5.5 DAYS | 34.3s | 13,885x |
| 8192x8192 | 67M | 13.1 DAYS | 1.4 min | 13,843x |
| 10240x10240 | 105M | 25.5 DAYS | 2.7 min | 13,848x |
Benchmarked on Tesla T4 (Kaggle), February 2026
Cross-GPU Reproducibility
Same checksum on completely different architectures. This is unprecedented.
10240x10240 Matrix Multiply Checksum
RTX 4070 (Ada Lovelace, sm_89)
6ece6956f187064f
Tesla T4 (Turing, sm_75)
6ece6956f187064f
BIT-IDENTICAL
Different GPU architectures, different memory layouts, same exact result
100%
Reproducible
2
GPU architectures verified
0
Bit differences
VLA Beats Everything
Test: 1e20 + 10,000 ones - 1e20. Expected result: 10,000. Only VLA gets it right.
FP32
8,750
Lost 1,250
FP64
7,500
Lost 2,500
80-bit Extended
9,984
Lost 16
VLA
10,000
EXACT
VLA beats Intel 80-bit extended precision hardware - on a consumer GPU.
FP64 Loses Tens of Thousands of Values
VLA recovers ALL of them.
| Test | Expected | FP64 Lost | VLA Lost |
|---|---|---|---|
| 1e20 + 10K - 1e20 | 9,998 | 1,262 (12.6%) | 0 |
| 1e20 + 100K - 1e20 | 99,998 | 25,022 (25.0%) | 0 |
| 1e20 + 500K - 1e20 | 499,998 | 24,862 (5.0%) | 0 |
| 1e20 + 1M - 1e20 | 999,998 | 33,342 (3.3%) | 0 |
Real-World Impact
Financial Transactions
$881,143,573.77
1 million transactions summed
Patriot Missile Tracking
100 Hours
0.1s increments accumulated
Lorenz Chaos System
50,000 Steps
Chaotic trajectory integration
Orbit Propagation
10 Orbits
ISS-altitude satellite tracking
Try It Yourself
All benchmarks are reproducible. Run them on Kaggle or install locally.