GPU shuffle operations read from other lanes in a warp. If those lanes are inactive (diverged), you get undefined behavior — not a crash, just silently wrong results. NVIDIA deprecated an entire API family over this. Their own reference code still contains the bug (cuda-samples#398). A plasma physics simulation ran for months with the UB undetected.
warp-types tracks active lane masks in Rust's type system. Warp<All> has shuffle methods. Warp<Even> doesn't — they're not checked and rejected, they literally don't exist on the type. Diverge produces complementary sub-warps; merge requires proof they're complements. The types erase completely — Warp<S> is PhantomData, so the generated PTX is byte-identical to untyped code.
345 tests, 8 real-bug examples (NVIDIA, PyTorch, OpenCV, TVM), real GPU execution on RTX 4000 Ada, and a Lean 4 soundness proof (28 theorems, zero sorry, zero axioms).
GPU shuffle operations read from other lanes in a warp. If those lanes are inactive (diverged), you get undefined behavior — not a crash, just silently wrong results. NVIDIA deprecated an entire API family over this. Their own reference code still contains the bug (cuda-samples#398). A plasma physics simulation ran for months with the UB undetected.
warp-types tracks active lane masks in Rust's type system. Warp<All> has shuffle methods. Warp<Even> doesn't — they're not checked and rejected, they literally don't exist on the type. Diverge produces complementary sub-warps; merge requires proof they're complements. The types erase completely — Warp<S> is PhantomData, so the generated PTX is byte-identical to untyped code.
345 tests, 8 real-bug examples (NVIDIA, PyTorch, OpenCV, TVM), real GPU execution on RTX 4000 Ada, and a Lean 4 soundness proof (28 theorems, zero sorry, zero axioms).