Slashing LLM Kernel Overhead
Explore a multi‑process Rust approach that reduces GPU kernel launches via lock‑free shared‑memory batching, cutting launches by more than 90% and delivering a 22% speedup on an FP16 decode workload.
LLM inference often involves millions of kernel launches. Because decode generates one token at a time, it bombards the GPU with small, sequential operations, so the hidden bottleneck is driver launch overhead rather than compute.
We’ll demonstrate a multi-process Rust application that decouples inference logic from GPU execution using a lock-free shared memory queue. This architecture enables intelligent, on-the-fly batching of identical operations into a single, efficient cuBLASLt call.
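To make the idea concrete, here is a minimal sketch of the queue-then-batch pattern, not the code from the talk. It rests on several assumptions: two threads in one process stand in for the separate processes and the shared-memory segment, the names `GemmKey`, `GemmRequest`, `SpscRing`, and `drain_and_batch` are hypothetical, and no cuBLASLt call is made. Each group of identical shapes simply represents work that could be submitted as one batched launch.

```rust
use std::cell::UnsafeCell;
use std::collections::HashMap;
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::Arc;
use std::thread;

/// Hypothetical shape descriptor: requests with equal keys can share one launch.
#[derive(Clone, Copy, Debug, Default, PartialEq, Eq, Hash)]
struct GemmKey { m: u32, n: u32, k: u32 }

/// Hypothetical request record placed on the queue by the inference logic.
#[derive(Clone, Copy, Debug, Default)]
struct GemmRequest { key: GemmKey, request_id: u64 }

const CAPACITY: usize = 1024; // power of two, so wrapping indices stay consistent

/// Single-producer, single-consumer ring buffer built on atomics (no locks).
struct SpscRing {
    slots: Vec<UnsafeCell<GemmRequest>>,
    head: AtomicUsize, // next slot to read, advanced only by the consumer
    tail: AtomicUsize, // next slot to write, advanced only by the producer
}

// Safety: sound only if exactly one thread pushes and exactly one thread pops.
unsafe impl Sync for SpscRing {}

impl SpscRing {
    fn new() -> Self {
        SpscRing {
            slots: (0..CAPACITY).map(|_| UnsafeCell::new(GemmRequest::default())).collect(),
            head: AtomicUsize::new(0),
            tail: AtomicUsize::new(0),
        }
    }

    /// Returns false when the ring is full.
    fn push(&self, req: GemmRequest) -> bool {
        let tail = self.tail.load(Ordering::Relaxed);
        let head = self.head.load(Ordering::Acquire);
        if tail.wrapping_sub(head) == CAPACITY { return false; }
        unsafe { *self.slots[tail % CAPACITY].get() = req; }
        // Release publishes the slot write before the consumer sees the new tail.
        self.tail.store(tail.wrapping_add(1), Ordering::Release);
        true
    }

    /// Returns None when the ring is empty.
    fn pop(&self) -> Option<GemmRequest> {
        let head = self.head.load(Ordering::Relaxed);
        let tail = self.tail.load(Ordering::Acquire);
        if head == tail { return None; }
        let req = unsafe { *self.slots[head % CAPACITY].get() };
        // Release hands the slot back to the producer after the read completes.
        self.head.store(head.wrapping_add(1), Ordering::Release);
        Some(req)
    }
}

/// Drains everything currently queued and groups request ids by identical shape.
fn drain_and_batch(ring: &SpscRing) -> HashMap<GemmKey, Vec<u64>> {
    let mut batches: HashMap<GemmKey, Vec<u64>> = HashMap::new();
    while let Some(req) = ring.pop() {
        batches.entry(req.key).or_default().push(req.request_id);
    }
    batches
}

fn main() {
    let ring = Arc::new(SpscRing::new());
    let producer_ring = Arc::clone(&ring);

    // Producer: stands in for the inference process issuing many small GEMMs.
    let producer = thread::spawn(move || {
        for i in 0..96u64 {
            // Illustrative shapes only: two distinct GEMM shapes, alternating.
            let n = if i % 2 == 0 { 4096 } else { 11008 };
            let req = GemmRequest { key: GemmKey { m: 1, n, k: 4096 }, request_id: i };
            while !producer_ring.push(req) { thread::yield_now(); }
        }
    });
    // Joined first to keep the demo deterministic; a real consumer would drain continuously.
    producer.join().unwrap();

    let batches = drain_and_batch(&ring);
    let total: usize = batches.values().map(|ids| ids.len()).sum();
    println!("{} requests grouped into {} batched launches", total, batches.len());
}
```

The design choice worth noting is that the consumer decides batch boundaries, so the producer never blocks on the GPU. How long the consumer waits before flushing a batch is a latency versus launch-count trade-off that this sketch does not model.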
The results are significant: we’ll show a 90%+ reduction in kernel launches and a 22% speedup on a realistic FP16 decode workload. We’ll also explore why this same technique results in a slight slowdown for compute-bound prefill workloads, providing a nuanced, first-principles look at a core optimization used by all major inference engines.
Update: GitHub repo added