Slashing LLM Kernel Overhead

Explore a multi-process Rust approach that reduces GPU kernel launches via lock-free shared-memory batching, cutting launches by more than 90% and delivering a 22% speedup on decode.

Overview

LLM inference often involves millions of kernel launches. Because decode generates one token at a time, it floods the GPU with small, sequential operations, and the hidden bottleneck becomes driver launch overhead rather than compute.

We’ll demonstrate a multi-process Rust application that decouples inference logic from GPU execution using a lock-free shared memory queue. This architecture enables intelligent, on-the-fly batching of identical operations into a single, efficient cuBLASLt call.
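As a rough illustration of the idea (not the talk's actual code), the sketch below shows a single-producer, single-consumer lock-free ring and a consumer that drains it and groups operations by shape. Several things here are assumptions made for the example: the ring lives in ordinary process memory rather than a real shared-memory segment, each operation is packed into a `u64` with a hypothetical `shape_key` in the upper 32 bits and a request id in the lower 32 bits, and the names `SpscRing`, `pack`, and `drain_into_batches` are invented. The cuBLASLt call itself is not shown; each resulting group stands in for one batched launch.

```rust
use std::collections::BTreeMap;
use std::sync::atomic::{AtomicU64, AtomicUsize, Ordering};

const CAP: usize = 8; // ring capacity (assumed; a power of two)

/// Single-producer, single-consumer ring of packed op descriptors.
/// In a multi-process setup these fields would sit in a shared-memory
/// mapping; here they are plain in-process atomics for illustration.
pub struct SpscRing {
    slots: [AtomicU64; CAP],
    head: AtomicUsize, // next index to read; written only by the consumer
    tail: AtomicUsize, // next index to write; written only by the producer
}

impl SpscRing {
    pub fn new() -> Self {
        SpscRing {
            slots: std::array::from_fn(|_| AtomicU64::new(0)),
            head: AtomicUsize::new(0),
            tail: AtomicUsize::new(0),
        }
    }

    /// Producer side: returns false when the ring is full.
    pub fn push(&self, op: u64) -> bool {
        let tail = self.tail.load(Ordering::Relaxed);
        let head = self.head.load(Ordering::Acquire);
        if tail.wrapping_sub(head) == CAP {
            return false;
        }
        self.slots[tail % CAP].store(op, Ordering::Relaxed);
        // Release publishes the slot write before the new tail is visible.
        self.tail.store(tail.wrapping_add(1), Ordering::Release);
        true
    }

    /// Consumer side: returns None when the ring is empty.
    pub fn pop(&self) -> Option<u64> {
        let head = self.head.load(Ordering::Relaxed);
        let tail = self.tail.load(Ordering::Acquire);
        if head == tail {
            return None;
        }
        let op = self.slots[head % CAP].load(Ordering::Relaxed);
        // Release tells the producer that this slot may be reused.
        self.head.store(head.wrapping_add(1), Ordering::Release);
        Some(op)
    }
}

/// Pack a hypothetical shape key and request id into one descriptor.
pub fn pack(shape_key: u32, request_id: u32) -> u64 {
    ((shape_key as u64) << 32) | request_id as u64
}

/// Drain everything currently queued and group request ids by shape key.
/// Each group could then be issued as a single batched GEMM call.
pub fn drain_into_batches(ring: &SpscRing) -> BTreeMap<u32, Vec<u32>> {
    let mut batches: BTreeMap<u32, Vec<u32>> = BTreeMap::new();
    while let Some(op) = ring.pop() {
        let shape_key = (op >> 32) as u32;
        let request_id = op as u32;
        batches.entry(shape_key).or_default().push(request_id);
    }
    batches
}
```

With six queued operations alternating between two shapes, the consumer ends up with two groups of three, so six would-be launches collapse into two batched calls. How much of that saving shows up as wall-clock speedup depends on whether the workload is launch-bound in the first place.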

The results are significant: we’ll show a 90%+ reduction in kernel launches and a 22% speedup on a realistic FP16 decode workload. We’ll also explore why this same technique results in a slight slowdown for compute-bound prefill workloads, providing a nuanced, first-principles look at a core optimization used by all major inference engines.

Update: GitHub repo added

Links

Tech stack