Late last year I was building a multi-model agentic pipeline. Not a demo — something I wanted to actually run: audio in, Whisper for transcription, a small intent classifier, a RAG retrieval step, and finally Llama 3.1 8B for the response. Five models, one machine.

The GPU I had was a single RTX 4090 with 24GB of VRAM. That should’ve been enough. Spoiler: the way existing inference stacks work, it wasn’t.

The Actual Problem

Most popular inference stacks are optimized for maximizing throughput of a single active model instance, not coordinating multiple heterogeneous models sharing the same GPU.

When you load an 8B model into vLLM, it takes ownership of the GPU. PagedAttention, the KV cache, the CUDA memory pool — all allocated up front and held. That’s fine if you have one model and that model is always busy.

In an agentic pipeline it’s a disaster. While Whisper is running, Llama is sitting in VRAM doing nothing. When Llama’s turn comes, it’s already there — but so is Whisper, and so is the classifier. You run out of memory not because any one model is too large, but because you’re paying the full residency cost of all of them simultaneously.

The fallback behavior is eviction to host memory, followed by a PCIe transfer when you need the weights back. On a 4090 with PCIe 4.0, that’s roughly 32 GB/s of bandwidth. Llama 3.1 8B in bf16 is ~16GB, so you’re looking at roughly half a second just to page the weights back in, and that’s the theoretical best case.
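That back-of-the-envelope calculation, as a sketch. The bandwidth figure is the theoretical PCIe 4.0 x16 peak; real transfers come in lower, so the actual penalty is worse:

```python
# Cost of paging model weights back over PCIe, best case.
# 32 GB/s is the theoretical PCIe 4.0 x16 peak; real transfers are slower.
PCIE4_X16_GBPS = 32.0    # GB/s, theoretical peak
LLAMA_8B_BF16_GB = 16.0  # ~8B params * 2 bytes each

reload_seconds = LLAMA_8B_BF16_GB / PCIE4_X16_GBPS
print(f"best-case weight reload: {reload_seconds:.2f} s")  # 0.50 s
```

Half a second per eviction, per model, every time a stage needs its weights back.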

I tested this explicitly: the conventional setup required two GPUs (an RTX 5080 + RTX 4090) to run the pipeline without hitting the eviction penalty. That felt wrong. The hardware was clearly capable — I just needed the software to be smarter about it.

Early Efforts: The Duct Tape Era

Before I gave up and dropped down to bare metal, I really tried to make the existing tools work. I had pages of notes on these experiments; here’s a snapshot of some of the things I tried:

  • Partial layer quantization in PyTorch: Trying to get the most out of my GPU without nuking the weights, and building testing harnesses to handle FP8.
  • Dynamic model switching: Swapping models in and out of VRAM on the fly—painfully, unworkably slow.
  • Dynamic layer rotations: Trying to swap just the active layers of a model like a conveyor belt.
  • Event-driven orchestration: Wiring everything together across microservices.
  • State machines: Trying to propagate inference states across the pipelines to keep the whole disjointed mess in sync.

One place this really broke down was when I deployed an audio note taker + data augmentation workflow. No matter what I did, I was consistently looking at ~13–14 seconds of latency — even for simple, predictable flows. That number just wouldn’t move.

Dynamic model switching was where things went ‘aaaaaaaaaaaaah’. On paper it seemed reasonable — keep only the active model in VRAM and swap the rest out.

But once I actually timed it, pulling weights back over PCIe was taking hundreds of milliseconds, sometimes stretching into tens of seconds, per transition. Individually that didn’t sound catastrophic, but in a pipeline with multiple stages it stacked up fast.

At some point I stopped thinking “this needs optimization” and started suspecting the architecture itself was the problem.

I kept running into the same issues:

  1. Orchestration complexity: It scales badly once you have more than a couple of models interacting.
  2. Fragility: Changing even one model in the pipeline had unpredictable effects on latency and memory behavior.
  3. Lack of control: There’s no way to tell the runtime “this model can leave VRAM right now because I won’t need it for a while.”

All of this frustration led me to a rather foolhardy effort: What if we just strip away all the abstractions?

What I Built

Deciding at Compile Time

The first thing I killed was runtime discovery. Every time you run inference through a typical stack, it spends cycles figuring out kernel tiles, block sizes, and memory layouts. PyTorch is exceptionally well optimized for dynamic dispatch, but it’s still a tax on every request.

I moved all of that to a “Builder” phase: a compile step that runs once per target machine and produces a statically optimized binary. On an RTX 4090, the binary already knows the optimal TILE_M and TILE_N for every GEMM shape it’ll encounter (the same work the Triton compiler does when it hits an uncached path, done once ahead of time instead of on the request path). This avoids per-request kernel selection overhead and runtime heuristics.

The output binary is a few hundred MB, compared to the multi-GB runtimes typical of conventional stacks.
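To make the Builder idea concrete, here is a minimal sketch of a build-time tile selector. Everything here is hypothetical (the candidate list, the cost model, the emitted format); a real builder would time candidate kernels on the target GPU rather than score them analytically, but the shape of the decision is the same: pick per-shape tiles once, bake them into the artifact.

```python
# Hypothetical builder-phase tile selector. A real builder would benchmark
# candidate kernels on the target GPU; this sketch scores each candidate
# (TILE_M, TILE_N) by the padding waste it incurs for a given GEMM shape.

CANDIDATES = [(64, 64), (64, 128), (128, 64), (128, 128)]

def padded_waste(m, n, tm, tn):
    """Fraction of compute spent on padding when (m, n) is tiled by (tm, tn)."""
    pm = -(-m // tm) * tm  # m rounded up to a multiple of tm
    pn = -(-n // tn) * tn  # n rounded up to a multiple of tn
    return (pm * pn - m * n) / (pm * pn)

def select_tile(m, n):
    """Least padding waste wins; ties go to the larger tile."""
    return min(CANDIDATES, key=lambda t: (padded_waste(m, n, *t), -t[0] * t[1]))

def emit_header(shapes):
    """Bake the per-shape decisions into generated source, one line per shape."""
    lines = []
    for m, n in shapes:
        tm, tn = select_tile(m, n)
        lines.append(f"// GEMM {m}x{n}: TILE_M={tm} TILE_N={tn}")
    return "\n".join(lines)

print(emit_header([(4096, 4096), (4096, 14336)]))
```

The point is where the decision happens: at build time, once, instead of per request.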

Fused Kernels

I also stopped thinking in terms of operators and started thinking in terms of DRAM traffic. A standard forward pass does roughly: load from DRAM → RMSNorm → write to DRAM → load → QKV projection → write → load → RoPE → write. Every round trip to DRAM is expensive.

I wrote fused kernels that combine RMSNorm, QKV projection, and RoPE into a single pass. DRAM sees the input once and the output once. The intermediate activations stay in SRAM the whole time.
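For reference, this is the math the fused pass computes, written out in numpy. This is not the kernel; the fusion is about DRAM traffic, not arithmetic. The real kernel computes exactly this chain but keeps `h` and the projections in SRAM. Shapes and names here are illustrative, not taken from the runtime, and RoPE is applied to a flat head dimension for brevity:

```python
import numpy as np

# Reference math for the fused RMSNorm -> QKV -> RoPE pass (numpy, not a
# kernel). In the fused kernel, h and the projections never touch DRAM.

def rmsnorm(x, g, eps=1e-6):
    return x * g / np.sqrt(np.mean(x * x, axis=-1, keepdims=True) + eps)

def rope(x, pos, base=10000.0):
    d = x.shape[-1]
    inv = base ** (-np.arange(0, d, 2) / d)   # per-pair rotation frequencies
    ang = pos[:, None] * inv[None, :]
    cos, sin = np.cos(ang), np.sin(ang)
    x1, x2 = x[..., 0::2], x[..., 1::2]
    out = np.empty_like(x)
    out[..., 0::2] = x1 * cos - x2 * sin      # rotate each even/odd pair
    out[..., 1::2] = x1 * sin + x2 * cos
    return out

def fused_norm_qkv_rope(x, g, wq, wk, wv, pos):
    h = rmsnorm(x, g)                          # stays in SRAM in the kernel
    return rope(h @ wq, pos), rope(h @ wk, pos), h @ wv  # v is not rotated

rng = np.random.default_rng(0)
T, D = 8, 64
x, g = rng.standard_normal((T, D)), rng.standard_normal(D)
wq, wk, wv = (rng.standard_normal((D, D)) for _ in range(3))
q, k, v = fused_norm_qkv_rope(x, g, wq, wk, wv, np.arange(T, dtype=float))
```

Unfused, each of those intermediate tensors makes a DRAM round trip; fused, only `x` in and `q`, `k`, `v` out touch DRAM.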

The numbers on a fused kernel vs PyTorch:

Operation               PyTorch    Neurafewz    Speedup
Compute-bound fusion    0.46 ms    0.34 ms      1.34x
Memory-bound fusion     1.98 ms    0.95 ms      2.07x

(Numbers are median of repeated runs; variance depends on shape and kernel mix.)

Peak utilization on the fused RMSNorm/QKV kernel hit 89.9% of the 4090’s theoretical TFLOPS. Getting that close to the hardware ceiling was the signal that the approach was right.

Treating Models Like Processes

The scheduling insight came from thinking about this the same way an OS thinks about processes. In a standard OS, no single process owns all of physical memory. Memory is allocated, shared, paged in and out as needed. Processes are scheduled — they get CPU time, yield, get preempted.

I implemented something analogous for GPU inference. Models don’t own VRAM permanently. They have a working set that’s resident when they’re active and can be staged out when they’re idle. While the audio kernel is processing, the runtime is DMA-transferring the Llama weights I’ll need for the next stage.
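The overlap described above can be sketched with ordinary threads. This is a toy model, not the runtime: `time.sleep` stands in for kernel execution and for the async DMA copy, and the durations are made up. The point it demonstrates is that total stage time approaches max(compute, copy) rather than compute + copy:

```python
import threading
import time

# Toy model of compute/transfer overlap: while the current stage's "kernel"
# runs, a background thread stages in the next model's weights. Durations
# are invented; sleep() stands in for kernel time and for the DMA copy.

def run_stage(compute_s, next_weights_copy_s):
    prefetch = threading.Thread(target=time.sleep, args=(next_weights_copy_s,))
    prefetch.start()        # "DMA" of the next model's working set begins
    time.sleep(compute_s)   # current stage's kernel time
    prefetch.join()         # by now the next stage's weights are resident

start = time.perf_counter()
run_stage(compute_s=0.20, next_weights_copy_s=0.15)
elapsed = time.perf_counter() - start
print(f"overlapped: {elapsed:.2f}s (sequential would be ~0.35s)")
```

The copy hides entirely behind the compute as long as it is the shorter of the two, which is the property the scheduler tries to maintain across the pipeline.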

The result: the full pipeline (~11-second audio clip → Whisper transcription → Llama 3.1 8B response) runs on a single 4090 in 1.7 seconds (full decode, half precision), without eviction. The conventional setup for the same pipeline needed two GPUs.

That’s a 19% latency improvement over the two-GPU baseline, at half the hardware cost.

What Surprised Me

A few things didn’t go as expected during this build:

Faster-whisper’s lazy evaluation. When I was benchmarking the pipeline end-to-end, the numbers weren’t making sense. I eventually traced it to model.transcribe() — it returns a generator. The actual inference runs when you iterate over it, not when you call the function. My timeit loops were measuring the generator construction, not the transcription.
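A minimal reproduction of the pitfall, with a stand-in generator (`slow_steps` below is hypothetical; in faster-whisper the same thing happens with the generator returned by `model.transcribe()`):

```python
import time

# Timing a function that returns a generator measures construction,
# not the work. slow_steps() stands in for a lazy transcribe() call.

def slow_steps(n):
    for i in range(n):
        time.sleep(0.05)   # stands in for actual decoding work
        yield i

t0 = time.perf_counter()
segments = slow_steps(4)       # no work has happened yet
construct = time.perf_counter() - t0

t0 = time.perf_counter()
results = list(segments)       # iterating is what triggers the work
consume = time.perf_counter() - t0

print(f"construct: {construct*1e3:.2f} ms, consume: {consume*1e3:.0f} ms")
```

The fix for benchmarking is simply to force the iteration inside the timed region.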

The FP8 layout problem. I spent more time than I’d like on the B operand load for the FP8 GEMM kernel. Migrating from ld.shared.v2.u32 to ldmatrix.x4.m8n8.b16 required redesigning the offline pre-tiling permutation to match ldmatrix’s smem layout expectations, including correctly handling the transpose step. The PTX documentation is not generous with examples here.

Bound checks in hot loops. This was actually a lesson I’d already learned while building diff-match-patch-rs — removing bound checks in a hot loop gives you disproportionate gains. Same thing here. The compiler can’t eliminate bounds checks it can’t prove are unnecessary. Restructuring the inner loop to make that proof possible was one of the higher-leverage things I did.

Where This Is Going

The pipeline I described — Whisper, classifier, RAG, LLM — is not an unusual architecture. Anyone building a voice agent or a document assistant is doing something similar. The infrastructure problem I hit is not unique to my setup.

What I’m building with Neurafewz is an inference runtime that takes this approach seriously: compile-time optimization, cross-model memory allocation, and preemptive scheduling as first-class primitives. The goal is to make “run five models on one GPU without paying the coordination tax” not something every team has to figure out from scratch.

More on the architecture in a separate post. For now — if you’re building multi-model pipelines and hitting the same walls, I’d be curious what your setup looks like.