GPU Poverty and the Escape to 'Framework-less'
Late last year I was building a multi-model agentic pipeline. Not a demo — something I wanted to actually run: audio in, Whisper for transcription, a small intent classifier, a RAG retrieval step, and finally Llama 3.1 8B for the response. Five models, one machine. The GPU I had was a single RTX 4090 with 24GB of VRAM. That should've been enough. Spoiler: given how existing inference stacks work, it wasn't.
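On paper, the weights alone do fit. A rough back-of-envelope check — with placeholder fp16 size estimates I'm assuming for the larger models, not measured numbers — looks like this:

```python
# Back-of-envelope VRAM budget in GB (fp16 weight sizes; rough assumptions, not measurements)
models = {
    "whisper-small": 0.5,       # ~244M params at 2 bytes each (assumed variant)
    "intent-classifier": 0.3,   # small encoder-style classifier (assumed size)
    "rag-embedder": 0.5,        # sentence-embedding model for retrieval (assumed size)
    "llama-3.1-8b": 16.0,       # ~8B params * 2 bytes in fp16
}
weights_total = sum(models.values())
print(f"weights: {weights_total:.1f} GB of 24 GB")
# Leaves headroom on paper — before KV cache, activations,
# CUDA context, and per-framework runtime overhead eat into it.
```

The point of the sketch is what it leaves out: each inference framework brings its own allocator, CUDA context, and KV-cache reservations, so the real footprint lands well above the raw weight sizes.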