GPU Poverty and the Escape to 'Framework-less'
Late last year I was building a multi-model agentic pipeline. Not a demo — something I wanted to actually run: audio in, Whisper for transcription, a small intent classifier, a RAG retrieval step, and finally Llama 3.1 8B for the response. Five models, one machine. The GPU I had was a single RTX 4090 with 24GB of VRAM. That should’ve been enough. Spoiler: the way existing inference stacks work, it wasn’t....
From 'Very Fast' to '~Fastest': Helping Rust Unleash Compiler Optimizations
diff-match-patch-rs
A few years back, while building HandyTrain, we decided to build a collaborative content-creation feature. Among other things, we needed a text-synchronization library: a WASM version for the client and a high-performance library for our Go and Rust services. After some research we landed on the fantastic diff-match-patch algorithms; the diff part is an implementation of this paper (often called Myers' diff algorithm) by Eugene Myers....
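For a flavor of what the diff part does, here is a minimal from-scratch sketch of the greedy forward pass of Myers' O(ND) algorithm, computing the length of the shortest edit script between two byte sequences. This is an illustration of the underlying idea, not the diff-match-patch-rs API:

```rust
/// Length of the shortest edit script (insertions + deletions) between
/// `a` and `b`, via the greedy forward pass of Myers' O(ND) algorithm.
fn myers_distance(a: &[u8], b: &[u8]) -> usize {
    let n = a.len() as isize;
    let m = b.len() as isize;
    let max = n + m;
    let offset = max + 1; // shift diagonal index k into non-negative range
    // v[offset + k] = furthest x reached on diagonal k (where k = x - y)
    let mut v = vec![0isize; (2 * max + 3) as usize];
    for d in 0..=max {
        let mut k = -d;
        while k <= d {
            // Decide whether this step extends from an insertion (move down
            // in b) or a deletion (move right in a).
            let mut x = if k == -d
                || (k != d && v[(offset + k - 1) as usize] < v[(offset + k + 1) as usize])
            {
                v[(offset + k + 1) as usize] // down: insert a character from b
            } else {
                v[(offset + k - 1) as usize] + 1 // right: delete a character from a
            };
            let mut y = x - k;
            // Follow the "snake": free diagonal moves while characters match.
            while x < n && y < m && a[x as usize] == b[y as usize] {
                x += 1;
                y += 1;
            }
            v[(offset + k) as usize] = x;
            if x >= n && y >= m {
                return d as usize; // reached the end of both sequences
            }
            k += 2;
        }
    }
    max as usize
}
```

On the paper's classic example, `myers_distance(b"ABCABBA", b"CBABAC")` yields 5. The real library additionally recovers the actual edit operations (by tracing the path back) and layers semantic cleanup and patching on top, but the diagonal/snake search above is the core.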
Desktop App for Document QA with RAG
A DIY-style, step-by-step guide to building your own cutting-edge GenAI-powered document QA desktop app with RAG.
WASM: The `What`, `When` and `How`
I’ve been using WebAssembly (aka WASM) in production for a while to do some incredible stuff in the browser, things that would otherwise be prohibitively slow. Here are some real use-cases I’ve used WASM for: Running statistical processing on 10 million rows (roughly 50 columns per row) of CSV data in the browser. This feature required us to create a temporary playground of reports generated for our clients, where they could run their own analysis without the need for permanent storage or costly servers....
Voice Assistant Desktop App with LLaMA3 and Whisper in Rust
Step-by-step tutorial on building a desktop app to interface with the LLaMA3 LLM using text and audio instructions, in Rust.