For fifteen-odd years, every few years, I’ve found myself staring at the same problem wearing a different costume. Sometimes it appears as server costs that don’t quite square with usage. Sometimes as a few extra milliseconds in response time—the kind you’d shrug off if you weren’t paying close attention. Sometimes as features that, once they hit scale, start standing out as resource hogs: not broken, just hungry.

“When you hear hoofbeats, think horses, not zebras.” — Dr. Theodore Woodward

Woodward was teaching medical students to not over-diagnose. Most of the time, he was right. But if you listen long enough, you start noticing the hoofbeats have a pattern. Same rhythm. Same gait. Same animal, painted different colors.

I’ve been chasing the same zebra for fifteen years.


2012 — The First Encounter

Picture the internet in 2012. Firefox’s JavaScript engine was briefly faster than Chrome’s. The internet I was building for crawled on 512 Kbps. The “cloud” was still a thing people said with air quotes.

I was building browser-based psychometric games — in-game choices fed an SVM, which fed the next beat of the game, and the cumulative breadth of those choices painted a picture of the player’s behavioral profile.

But games aren’t games if they stutter, and psychometric games especially aren’t psychometric games if the player is thinking more about their internet connection than about their own choices.

The horse: throw more servers, faster API. I went looking for the zebra: why is this round-tripping at all? What if the whole SVM lived in the browser itself? In 2012, writing a pure-JavaScript SVM implementation wasn’t exactly a well-trodden path — but once I got there, 1000x improvement in UX. The inference cost graph went flat, because when inference runs in the user’s browser, the user pays for the compute. For a bootstrapped startup, that wasn’t optimization. That was survival.
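
The core of that hack was small enough to sketch. What follows is a hypothetical, stripped-down version, assuming a linear kernel and weights trained offline and shipped to the page as JSON; the names and numbers are illustrative, not the original 2012 code:

```ts
// Hypothetical sketch: a linear SVM trained offline, with its weights shipped
// to the browser and evaluated locally, so inference never leaves the page.

interface LinearSvm {
  weights: number[]; // one weight per behavioral feature
  bias: number;
}

// Decision function: sign(w · x + b). Runs in microseconds, no round trip.
function classify(model: LinearSvm, features: number[]): 1 | -1 {
  let score = model.bias;
  for (let i = 0; i < features.length; i++) {
    score += model.weights[i] * features[i];
  }
  return score >= 0 ? 1 : -1;
}

// Illustrative weights and an encoded in-game choice vector (made-up values).
const model: LinearSvm = { weights: [0.8, -1.2, 0.3], bias: 0.1 };
const playerFeatures = [1, 0, 1];
const nextBeat = classify(model, playerFeatures) === 1 ? "branchA" : "branchB";
console.log(nextBeat); // the next beat of the game branches locally
```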

I didn’t have a word for what I’d done back then. I thought I’d found a clever hack. I hadn’t. I’d met the zebra for the first time.


2014 — Familiar Hoofbeats

Clusto — what Google Workspace is today, two years early, living inside Gmail with deep integrations into Asana, Trello, Dropbox, Calendar, and half a dozen other tools people actually used.

My co-founder Utkarsh wrote and trained the ML that classified and clustered emails, tasks, notes, and meetings into related Clusters. It worked. The problem was the math of it: every user’s inbox held thousands of emails, we had a lot of users, and server-side classification at that volume started to feel like trying to drain an ocean with a teacup. Every marginal user made our infrastructure sweat a little more — and then, in non-peak hours, it sat idle. This was supposed to be instant. Not lazy.

And again, the horses were all there. Maddy, my co-founder, lost sleep trying to tame them — scale up, batch the jobs, optimize the classifier, rent more iron.

But the zebra was hiding in plain sight: why were we burning compute on the 3,847th email from 2010 when the user was staring at an email from this morning?

We moved the heavy lifting of the immediate into the browser. Everything else could take its own time, saturating the infrastructure we already had instead of demanding more of it. The math finally made sense.


2016 — The Zebra Started Showing Its Stripes

HandyTrain was where it stopped being a coincidence.

HandyTrain — content creation and distribution for India’s deskless workforce, administered by India’s largest corporates, offline-capable by necessity. ML and AI were everywhere: content creation, translation, adaptive curriculum, progress analysis. Our ML pipelines had also become far more sophisticated than in the email-clustering days. By early 2019, the first wave of transformer-era models had started making their move — we fell in love with BERT.

I had concocted a BiLSTM-plus-BERT model that inferred the visual hierarchy of content. Creators loved it. But it ran every few hours — an expensive chef in an empty restaurant. Always-on servers made no economic sense. On-demand cold starts took 10–15 minutes; creators gave up waiting.

We rewrote it in WASM (and later WebGPU). Cold start: zero. Our cost to deploy this feature: zero. Content creation that took 8–10 minutes of waiting now took 15 seconds. Adaptive learning got pushed into mobile apps — which accidentally created a category that didn’t exist before: offline adaptive learning.

It’s also where Abhishek walks in. My second-in-command at HandyTrain back then, the person I’d pull into a room when the problem was too tangled for one brain. Donald joined a little later as Product Head, and between the three of us, we’d end up at a whiteboard more often than was probably healthy. But the relentless push towards efficiency gave Donald the freedom to solve deeper problems — which meant more complex pipelines and more models. None of us knew it, but years of those arguments would make both of them my co-founders.

Not everything lived on the edge — some workflows genuinely belonged on the server. But the question had flipped. It was no longer “where can we afford to run this?” It was “where does this want to run? How much of the peak theoretical efficiency can we achieve?” And the answer was almost always: closer to the user, closer to the data, closer to the moment of actual use, a deep integration of problem, data, and usage.


2022 to 2025 — The World Changed and the Zebra Grew Canines

The pre-ChatGPT months were a strange time. Something had shifted in the realm of possibilities. GPT-3 was good, but then ChatGPT dropped, and suddenly every product in the world had to have an answer to “how are you using AI?” — and most of those answers were, charitably, rushed.

But what made this era different — what gave the old zebra actual teeth — was the sheer mass of the new generation. My 2019 BiLSTM-plus-BERT concoction fit comfortably inside ~200MB. A modern 70B-parameter model weighs 140GB before you’ve asked it a single question. GPUs that comfortably hosted a few whole ML pipelines five years ago now strain under a single LLM. And these models aren’t static — each conversation grows its own live memory footprint (the KV cache), breathing and expanding with every token. The resource hunger of the previous era was a diet compared to this. And then agentic workloads arrived and changed the math again: pipelines that don’t just chain three models but recursively call models, sometimes the same model twenty times in a single user request. The zebra had grown canines.
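
The back-of-the-envelope arithmetic, assuming FP16 weights and a generic 70B-class transformer shape (80 layers, 8 KV heads, 128-dim heads; none of it tied to a specific model), looks roughly like this:

```ts
// Rough memory math. The shapes below are illustrative assumptions for a
// generic 70B-class transformer, not any particular model's spec.

const BYTES_PER_VALUE_FP16 = 2;

// Weights alone: 70e9 parameters * 2 bytes ≈ 140 GB before the first token.
const weightBytes = 70e9 * BYTES_PER_VALUE_FP16;
console.log(`weights: ${(weightBytes / 1e9).toFixed(0)} GB`);

// KV cache: 2 tensors (K and V) * layers * kvHeads * headDim bytes per token,
// per live sequence. It grows with every token of every open conversation.
const layers = 80, kvHeads = 8, headDim = 128;
const kvBytesPerToken = 2 * layers * kvHeads * headDim * BYTES_PER_VALUE_FP16;

const contextTokens = 32_000; // one long-ish conversation
const kvGb = (kvBytesPerToken * contextTokens) / 1e9;
console.log(`KV cache at 32k tokens: ${kvGb.toFixed(1)} GB for a single conversation`);
```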

We’d been adapting LLMs into production workflows for a while (we’d built real pipelines on BERT, T5, GPT-2, and their variants), and two observations dawned on me.

The first

A production pipeline is almost never a single model doing everything. A small classifier for routing, a model for extraction, maybe an embedding model for retrieval — all dancing together to produce a single answer. Each problem needs a different choreography.
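
A hypothetical sketch of that choreography, with stage names, types, and stub implementations invented purely for illustration:

```ts
// Three small models, each with one job, composed into one answer.
// Everything here is a stand-in: real stages would call real models.

type Ticket = { text: string };

const routeTicket = async (t: Ticket) =>
  t.text.toLowerCase().includes("invoice") ? "billing" : "technical"; // tiny classifier

const extractEntities = async (t: Ticket) =>
  ({ entities: t.text.match(/\b[A-Z]{2,}\b/g) ?? [] }); // extraction model

const retrieveContext = async (entities: string[]) =>
  entities.map((e) => `doc-about-${e}`); // embedding model + vector index

async function answer(ticket: Ticket) {
  const route = await routeTicket(ticket);
  const { entities } = await extractEntities(ticket);
  const context = await retrieveContext(entities);
  return { route, context }; // one user request, three models, one answer
}

answer({ text: "INVOICE 4312 was charged twice" }).then(console.log);
```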

The second

Which is the one that keeps me up at night: adding more models compounds the cost of inference, and quickly enough, a pipeline that should have healthy unit economics on paper ends up with profoundly negative unit economics in practice. And when that happens, something sad and very human starts to take over.

Engineers start treating their deployed models as owned resources. Not “tools we can pick from,” but “assets we’ve already bought and have to justify.” Decisions start revolving around what you have rather than what you need. Adding a new model to a pipeline means new resources, which means new bills, which means new conversations with finance — which means, and I’ve watched this happen a hundred times, developers end up firing up a 70B-parameter LLM to classify a thousand support tickets. Because they already own it. Because the right tool for the job — a tiny fine-tuned classifier — would mean adding a model, and adding a model means pain.

Same zebra. Infrastructure warps the shape of the solution.

I started building the obvious answer: on-demand, event-driven, resource-orchestrated, plug-and-play AI. It worked — kind of. Orchestrating models as events turned into its own nightmare. One misfire corrupted multiple pipelines. Models sat idle waiting for upstream inference. Under load, each model had its own personality — its own cold-start mood, memory quirks, scaling preferences. Fully serverless variants had cold starts worse than the problem they solved.

Around this time, Abhishek was deep in Azure, running into the same zebra from the infrastructure side. Image caching, pre-warmed node pools, aggressive driver pre-loading — every trick the cloud gods had published. All clever. None of them dissolving the underlying problem: that the abstractions cloud providers had built for web servers were being asked to do a job they were never designed for.

We compared notes. The conversations started sounding like therapy sessions for two people being haunted by the same ghost.

PyTorch’s Triton-compiled deployments were the most efficient by a mile — until you measured cold start, and the compile step dragged first-response latency off a cliff. Every solution was too expensive, too slow, or quietly destroyed accuracy in a way that would only show up weeks into production.

And then — and I don’t think either of us remembers the exact moment, but I remember the shape of it — Abhishek and I were on a call, mid-gripe, and one of us said it out loud:

Multi-model workflows don’t need better tools. They need a fundamentally different treatment.

Not a better orchestrator. Not a smarter autoscaler. Not a faster cold start on top of the same abstractions that were never built for this. Something underneath the orchestrator. Something that treats “many models, one GPU, coordinated execution” as the first-class object — not a weird edge case bolted onto infrastructure built for web servers. Something, we realized, that looked less like a framework and more like an operating system — but for inference.

The mistake we kept seeing, and kept making ourselves, was treating models as deployable services. But the real unit of work had shifted underneath us: many models, shared resources, coordinated execution, and brutal sensitivity to latency, memory, and cost.

That’s the discussion where Neurafewz was born.