In my previous series Desktop QA Assistant With Llama3 in Rust we built a desktop app capable of interfacing with LLaMA3. In this post, we’ll extend the modality of the app to accept audio instructions instead of just text.
In this part of the series, we’ll deal with the setup, load the models and get our text inference up and running using the Rust ML framework Candle.
The series:
Who is this for?
You’ll feel right at home if you are a programmer, have some exposure to Rust and a bit of experience working with Svelte, React or any other modern client-side framework.
Tools and Libraries
- Rust - Install Rust
- Tauri - A cross-platform desktop app toolkit built on Rust
- SvelteKit - For the quick and simple UI
- Meta LLaMA3 8B - we are going to use a `gguf` version of LLaMA3 8B. `gguf` is a file format created by Georgi Gerganov
- Distil Whisper large-v3 - the knowledge-distilled version of OpenAI Whisper-large-v3
- Candle - a minimalist ML framework in Rust by the awesome HuggingFace🤗 folks
Note on GGUF: GGUF is a file format for storing models for inference with GGML and executors based on GGML. GGUF is a binary format designed for fast loading and saving of models, and for ease of reading. Models are traditionally developed using PyTorch or another framework, and then converted to GGUF for use in GGML.
If you are not familiar with the GGML/GGUF ecosystem, I’ve written a small note about it here.
About Tauri 2.0 Beta: I’m taking this opportunity to test out Tauri 2.x Beta. I’ve never used it before, so I might change a few things from the setup of the previous project; hiccup warnings in advance.
TL;DR
The Final Output
Alert: contains audio
A note on Multimodality: Model vs Pipeline:
Modality:
Learning modalities are the sensory channels or pathways through which individuals give, receive, and store information. Perception, memory, and sensation comprise the concept of modality. The modalities or senses include visual, auditory, tactile/kinesthetic, smell, and taste.
In the context of ML/DL, modality refers to the different kinds of data a model can work with, comprehend or process.
Multimodal Model:
Unimodal models like the BERT family use text as their modality, while models like ResNet-18 use images; i.e. they can comprehend, interpret and work with text (BERT) or image (ResNet) data, because that’s what they have been trained on and what they can run inference on.
On the other hand, multimodal models can work with, comprehend and interpret multiple modalities like text, image, audio and so on.
E.g.
- ChatGPT can work with text, image and audio inputs & outputs
- The open-source LLaVA family of models is trained to work with both `text` and `image` inputs
Simply put, a multimodal model can work with more than one type of input or output.
Multimodal Pipeline: Mocking Multimodality with multiple unimodal models
Multimodal models tend to have complex architectures and can sometimes suffer from a not-so-efficient jack of all trades, master of none problem. While solving real-world problems with AI/ML, I’ve often found that a bunch of specialist models working together produces significantly better results than a single multimodal generalist.
In this project we are going to stitch together 2 specialists: the OpenAI Whisper audio-to-text model for transcribing the audio, and the Meta LLaMA3 text and language specialist model for our LLM backend. We’ll join them with our multimodal pipeline.
Setup
Setting up tauri
With the rust toolchain and tauri-cli installed, we’ll just run
cargo create-tauri-app audio-instruct --beta
Prerequisites
For the options create-tauri-app --beta asks for, my choices were as follows:
- Frontend Language -> Typescript
- Package Manager -> npm
- UI Template -> Svelte
- UI Flavor -> Typescript
- Mobile Project -> No - we’ll try this in a future project 🤩

Now, we’ll move into our project directory and run
npm i
After the installation completes, let’s run our desktop app for the first time.
npm run tauri dev
And we get a neat looking default window

Observations
- Tauri 2.0 feels like a much more polished product than what we got in Tauri v1 (which is expected). The out-of-the-box experience is far superior, and the initial boilerplate setup has also been reduced significantly. 🥳
- But, because Tauri 2.0 is much more powerful, its permission structure is now a lot more modular, and the v2.x documentation is still early and requires a bit of a deep dive to find the right stuff.
The project/directory structure looks familiar; at least superficially I don’t see a major difference from Tauri v1. You can find a quick note and explanation about the project layout here.
A notable change from the Tauri v1 structure is the audio-instruct/src-tauri/capabilities directory. As per the Tauri 2.0 documentation on capabilities, this directory contains a set of permissions, defined as JSON, mapped to the application windows and webviews by their respective labels. This is what my audio-instruct/src-tauri/capabilities/default.json looks like (to enable dialogs):
{
  "$schema": "< path to >desktop-schema.json",
  "identifier": "default",
  "description": "Capability for the main window",
  "windows": ["main"],
  "permissions": [
    "path:default",
    "event:default",
    "window:default",
    "app:default",
    "image:default",
    "resources:default",
    "menu:default",
    "tray:default",
    "shell:allow-open",
    "dialog:allow-open",
    "dialog:allow-save",
    "dialog:allow-message",
    "dialog:allow-ask",
    "dialog:allow-confirm",
    "dialog:default"
  ]
}

Setting up the backend dependencies
Candle first
We are going to use Candle as the ML framework for our inference. Candle requires a slightly more involved setup, so let’s get that out of the way first. The instructions for setting up Candle for a project can be found here.
cargo add --git https://github.com/huggingface/candle.git candle-core --features "metal"
Choosing a `Backend`: Because I’m on a Mac M1 I’m using the `metal` backend. Candle also works with `cuda`, `mkl` etc. Choose the GPU backend appropriate for your machine. Candle has reasonably elaborate guides; check them out if you get stuck (I did, when I attempted this for the first time).
More on Tauri
Tauri 2.0 introduces a much more powerful and modular approach to access, permissions and scopes. We’ll need to add a couple of plugin crates to our Cargo.toml to get our Dialog and File System access right.
- Crate `tauri-plugin-dialog` for enabling frontend dialogs, confirmation etc.
- Crate `tauri-plugin-fs` for filesystem access
Usual suspects
Now, with candle-core added to our dependencies, let’s go ahead and add the usual suspects: log & pretty_env_logger for logging, anyhow for errors, etc.
| |
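If you prefer doing this from the command line, the same crates can be pulled in the way we added candle-core earlier (versions left to cargo; treat this as a convenience, not the exact manifest):

cargo add anyhow log pretty_env_logger rand tokenizers hf-hub
cargo add --git https://github.com/huggingface/candle.git candle-transformers
cargo add --git https://github.com/huggingface/candle.git candle-nn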
Note: Notice how I defined `features` for our Tauri project. This will help us create the GPU `Device` automatically based on the `cfg!(feature = <something>)` macro. I’ll run the app with `--features "metal"` and that should initialize the device automatically. If you are using `cuda` or some other GPU device, make sure to change this to `--features "cuda"` etc.
We’ve added:
- `anyhow` to find our way around errors
- `candle-transformers` to almost plug-and-play the `whisper` and `llama` models
- `candle-nn` to work with tensors
- `hf-hub` to download the models
- `log` and `pretty_env_logger` for some neat logging
- `tokenizers` - for decoding our data for inference with `whisper`
- `rand` - we’ll need to generate some random ranges for inference
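To make the feature-flag idea from the note above concrete, here is a minimal sketch of what such a device helper could look like; the name fn device() matches the helper referred to later, everything else is illustrative:

```rust
use anyhow::Result;
use candle_core::Device;

/// Pick a device based on the compile-time feature flags,
/// falling back to CPU when no GPU feature is enabled.
pub fn device() -> Result<Device> {
    if cfg!(feature = "metal") {
        Ok(Device::new_metal(0)?)
    } else if cfg!(feature = "cuda") {
        Ok(Device::new_cuda(0)?)
    } else {
        Ok(Device::Cpu)
    }
}
```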
Structuring our App
We’ll end up having the following overall structure and modules for our inference engine:
- a module `instruct` where the state of the app will be maintained and which the `handlers` will call into
- a `whisper` module to do the audio transcription and the associated pre- and post-processing
- a `llama` module for the text inference
- we’ll also end up with a `utils` module - we’ll always need some helper functions that don’t fit anywhere else
- and a `commands` module where we’ll define our `handlers` (or `commands` in the Tauri universe)
Let’s just go ahead and create these empty files (instruct.rs, whisper.rs, llama.rs and so on …) and declare them as modules in our main.rs
A last bit of boilerplate I always need to do is to update the identifier field in src-tauri/tauri.conf.json to something unique. Let’s change this to audio-instruct.llm.
Our main.rs should now look like the following:
| |
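A rough sketch of that skeleton (module names as above; the plugin registrations assume the two plugin crates we added earlier):

```rust
// src-tauri/src/main.rs (sketch)
#![cfg_attr(not(debug_assertions), windows_subsystem = "windows")]

mod commands;
mod instruct;
mod llama;
mod utils;
mod whisper;

fn main() {
    pretty_env_logger::init();

    tauri::Builder::default()
        .plugin(tauri_plugin_dialog::init())
        .plugin(tauri_plugin_fs::init())
        .run(tauri::generate_context!())
        .expect("error while running tauri application");
}
```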
App-state, commands and communication
To get our app to work, we’ll need:
- An app state that will be instantiated on launch, this will basically hold our models once loaded and expose APIs for preprocessing, inference etc.
- A way of communicating the `instruction` or `command` from the front-end to the backend engine
- We’ll capture audio through the front-end interface and then send it for transcription. We could do this by recording the whole audio in the client interface and then sending the whole blob to the backend, but let’s try to make this a little juicier: we’ll attempt to buffer chunks of the audio and `emit` them to the backend, and we’ll also need an MPSC (multiple producer single consumer) channel to orchestrate this. More on this later.
- And of course, last but not least, we’ll need a couple of `structs` to create instances of the `candle` models and some associated methods.
App-state with struct Instruct
We define our struct Instruct in audio-instruct/src-tauri/src/instruct.rs; we’ll also define an initializer for it, which will be responsible for instantiating the models or downloading them from the HuggingFace Hub with the hf_hub crate.
| |
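To give a feel for the shape of this state, here’s a minimal sketch; the event type, field names and the choice of a std::sync::mpsc channel are my assumptions rather than the exact code:

```rust
// src-tauri/src/instruct.rs (sketch)
use std::sync::{mpsc, Arc};

use anyhow::Result;

use crate::{llama::LlamaWrap, whisper::WhisperWrap};

/// Events pushed by the command handlers and consumed by the listener thread.
pub enum InstructEvent {
    AudioChunk(Vec<f32>),
}

pub struct Instruct {
    llama: LlamaWrap,
    whisper: WhisperWrap,
    tx: mpsc::Sender<InstructEvent>,
}

impl Instruct {
    /// Load (or download) both models and spawn the event-listener thread.
    pub fn new() -> Result<Arc<Self>> {
        let (tx, rx) = mpsc::channel::<InstructEvent>();

        let this = Arc::new(Self {
            llama: LlamaWrap::new()?,
            whisper: WhisperWrap::new()?,
            tx,
        });

        // The listener is an empty shell for now; it just drains the channel.
        std::thread::spawn(move || {
            while let Ok(_event) = rx.recv() {
                // audio chunks will be handed to whisper pre-processing here
            }
        });

        Ok(this)
    }

    /// Push an event onto the channel without blocking the caller.
    pub fn send(&self, event: InstructEvent) -> Result<()> {
        self.tx
            .send(event)
            .map_err(|_| anyhow::anyhow!("event channel closed"))
    }
}
```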
The event listener is pretty much an empty shell for now; we’ll change that soon enough. The idea behind this communication pattern is simple: the command handlers receive a request which is pushed to the MPSC channel, and the end user doesn’t need to wait for it to finish processing.
Command handlers
A Command in Tauri is simply a request sent by the client (in our case the front-end of the app) to the server (the backend of our app). Unlike a typical webserver, this is not an HTTP request but something closer to a remote procedure call.
To register a command handler, we’ll need to inform the Tauri builder about it during initialization. Let’s define our handlers in audio-instruct/src-tauri/src/commands.rs.
| |
OK, so there’s quite a lot happening there; let’s break it down.
- The `struct Command` defines the structure of our incoming instruction. Unlike our previous text-only inference attempt, we can now have a text as well as an audio instruction mode. A text instruction works much like it did in our previous blogs, simply carrying the text input, but an audio instruction doesn’t contain any data. That’s because the data for the audio chunks is transmitted over `fn audio_chunk()`; each chunk is (pre)processed and stored by the `WhisperWrap` object, and once we send the `ask()` command with `audio: Some(true)` the inference runs on the already-received data.
- The `enum Mode` simply identifies what kind of `command` is being requested in `ask()`
- Structs `Response` and `Meta` simply hold our inference output and a bunch of metadata around it
- `fn ask()` is a command handler which processes an incoming `Command` and responds with a `Response`
- Finally, the handler `fn audio_chunk()` is slightly different (and this is why I chose Tauri 2.0 for this project). You see, in previous versions of Tauri the incoming data in a `tauri::command` needed to be text serializable, which kind of defeats the purpose of sending chunked data. Tauri 2.0 introduces this capability (I couldn’t find any documentation yet, but there is some reference in this github issue). Hence, `fn audio_chunk()` basically reads the incoming bytes as a `Vec<f32>`, which is what our chunk processing requires - see the sketch below.
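Putting those pieces together, a rough sketch of the command layer could look like the following. The field names are guesses, and the raw-byte plumbing Tauri 2.0 provides is simplified here to a plain byte vector:

```rust
// src-tauri/src/commands.rs (sketch)
use std::sync::Arc;

use serde::{Deserialize, Serialize};
use tauri::State;

use crate::instruct::{Instruct, InstructEvent};

#[derive(Debug, Deserialize)]
pub struct Command {
    /// `Some(prompt)` for a text instruction.
    pub text: Option<String>,
    /// `Some(true)` when the already-streamed audio chunks should be transcribed.
    pub audio: Option<bool>,
}

#[derive(Debug, Serialize)]
pub enum Mode {
    Text,
    Audio,
}

#[derive(Debug, Serialize)]
pub struct Meta {
    pub mode: Mode,
    pub elapsed_secs: f32,
}

#[derive(Debug, Serialize)]
pub struct Response {
    pub answer: String,
    pub meta: Meta,
}

/// Handles both text and audio instructions and replies with a `Response`.
#[tauri::command]
pub fn ask(state: State<'_, Arc<Instruct>>, cmd: Command) -> Result<Response, String> {
    let _ = (&state, &cmd);
    todo!("dispatch to the llama / whisper wrappers")
}

/// Receives one chunk of audio; the real handler reads Tauri 2.0's raw request
/// body, simplified here to little-endian `f32` bytes.
#[tauri::command]
pub fn audio_chunk(state: State<'_, Arc<Instruct>>, chunk: Vec<u8>) -> Result<(), String> {
    let samples: Vec<f32> = chunk
        .chunks_exact(4)
        .map(|b| f32::from_le_bytes([b[0], b[1], b[2], b[3]]))
        .collect();
    state
        .send(InstructEvent::AudioChunk(samples))
        .map_err(|e| e.to_string())
}
```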
Let’s modify our fn main() to account for these handlers.
| |
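A sketch of the updated fn main(), registering the managed state and the two handlers:

```rust
// src-tauri/src/main.rs (sketch)
fn main() {
    pretty_env_logger::init();

    // Load (or download) the models up front and hand the state to Tauri.
    let state = instruct::Instruct::new().expect("failed to initialize models");

    tauri::Builder::default()
        .plugin(tauri_plugin_dialog::init())
        .plugin(tauri_plugin_fs::init())
        .manage(state)
        .invoke_handler(tauri::generate_handler![
            commands::ask,
            commands::audio_chunk
        ])
        .run(tauri::generate_context!())
        .expect("error while running tauri application");
}
```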
OK, great. Now that our state, commands and a simple communication infrastructure are set up, let’s focus our attention on the models.
Models
We are going to be using two models: a Whisper-based distil-whisper-large-v3 model for our audio transcription and a LLaMA3 8B gguf quantized model for our text inference. When our struct Instruct initializes, it needs to instantiate both models. We’ll define a wrapper struct for each model (let’s call them struct WhisperWrap and struct LlamaWrap). These structs will expose their init() constructors, inference methods, and pre- and post-processing logic if any, and will be responsible for downloading the model files if and when required.
Model Loading
Instantiation and model loading will largely involve the following steps:
- Check if the required files exist locally in our `app_data_dir()`; if they don’t, download them from `hf_hub`.
- Each wrapper should then have a mechanism to initialize its model based on its configuration. An interesting thing here is that we are using two different model types: the LLaMA3 model is a `gguf quantized` variant while the `distil-whisper-large-v3` model is a HuggingFace-style [safetensors](https://huggingface.co/docs/safetensors/en/index) model. `Candle` exposes slightly different ways of loading them and we’ll need to account for that.
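For the download step, the hf-hub crate’s synchronous API fetches files by repo id and file name. A minimal sketch for the whisper side (note that hf-hub caches under its own cache directory by default, so pointing it at app_data_dir() takes a little extra wiring):

```rust
use anyhow::Result;
use hf_hub::api::sync::Api;
use std::path::PathBuf;

/// Fetch (and cache) the three files the whisper model needs.
fn fetch_whisper_files() -> Result<(PathBuf, PathBuf, PathBuf)> {
    let api = Api::new()?;
    let repo = api.model("distil-whisper/distil-large-v3".to_string());

    let tokenizer = repo.get("tokenizer.json")?;
    let config = repo.get("config.json")?;
    let weights = repo.get("model.safetensors")?;

    Ok((tokenizer, config, weights))
}
```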
With that information, let’s code up the model loaders.
struct WhisperWrap and model distil-whisper-large-v3
To load a non-gguf candle model we’ll need:
- a `tokenizer.json` file specific to the model - this specifies the vocabulary of the model along with a bunch of other `tokenizer` configurations
- a `config.json` file for the model - the model architecture is defined here
- and finally the `model.safetensors` file - these are the `model weights`, in our case in `float16` format
Reference: Check out the model card for distil-whisper-large-v3 here; browse through the Files and versions section to see what other formats are available.
In our audio-instruct/src-tauri/src/whisper.rs we’ll first define a struct WhisperWrap and its initializers:
| |
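Roughly, the wrapper is shaped like this (a sketch; the real struct likely carries a few more fields):

```rust
// src-tauri/src/whisper.rs (sketch)
use candle_core::Device;
use candle_transformers::models::whisper::{model::Whisper, Config};
use tokenizers::Tokenizer;

pub struct WhisperWrap {
    model: Whisper,
    config: Config,
    tokenizer: Tokenizer,
    mel_filters: Vec<f32>,
    device: Device,
}
```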
This requires a quick explanation. The model field of the struct will hold an instance of the Whisper model, and the tokenizer field is an instance of struct Tokenizer from the tokenizers crate, yet again by the awesome HuggingFace team. Most of the fields in our struct WhisperWrap should be self-explanatory, but mel_filters requires a deeper dive.
mel_filters: Mel-Spectrogram
Here’s a Fantastic Writeup on Mel-Spectrogram.
In text processing we convert the incoming text (words or related bytes etc.) into a bunch of ids or tokens - each would be a part of the vocabulary of the model, a set of tokens that the model knows from its training. Anything beyond that is unknown and more often in recent models would be represented by some form of UNK token. These tokens are your inputs to a text LLM.
Audio input is far more complicated. If I understand correctly, this is what is happening:
- A sliding window is applied to the audio waveform and a Fourier transform is applied to each window. This converts the audio signal into a time-frequency representation.
- The magnitudes are then converted to the Mel scale, which approximates human auditory perception; a pre-computed mel filterbank is applied to the data. This is what the mel_filters field of struct WhisperWrap holds.
- A bunch of processing steps later (log, normalization… it depends, I guess, on the model), specifically for Whisper, the spectrograms are split into 30s chunks and shorter sequences are padded. These are your input features for a whisper model.
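Conveniently, candle-transformers ships the mel conversion its whisper example uses, so we don’t have to implement the math ourselves. A sketch of how the stored filterbank gets applied (function name and shapes are illustrative):

```rust
use anyhow::Result;
use candle_core::{Device, Tensor};
use candle_transformers::models::whisper::{audio, Config};

/// Convert mono 16 kHz PCM samples into the (1, n_mels, n_frames) log-mel
/// tensor the whisper encoder expects, using the pre-computed filterbank.
fn to_mel_tensor(
    config: &Config,
    mel_filters: &[f32],
    pcm: &[f32],
    device: &Device,
) -> Result<Tensor> {
    let mel = audio::pcm_to_mel(config, pcm, mel_filters);
    let n_mels = config.num_mel_bins;
    let n_frames = mel.len() / n_mels;
    Ok(Tensor::from_vec(mel, (1, n_mels, n_frames), device)?)
}
```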
Note: I don’t have an expert-level understanding of the steps above; I might have gotten something wrong or missed a few. Please do point it out if you find an issue.
Now that we have some understanding of the Mel spectrogram and the input to whisper, let’s continue with our model initialization.
| |
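The load itself follows candle’s whisper example: parse the config, load the tokenizer, memory-map the safetensors and build the model. A sketch (serde_json is an extra dependency here; error handling trimmed):

```rust
use anyhow::Result;
use candle_core::{DType, Device};
use candle_nn::VarBuilder;
use candle_transformers::models::whisper::{model::Whisper, Config};
use std::path::Path;
use tokenizers::Tokenizer;

fn load_whisper(
    tokenizer_path: &Path,
    config_path: &Path,
    weights_path: &Path,
    device: &Device,
) -> Result<(Whisper, Config, Tokenizer)> {
    let config: Config = serde_json::from_str(&std::fs::read_to_string(config_path)?)?;
    let tokenizer = Tokenizer::from_file(tokenizer_path).map_err(anyhow::Error::msg)?;

    // Memory-map the weights; the dtype follows candle's whisper example.
    let vb = unsafe { VarBuilder::from_mmaped_safetensors(&[weights_path], DType::F32, device)? };
    let model = Whisper::load(&vb, config.clone())?;

    Ok((model, config, tokenizer))
}
```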
So that would be enough to initialize our model. The comments should be enough to detail out what we are doing, so I’ll not spend a lot of time on this now.
struct LlamaWrap and model LLaMA3 Quant GGUF
Loading the LLaMA3 GGUF model is a lot less involved, because our gguf file is a self-contained entity with everything necessary to run the model. Also, the awesome Candle team has already provided us with a simple interface to load the model file.
| |
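A sketch of that load, following candle’s quantized-llama example (the file path comes from the same hf-hub style download step):

```rust
use anyhow::Result;
use candle_core::{quantized::gguf_file, Device};
use candle_transformers::models::quantized_llama::ModelWeights;
use std::path::Path;

fn load_llama_gguf(gguf_path: &Path, device: &Device) -> Result<ModelWeights> {
    let mut file = std::fs::File::open(gguf_path)?;
    // The gguf header carries the architecture, hyper-parameters and tensor index.
    let content = gguf_file::Content::read(&mut file)?;
    Ok(ModelWeights::from_gguf(content, &mut file, device)?)
}
```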
And that’s it. Now, let’s run our app and see if everything is in order.
Note: The LogitsProcessor and ModelWeights structs require a mutable borrow during the inference pass. Notice how we wrap these fields in Arc<Mutex>. There are other ways of doing this, but I’ll leave it up to you to use your imagination. :)
RUST_LOG=info npm run tauri dev --release -- --features metal
Note: I’m passing the --release flag to get the best out of the load. The --features metal flag ensures that our fn device() helper creates a Metal device. If you are using cuda, ensure that you have installed candle in cuda mode and pass --features cuda to the launch command.
If everything has gone according to plan, the models would be downloaded, and the app should show the default tauri welcome window in a few seconds.
Now that our model loading is complete, let us set up our text inference flow.
Text inference
We’ve done this before in our previous series Desktop QA Assistant with LLaMA3, and our current implementation should not be very different. It’s not a straight copy-paste job because we have switched our ML framework from the GGML wrapper llama3_cpp-rs to huggingface candle, but it should be very similar.
First, let’s add some methods to our struct LlamaWrap to accept an incoming text request, preprocess and generate some response.
| |
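Here’s a rough sketch of what such a generation method can look like, loosely following candle’s quantized-llama example. The fields on self (model, logits_processor, tokenizer, device), the stop token and the prompt handling are my assumptions; the real method also needs to apply the LLaMA3 chat template to the prompt:

```rust
use candle_core::Tensor;

impl LlamaWrap {
    /// Generate up to `max_new_tokens` tokens for an already-formatted prompt.
    pub fn generate(&self, prompt: &str, max_new_tokens: usize) -> anyhow::Result<String> {
        let mut model = self.model.lock().expect("model mutex poisoned");
        let mut sampler = self.logits_processor.lock().expect("sampler mutex poisoned");

        // Encode the prompt and push it through the model in a single forward pass.
        let tokens = self
            .tokenizer
            .encode(prompt, true)
            .map_err(anyhow::Error::msg)?
            .get_ids()
            .to_vec();
        let input = Tensor::new(tokens.as_slice(), &self.device)?.unsqueeze(0)?;
        let logits = model.forward(&input, 0)?.squeeze(0)?;
        let mut next = sampler.sample(&logits)?;

        // Feed the sampled token back in, one position at a time.
        let eot = self.tokenizer.token_to_id("<|eot_id|>");
        let mut generated = Vec::with_capacity(max_new_tokens);
        for index in 0..max_new_tokens {
            if Some(next) == eot {
                break;
            }
            generated.push(next);
            let input = Tensor::new(&[next], &self.device)?.unsqueeze(0)?;
            let logits = model.forward(&input, tokens.len() + index)?.squeeze(0)?;
            next = sampler.sample(&logits)?;
        }

        self.tokenizer.decode(&generated, true).map_err(anyhow::Error::msg)
    }
}
```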
Let’s write a quick testcase to check if our generation works.
| |
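The test can be as simple as constructing the wrapper and asking for a short completion; the names mirror the sketches above:

```rust
#[cfg(test)]
mod tests {
    use super::*;

    #[test]
    fn llama_infer() -> anyhow::Result<()> {
        let llama = LlamaWrap::new()?;
        let answer = llama.generate("What is the capital of France?", 64)?;
        println!("{answer}");
        assert!(!answer.is_empty());
        Ok(())
    }
}
```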
Now, let’s run our test. Note the flags I’m passing to enable features metal.
cd src-tauri
RUST_LOG=info cargo test llama_infer --release --features metal -- --nocapture
cd ..
And voilà … our text inference works!
Let’s enable our text inference through the public API before moving on to the audio inference. It’s simple: we have already stubbed out the pub fn text() method of our struct Instruct, so we’ll just replace the todo!() with an actual call. Yet another todo!() is DONE.
| |
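With a generate method like the one sketched earlier, the call is a one-liner (again a sketch, with an arbitrary token budget):

```rust
impl Instruct {
    /// Run a text instruction straight through the LLaMA wrapper.
    pub fn text(&self, prompt: &str) -> anyhow::Result<String> {
        self.llama.generate(prompt, 512)
    }
}
```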
Wrapping up & Next steps:
In this post we set up the required scaffolding for running our desktop application, loaded our LLaMA3 model and got it to generate some text for us. In the next post, Part II of this series, we’ll wrap this up with the audio inference and an integrated frontend.
Till then, adios ..
Before we close today …
If you have found this post helpful consider spreading the word, it would act as a strong motivator for me to create more. If you found an issue or hit a snag, reach out to me @beingAnubhab.
Acknowledgements & reference
This project is built on the shoulders of stalwarts; a huge shout-out to all of them.
- Rust maintainers for this awesome language
- The tauri app and its creators and maintainers
- Meta for creating the LLaMA family of models and giving open-source AI a fair shot
- HuggingFace🤗 for everything they do, including Candle, distil-whisper and Tokenizer
- Georgi Gerganov for creating the GGML/GGUF movement
- Quant Factory Team for the LLaMA GGUF model files
- Svelte team and the creator Rich Harris for Svelte :)
And many, many more …