In my previous series Desktop QA Assistant With Llama3 in Rust we built a desktop app capable of interfacing with LLaMA3. In this post, we’ll extend the modality of the app to accept audio instructions instead of just text.
In this part of the series, we’ll deal with the setup, load the models and get our text inference up and running using the Rust ML framework Candle.
The series:
Who is this for?
You’ll feel right at home if you are a programmer, have some exposure to Rust and a bit of experience working with Svelte, React or any other modern client-side framework.
Tools and Libraries
- Rust - Install Rust
- Tauri - A cross-platform desktop app toolkit built on Rust
- SvelteKit - For the quick and simple UI
- Meta LLaMA3 8B - we are going to be using a `gguf` version of LLaMA3 8B. `gguf` is a file format created by Georgi Gerganov
- Distil Whisper large-v3 - the knowledge-distilled version of OpenAI Whisper-large-v3
- Candle - a minimalist ML framework in Rust by the awesome HuggingFace🤗 folks
Note on GGUF: GGUF is a file format for storing models for inference with GGML and executors based on GGML. GGUF is a binary format that is designed for fast loading and saving of models, and for ease of reading. Models are traditionally developed using PyTorch or another framework, and then converted to GGUF for use in GGML.
If you are not familiar with the `GGML`/`GGUF` ecosystem, I’ve written a small note about it here
About Tauri 2.0 Beta: I’m taking this opportunity to test out Tauri 2.x Beta. I’ve never used it before, so I might change a few things from the setup of the previous project - hiccup warnings in advance.
TL;DR
The Final Output
Alert: contains audio
A note on Multimodality: Model vs Pipeline
Modality:
Learning modalities are the sensory channels or pathways through which individuals give, receive, and store information. Perception, memory, and sensation comprise the concept of modality. The modalities or senses include visual, auditory, tactile/kinesthetic, smell, and taste.
In the context of ML/DL, modality refers to the different kinds of data a model can work with, comprehend or process.
Multimodal Model:
Unimodal models like the BERT family use `text` as their modality, while ResNet-18 uses `image` as its modality, i.e. they can comprehend, interpret and work with `text` (for BERT) and `image` (ResNet) data; that’s what they have been trained on and what they can run inference on.
On the other hand, multimodal models can work with, comprehend and interpret multiple modalities like text, image, audio and so on.
E.g.
- ChatGPT can work with text, image and audio inputs & outputs
- The open-source LLaVA family of models is trained to work with both `text` and `image` inputs
Simply put, a multimodal model can work with more than one type of input or output.
Multimodal Pipeline: Mocking multimodality with multiple unimodal models
Multimodal models tend to have complex architectures or, sometimes, exhibit the not-so-efficient jack of all trades, master of none problem. While solving real-world problems with AI/ML, I’ve often found that a bunch of specialist models working together produces significantly better results than a multimodal generalist.
In this project we are going to stitch together 2 specialists: the OpenAI Whisper audio-to-text model for transcribing the audio, and the Meta LLaMA3 text & language specialist model for our LLM backend. We’ll stitch them together with our Multimodal Pipeline.
Setup
Setting up Tauri
With the Rust toolchain and `tauri-cli` installed, we’ll just run
cargo create-tauri-app audio-instruct --beta
Prerequisites
For the options `create-tauri-app --beta` asks for, my choices were as follows:
- Frontend Language -> Typescript
- Package Manager -> npm
- UI Template -> Svelte
- UI Flavor -> Typescript
- Mobile Project -> No - we’ll try this in a future project 🤩
Now, we’ll move into our project directory and run
npm i
After the installation completes, let’s run our desktop app for the first time.
npm run tauri dev
And we get a neat looking default window
Observations
- Tauri 2.0 feels like a much more polished product than what we got in Tauri v1 (which is expected). The out-of-the-box experience is far superior to the v1 experience, the initial boilerplate setup has also reduced significantly. 🥳
- But, because Tauri 2.0 is much more powerful, its permission structures are now a lot more modular, and the `v2.x` documentation is still early and requires a bit of a deep dive to find the right stuff.
The project/directory structure looks familiar - at least superficially I don’t see a major difference from Tauri `v1`. You can find a quick note and explanation about the project layout here
A notable change from the Tauri `v1` structure is the `audio-instruct/src-tauri/capabilities` directory. As per the Tauri 2.0 documentation about capabilities, this directory contains a set of permissions defined as `json`, mapped to the application windows and webviews by their respective `label`. This is what my `audio-instruct/src-tauri/capabilities/default.json` looks like (to enable dialogs):
{
"$schema": "< path to >desktop-schema.json",
"identifier": "default",
"description": "Capability for the main window",
"windows": ["main"],
"permissions": [
"path:default",
"event:default",
"window:default",
"app:default",
"image:default",
"resources:default",
"menu:default",
"tray:default",
"shell:allow-open",
"dialog:allow-open",
"dialog:allow-save",
"dialog:allow-message",
"dialog:allow-ask",
"dialog:allow-confirm",
"dialog:default"
]
}
Setting up the backend dependencies
Candle first
We are going to be using Candle as the ML framework for our inference. `Candle` requires a slightly more involved setup, so let’s get that out of the way first. The instructions for setting up `candle` for a project can be found here
cargo add --git https://github.com/huggingface/candle.git candle-core --features "metal"
Choosing a `Backend`: Because I’m on a Mac M1, I’m using the `metal` backend. Candle also works with `cuda`, `mkl` etc. Choose the GPU backend appropriate for your machine. Candle has reasonably elaborate guides; check them out if you get stuck (I did, when I attempted this for the first time).
More on Tauri
Tauri 2.0 introduces a much more powerful and modular approach to `access`, `permissions` and `scopes`. We’ll need to add a couple of plugin crates to our `Cargo.toml` to get our `Dialog` and `File System` access right.
- Crate `tauri-plugin-dialog` for enabling frontend dialogs, confirmation etc.
- Crate `tauri-plugin-fs` for filesystem access
Usual suspects
Now, with `candle-core` added to our dependencies, let’s go ahead and add the usual suspects: `log` & `pretty_env_logger` for logging, `anyhow` for errors, etc.
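Here’s a rough sketch of what the relevant `Cargo.toml` additions could look like - the exact versions and the feature forwarding below are assumptions, so adjust them to the current releases:

```toml
# src-tauri/Cargo.toml (sketch) -- versions and feature wiring are assumptions
[dependencies]
tauri = { version = "2.0.0-beta", features = [] }
tauri-plugin-dialog = "2.0.0-beta"
tauri-plugin-fs = "2.0.0-beta"
serde = { version = "1", features = ["derive"] }
serde_json = "1"
anyhow = "1"
log = "0.4"
pretty_env_logger = "0.5"
rand = "0.8"
tokenizers = "0.19"
hf-hub = "0.3"
candle-core = { git = "https://github.com/huggingface/candle.git" }
candle-nn = { git = "https://github.com/huggingface/candle.git" }
candle-transformers = { git = "https://github.com/huggingface/candle.git" }

[features]
# enable exactly one of these at build time, e.g. `--features metal`
metal = ["candle-core/metal", "candle-nn/metal", "candle-transformers/metal"]
cuda = ["candle-core/cuda", "candle-nn/cuda", "candle-transformers/cuda"]
```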
Note: Note how I defined `features` for our tauri project. This will help us create the GpuDevice automatically based on the `cfg!(feature = <something>)` macro. I’ll run the app with `--features "metal"` and that should initialize the device automatically. If you are using `cuda` or some other GPU device, make sure to change this to `--features "cuda"` etc.
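To make that concrete, here’s a minimal sketch of such a feature-gated device helper - I’m assuming it lives in a `utils` module and is called `device()`:

```rust
// src-tauri/src/utils.rs (sketch)
use candle_core::Device;

/// Picks the GPU backend the binary was compiled with, falling back to CPU.
pub fn device() -> anyhow::Result<Device> {
    if cfg!(feature = "metal") {
        Ok(Device::new_metal(0)?)
    } else if cfg!(feature = "cuda") {
        Ok(Device::new_cuda(0)?)
    } else {
        Ok(Device::Cpu)
    }
}
```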
We’ve added:
- `anyhow` to find our way around errors
- `candle-transformers` to almost plug-and-play the `whisper` and `llama` models
- `candle-nn` to work with tensors
- `hf-hub` to download the models
- `log` and `pretty_env_logger` for some neat logging
- `tokenizers` - for decoding our data for inference with `whisper`
- `rand` - we’ll need to generate some random ranges for inference
Structuring our App
We’ll end up having the following overall structure and modules for our inference engine:
- a module `instruct` where the state of the app will be maintained and which the `handlers` will call into
- a `whisper` module to do the audio transcription and the associated pre- and post-processing
- a `llama` module for the text inference
- we’ll also end up with a `utils` module; we’ll always need some helper functions that don’t fit anywhere else
- and a `commands` module where we’ll define our `handlers` (or `commands` in the Tauri universe)
Let’s just go ahead and create these empty files (instruct.rs, whisper.rs, llama.rs and so on …) and declare them as modules in our main.rs
A last bit of boilerplate I’ve always needed to do is to update the `identifier` field in `src-tauri/tauri.conf.json` to something unique. Let’s change this to `audio-instruct.llm`.
Our `main.rs` should now look like the following:
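Something along these lines - a minimal sketch with the module declarations and the default builder (the plugin registration mirrors the crates we added earlier):

```rust
// src-tauri/src/main.rs (sketch)
#![cfg_attr(not(debug_assertions), windows_subsystem = "windows")]

mod commands;
mod instruct;
mod llama;
mod utils;
mod whisper;

fn main() {
    pretty_env_logger::init();

    tauri::Builder::default()
        .plugin(tauri_plugin_dialog::init())
        .plugin(tauri_plugin_fs::init())
        .run(tauri::generate_context!())
        .expect("error while running tauri application");
}
```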
App-state, `commands` and communication
To get our app to work, we’ll need:
- An app state that will be instantiated on launch, this will basically hold our models once loaded and expose APIs for preprocessing, inference etc.
- A way of communicating the `instruction` or `command` from the front-end to the backend engine
- We’ll capture audio through the front-end interface and then send it for transcription. We could do this by recording the whole audio in the client interface and then sending the whole blob to the backend, but let’s try to make this a little juicier: we’ll attempt to buffer chunks of the audio and `emit` them to the backend, and we’ll also need an MPSC (multiple producer single consumer) channel to orchestrate this. More on this later.
- And of course, last but not least, we’ll need a couple of `structs` to create instances of the `candle` models and some associated methods.
App-state with struct Instruct
We define our `struct Instruct` in `audio-instruct/src-tauri/src/instruct.rs`; we’ll also define an initializer for it which will be responsible for instantiating the models, or downloading them from the HuggingFace Hub with the `hf_hub` crate.
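A minimal sketch of what this struct and its initializer could look like - the field names, the channel payload type and the `listen()` helper are assumptions for now:

```rust
// src-tauri/src/instruct.rs (sketch)
use std::sync::mpsc::{channel, Receiver, Sender};

use anyhow::Result;

use crate::{llama::LlamaWrap, whisper::WhisperWrap};

/// Holds the loaded models plus the sending half of the channel
/// that the command handlers push incoming requests into.
pub struct Instruct {
    llama: LlamaWrap,
    whisper: WhisperWrap,
    // note: `Sender` is `Sync` on recent Rust versions; wrap it in a Mutex otherwise
    send: Sender<String>,
}

impl Instruct {
    /// Loads (or downloads) both models and spawns the listener thread.
    pub fn new() -> Result<Self> {
        let device = crate::utils::device()?;
        let llama = LlamaWrap::new(device.clone())?;
        let whisper = WhisperWrap::new(device)?;

        let (send, recv) = channel::<String>();
        std::thread::spawn(move || Self::listen(recv));

        Ok(Self { llama, whisper, send })
    }

    /// The event listener: an empty shell for now.
    fn listen(recv: Receiver<String>) {
        while let Ok(_msg) = recv.recv() {
            // we'll fill this in when we wire up the audio chunk processing
        }
    }
}
```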
The event listener is pretty much an empty shell for now; we’ll change that soon enough. The idea behind this communication pattern is simple - the command `handlers` will receive a `request` which gets pushed onto the `MPSC` channel, so the end user doesn’t need to wait for it to finish processing.
`Command` handlers
A `Command` in `tauri` is simply a request sent by the `client` (in our case the `front-end` of the app) to the `server` (the backend of our app). Unlike with a typical webserver, this is not an `HTTP` request but rather something closer to a `remote procedure call`.
To register a `command` handler, we’ll need to inform the `tauri builder` about it during initialization. Let’s define them in our `audio-instruct/src-tauri/commands.rs`.
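Here’s a sketch of how these pieces could fit together - the struct fields, the `Meta` contents and the raw-body handling (via `tauri::ipc::Request`) reflect my reading of the Tauri 2.0 beta API and are assumptions:

```rust
// src-tauri/src/commands.rs (sketch)
use serde::{Deserialize, Serialize};
use tauri::State;

use crate::instruct::Instruct;

/// The incoming instruction: plain text, or a flag telling us to run
/// inference on the audio chunks we have already buffered.
#[derive(Debug, Deserialize)]
pub struct Command {
    pub text: Option<String>,
    pub audio: Option<bool>,
}

/// What kind of command is being requested.
#[derive(Debug, Serialize)]
pub enum Mode {
    Text,
    Audio,
}

/// The inference result plus some metadata about the run.
#[derive(Debug, Serialize)]
pub struct Response {
    pub text: String,
    pub meta: Meta,
}

#[derive(Debug, Serialize)]
pub struct Meta {
    pub mode: Mode,
    pub elapsed_ms: u128,
}

/// Processes an incoming `Command` and responds with a `Response`.
#[tauri::command]
pub async fn ask(state: State<'_, Instruct>, cmd: Command) -> Result<Response, String> {
    // dispatch to text or audio inference based on `cmd` -- wired up later
    let _ = (state, cmd);
    todo!()
}

/// Receives raw audio bytes from the frontend. Tauri 2.0 lets a command
/// read the raw request body instead of a JSON-serialized payload.
#[tauri::command]
pub fn audio_chunk(state: State<'_, Instruct>, request: tauri::ipc::Request<'_>) -> Result<(), String> {
    if let tauri::ipc::InvokeBody::Raw(bytes) = request.body() {
        // reinterpret the little-endian bytes as f32 samples
        let samples: Vec<f32> = bytes
            .chunks_exact(4)
            .map(|b| f32::from_le_bytes([b[0], b[1], b[2], b[3]]))
            .collect();
        // hand the chunk to the app state for buffering (hypothetical method)
        // state.push_chunk(samples);
        let _ = (state, samples);
        Ok(())
    } else {
        Err("expected a raw byte payload".to_string())
    }
}
```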
Ok, so quite a lot happening there, let’s break it down.
- The `struct Command` defines a structure for our incoming instruction. Unlike our previous `text-only` inference attempt, we can now have a text as well as an audio instruction mode. A text instruction works much like in our previous blogs - we simply accept the text input - but an audio instruction doesn’t carry any data. That’s because the data for the audio chunks is transmitted over `fn audio_chunk()`; each chunk is then (pre)processed and stored by the `WhisperWrap` object, and once we send the command `ask()` with `audio: Some(true)`, the inference runs on the data already collected.
- The `enum Mode` simply identifies what kind of `command` is being requested in `ask()`
- Structs `Response` and `Meta` simply hold our inference output and a bunch of metadata around it
- `fn ask()` is a command handler which processes an incoming `Command` and responds with a `Response`
- Finally, the handler `fn audio_chunk()` is slightly different (and this is why I chose Tauri 2.0 for this project). You see, in previous versions of `Tauri` the incoming data in a `tauri::command` needed to be `text serializable`, which kind of defeats the purpose of sending `chunked` data. Tauri 2.0 introduces the ability to send raw bytes (I couldn’t find any documentation yet, but there is some reference in this github issue). Hence, `fn audio_chunk()` basically reads the incoming bytes as a `Vec<f32>`, which is what our chunk processing requires.
Let’s modify our `fn main()` to account for these handlers.
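Roughly like so - a sketch where we create the `Instruct` state, hand it to Tauri, and register the two handlers:

```rust
// src-tauri/src/main.rs (sketch) -- now with managed state and our command handlers
#![cfg_attr(not(debug_assertions), windows_subsystem = "windows")]

mod commands;
mod instruct;
mod llama;
mod utils;
mod whisper;

fn main() {
    pretty_env_logger::init();

    // load (or download) the models up-front; bail out if anything goes wrong
    let state = instruct::Instruct::new().expect("failed to initialize models");

    tauri::Builder::default()
        .plugin(tauri_plugin_dialog::init())
        .plugin(tauri_plugin_fs::init())
        .manage(state)
        .invoke_handler(tauri::generate_handler![commands::ask, commands::audio_chunk])
        .run(tauri::generate_context!())
        .expect("error while running tauri application");
}
```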
Ok, great. Now that our `State`, `commands` and a simple communication infrastructure are set up, let’s focus our attention on the models.
Models
We are going to be using two models: a `Whisper`-based `distil-whisper-large-v3` model for our audio transcription, and a LLaMA3 8B `gguf` quantized model for our text inference. When our `struct Instruct` initializes, it needs to instantiate both models. We’ll define a wrapper struct for each of them (let’s call them `struct WhisperWrap` and `struct LlamaWrap`). These structs will expose their `init()` constructors, inference methods, pre- and post-processing logic if any, and will be responsible for downloading the models if and when required.
Model Loading
Instantiation and model loading will largely involve the following steps:
1. Check if the required files exist locally in our `app_data_dir()`; if they don’t exist, download them from `hf_hub`.
2. Each wrapper should then have a mechanism to initialize its respective model based on its configuration. Now, an interesting thing over here is that we are using two different model types: the LLaMA3 model is a `gguf quantized` variant while the `distil-whisper-large-v3` model is a HuggingFace-style [safetensors](https://huggingface.co/docs/safetensors/en/index) model. `Candle` exposes slightly different ways of loading them and we’ll need to account for that.
With that information, let’s code up the model loaders.
`struct WhisperWrap` and model `distil-whisper-large-v3`
To load a non-gguf `candle` model we’ll need:
- a `tokenizer.json` file specific to the model - this specifies the vocabulary of the model along with a bunch of other `tokenizer` configurations
- a `config.json` file for the model - the model architecture is defined here
- and finally the `model.safetensors` file - these are the `model weights`, in our case in `float16` format
Reference: Check out the model card for `distil-whisper-large-v3` here, browse through the Files and versions section to see what other formats are available
In our `audio-instruct/src-tauri/src/whisper.rs` we’ll first define a `struct WhisperWrap` and its initializers:
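A sketch of the wrapper struct - the exact field set (and the `samples` buffer in particular) is an assumption for this walkthrough:

```rust
// src-tauri/src/whisper.rs (sketch)
use candle_core::Device;
use candle_transformers::models::whisper::{model::Whisper, Config};
use tokenizers::Tokenizer;

/// Wraps the distil-whisper model plus everything needed to turn
/// raw audio samples into the input features the encoder expects.
pub struct WhisperWrap {
    /// the model itself
    model: Whisper,
    /// tokenizer from the `tokenizers` crate
    tokenizer: Tokenizer,
    /// model configuration, read from `config.json`
    config: Config,
    /// pre-computed mel filterbank applied to the audio spectrogram
    mel_filters: Vec<f32>,
    /// the device (Metal/CUDA/CPU) our tensors live on
    device: Device,
    /// audio samples buffered from the frontend chunks
    samples: Vec<f32>,
}
```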
This requires a quick explanation. The `model` field of the struct will hold an instance of the `Whisper` model, and the `tokenizer` field is an instance of `struct Tokenizer` from the `tokenizers` crate, yet again by the awesome HuggingFace team. Most of the fields in our `struct WhisperWrap` should be self-explanatory, but `mel_filters` requires a deeper dive.
`mel_filters`: Mel-Spectrogram
Here’s a Fantastic Writeup on Mel-Spectrogram.
In `text` processing we convert the incoming `text` (words or related bytes etc.) into a bunch of `ids` or `tokens` - each is part of the `vocabulary` of the model, the set of `tokens` that the `model` knows from its training. Anything beyond that is `unknown` and, in most recent models, would be represented by some form of `UNK` token. These tokens are your inputs to a text `LLM`.
Audio input is far more complicated. If I understand correctly, this is what is happening:
1. A sliding window is applied to the audio waveform and then a Fourier transform is applied to each window. This converts the audio signal to a `time-frequency` representation.
2. The resulting `magnitudes` are then converted to the Mel scale, which approximates human auditory perception; a pre-computed `mel filterbank` is applied to this data. This is what the `mel_filters` field of our `struct WhisperWrap` maintains.
3. A bunch of processing steps later (log, normalization… it depends, I guess, on the model), and specifically for `Whisper`, the spectrograms are converted into 30s chunks, with shorter sequences `padded`. These are your input features for a whisper model.
Note: I don’t have an expert-level understanding of the steps above; I might have gotten it wrong or missed a few. Please do point it out if you find an issue.
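To make the steps above a bit more concrete, here’s roughly how the conversion could look using candle’s whisper helpers (a sketch; it assumes the `WhisperWrap` fields we defined earlier):

```rust
use candle_core::Tensor;
use candle_transformers::models::whisper::audio;

impl WhisperWrap {
    /// Converts the buffered PCM samples into the (1, num_mel_bins, frames)
    /// tensor the whisper encoder expects.
    fn to_mel(&self) -> candle_core::Result<Tensor> {
        // windowing + FFT + mel filterbank, courtesy of candle's audio helpers
        let mel = audio::pcm_to_mel(&self.config, &self.samples, &self.mel_filters);
        let mel_len = mel.len();
        let n_mels = self.config.num_mel_bins;
        Tensor::from_vec(mel, (1, n_mels, mel_len / n_mels), &self.device)
    }
}
```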
Now that we have some understanding of the `Mel Spectrogram` and the input to whisper, let’s continue with our model `initialization`.
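Here’s a sketch of that initializer. The hub repo id, the file names and the way the mel filterbank bytes are obtained (candle’s whisper example ships a `melfilters128.bytes` file) are assumptions; it also leans on `serde_json` to parse the config:

```rust
impl WhisperWrap {
    /// Downloads (if necessary) and loads distil-whisper-large-v3.
    pub fn new(device: Device) -> anyhow::Result<Self> {
        // fetch tokenizer.json, config.json and model.safetensors from the Hub
        let api = hf_hub::api::sync::Api::new()?;
        let repo = api.model("distil-whisper/distil-large-v3".to_string());
        let tokenizer_file = repo.get("tokenizer.json")?;
        let config_file = repo.get("config.json")?;
        let weights_file = repo.get("model.safetensors")?;

        let tokenizer = Tokenizer::from_file(tokenizer_file).map_err(|e| anyhow::anyhow!(e))?;
        let config: Config = serde_json::from_str(&std::fs::read_to_string(config_file)?)?;

        // the pre-computed mel filterbank (128 bins for large-v3), stored as little-endian f32s
        let mel_bytes = include_bytes!("melfilters128.bytes");
        let mel_filters: Vec<f32> = mel_bytes
            .chunks_exact(4)
            .map(|b| f32::from_le_bytes([b[0], b[1], b[2], b[3]]))
            .collect();

        // mmap the safetensors weights and build the model
        let vb = unsafe {
            candle_nn::VarBuilder::from_mmaped_safetensors(
                &[weights_file],
                candle_core::DType::F32,
                &device,
            )?
        };
        let model = Whisper::load(&vb, config.clone())?;

        Ok(Self { model, tokenizer, config, mel_filters, device, samples: Vec::new() })
    }
}
```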
So that would be enough to initialize our model. The comments should be enough to detail out what we are doing, so I’ll not spend a lot of time on this now.
`struct LlamaWrap` and model LLaMA3 Quant GGUF
Loading the LLaMA3 `GGUF` model is a lot less involved; that’s because our `gguf` file is a self-contained entity with everything necessary to run the model. Also, the awesome `Candle` team has already provided us with a simple interface to just `load` the model file.
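Here’s a sketch of the wrapper and its loader. The repo ids, the exact `gguf` file name and the sampling parameters are assumptions - swap in whichever QuantFactory file you downloaded:

```rust
// src-tauri/src/llama.rs (sketch)
use std::sync::{Arc, Mutex};

use anyhow::Result;
use candle_core::{quantized::gguf_file, Device};
use candle_transformers::{generation::LogitsProcessor, models::quantized_llama::ModelWeights};
use tokenizers::Tokenizer;

pub struct LlamaWrap {
    /// the quantized model; wrapped because inference needs a mutable borrow
    model: Arc<Mutex<ModelWeights>>,
    tokenizer: Tokenizer,
    /// sampling helper; also needs mutable access, hence the Mutex
    logits_processor: Arc<Mutex<LogitsProcessor>>,
    device: Device,
}

impl LlamaWrap {
    pub fn new(device: Device) -> Result<Self> {
        // fetch the gguf file and the tokenizer from the Hub
        let api = hf_hub::api::sync::Api::new()?;
        let gguf_path = api
            .model("QuantFactory/Meta-Llama-3-8B-Instruct-GGUF".to_string())
            .get("Meta-Llama-3-8B-Instruct.Q4_K_M.gguf")?;
        let tokenizer_path = api
            .model("meta-llama/Meta-Llama-3-8B-Instruct".to_string())
            .get("tokenizer.json")?;

        let tokenizer = Tokenizer::from_file(tokenizer_path).map_err(|e| anyhow::anyhow!(e))?;

        // a gguf file is self-contained: read its metadata, then load the tensors
        let mut file = std::fs::File::open(&gguf_path)?;
        let content = gguf_file::Content::read(&mut file)?;
        let model = ModelWeights::from_gguf(content, &mut file, &device)?;

        // seed, temperature and top-p picked arbitrarily for this sketch
        let logits_processor = LogitsProcessor::new(42, Some(0.7), Some(0.9));

        Ok(Self {
            model: Arc::new(Mutex::new(model)),
            tokenizer,
            logits_processor: Arc::new(Mutex::new(logits_processor)),
            device,
        })
    }
}
```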
And that’s it. Now, let’s run our app and see if everything is in order.
Note: The `LogitsProcessor` and `ModelWeights` structs require a mutable borrow during the inference pass. Notice how we wrap these fields in `Arc<Mutex<_>>`. There are other ways of doing this, but I’ll leave it up to you to use your imagination. :)
RUST_LOG=info npm run tauri dev --release -- --features metal
Note: I’m passing the `--release` flag to get the best out of the model load. The `--features metal` flag is to ensure that our `fn device()` helper creates a Metal device. If you are using `cuda`, ensure that you have installed `candle` in `cuda` mode and pass `--features cuda` to the launch command.
If everything has gone according to plan, the models would be downloaded, and the app should show the default tauri welcome window in a few seconds.
Now that our model loading is complete, let us set up our text inference flow.
Text inference
We’ve done this before in our previous series Desktop QA Assistant with LLaMA3, so our current implementation should not be very different. It’s not a straight copy-paste job because we have switched our ML framework from the GGML wrapper `llama3_cpp-rs` to huggingface `candle`, but it should be very similar.
First, let’s add some methods to our `struct LlamaWrap` to accept an incoming text `request`, preprocess it and generate a response.
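A sketch of such a generation method - the prompt template, the 512-token budget and the EOS handling below are simplifying assumptions:

```rust
use candle_core::Tensor;

impl LlamaWrap {
    /// Runs sampling-based generation for a text prompt.
    pub fn infer(&self, prompt: &str) -> Result<String> {
        // LLaMA3-instruct style prompt template (adjust to taste)
        let prompt = format!(
            "<|begin_of_text|><|start_header_id|>user<|end_header_id|>\n\n{prompt}<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n"
        );

        let encoded = self.tokenizer.encode(prompt, true).map_err(|e| anyhow::anyhow!(e))?;
        let mut tokens = encoded.get_ids().to_vec();
        // llama3's end-of-turn token, with its usual id (128009) as a fallback
        let eos = self.tokenizer.token_to_id("<|eot_id|>").unwrap_or(128009);

        let mut model = self.model.lock().map_err(|_| anyhow::anyhow!("poisoned lock"))?;
        let mut sampler = self.logits_processor.lock().map_err(|_| anyhow::anyhow!("poisoned lock"))?;

        let mut generated = Vec::new();
        let mut pos = 0usize;

        for step in 0..512 {
            // feed the whole prompt on the first step, then one token at a time
            let ctx = if step == 0 { &tokens[..] } else { &tokens[tokens.len() - 1..] };
            let input = Tensor::new(ctx, &self.device)?.unsqueeze(0)?;
            let logits = model.forward(&input, pos)?.squeeze(0)?;

            let next = sampler.sample(&logits)?;
            pos += ctx.len();
            tokens.push(next);
            if next == eos {
                break;
            }
            generated.push(next);
        }

        let text = self.tokenizer.decode(&generated, true).map_err(|e| anyhow::anyhow!(e))?;
        Ok(text)
    }
}
```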
Let’s write a quick testcase to check if our generation works.
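Something like this, assuming the `device()` helper from our `utils` module and the `infer()` method sketched above:

```rust
#[cfg(test)]
mod tests {
    use super::*;

    #[test]
    fn llama_infer() -> anyhow::Result<()> {
        let device = crate::utils::device()?;
        let llama = LlamaWrap::new(device)?;
        let out = llama.infer("Who was the first man on the moon?")?;
        println!("{out}");
        assert!(!out.is_empty());
        Ok(())
    }
}
```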
Now, let’s run our test. Note the flags I’m passing to enable the `metal` feature.
cd src-tauri
RUST_LOG=info cargo test llama_infer --release --features metal -- --nocapture
cd ..
And VOILÀ … our text inference works!
Let’s enable our text inference through the public API before moving on to the audio inference. It’s simple: we have already stubbed out the `pub fn text()` method of our `struct Instruct`; we’ll just replace the `todo!()` with an actual call. Yet another `todo!()` is DONE.
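In our sketch that amounts to something as simple as delegating to the LLaMA wrapper:

```rust
impl Instruct {
    /// Text-mode inference: delegate straight to the LLaMA wrapper.
    pub fn text(&self, prompt: &str) -> anyhow::Result<String> {
        self.llama.infer(prompt)
    }
}
```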
Wrapping up & Next steps:
In this post we set up the required scaffolding for running our `Desktop Application`, loaded our `LLaMA3` model and got it to generate some text for us. In the next post, Part II of this series, we’ll wrap this up with the audio inference and an integrated frontend.
Till then, adios ..
Before we close today …
If you have found this post helpful, consider spreading the word; it would act as a strong motivator for me to create more. If you found an issue or hit a snag, reach out to me @beingAnubhab.
Acknowledgements & reference
This project is built on the shoulders of stalwarts, a huge shout-out to all of them
- Rust maintainers for this awesome language
- The tauri app and its creators and maintainers
- Meta for creating the LLaMA family of models and giving open-source AI a fair shot
- HuggingFace🤗 for everything they do, including Candle, distil-whisper and Tokenizer
- Georgi Gerganov for creating the GGML/GGUF movement
- Quant Factory Team for the LLaMA GGUF model files
- Svelte team and the creator Rich Harris for Svelte :)
And many, many more …