Part II: Voice Assistant Desktop App with LLaMA3 and Whisper in Rust
Step-by-step tutorial on building a desktop app to interface with the LLM LLaMA3 using text and audio instructions in Rust. This is the 2nd and final installment of the series.
July 7, 2024 · 19 min · 4025 words
In this series we are on a mission to build our very own native desktop app that can interface with the LLM LLaMA3-8B through text and audio instructions. In the previous installment of the series we did our setup with Tauri 2.0 Beta, loaded our LLaMA3 and Whisper models using the HuggingFace Candle framework, and ran our text generation pipeline. In this post we complete the journey with voice instructions and get our app to respond to our audio instructions.
You’ll feel right at home if you are a programmer, have some exposure to Rust and a bit of experience working with Svelte, React or any other modern client-side framework.
With the text inference ready to rock, let's move on to the Whisper-based audio transcription, which is going to be a lot more involved. So, gear up.
Let’s break down the steps to audio transcription.
We receive chunks of audio data from the frontend and keep appending them to a buffer
When the recording stops, the frontend calls the ask() command handler to initiate audio inference, which in turn calls the audio() method of our struct Instruct
The call is then passed on to the WhisperWrap method infer(), which actually runs the transcription.
Note, these steps only generate the transcript of the audio and don't interact with our LLaMA3 yet. We'll work on that in the Pipeline phase of our processing.
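Before diving in, here is a rough recap of the shapes involved. This is a sketch only, not the exact definitions from Part I; the field names match how they are used in the snippets below, and names like LlamaWrap are assumptions.

pub struct Instruct {
    send: Sender<Vec<f32>>, // MPSC sender; incoming audio chunks are pushed here
    whisper: WhisperWrap,   // Whisper model + transcription state
    llama: LlamaWrap,       // LLaMA3 text-generation pipeline from Part I (name assumed)
}

pub struct WhisperWrap {
    data: Mutex<Vec<f32>>,  // accumulated pcm audio samples
    model: Mutex<Whisper>,  // candle Whisper model
    tokenizer: Tokenizer,   // Whisper tokenizer
    config: Config,         // Whisper config (num_mel_bins, max_target_positions, ...)
    mel_filters: Vec<f32>,  // precomputed mel filterbank
    device: Device,         // CPU / Metal / CUDA device
    // ... plus helpers like `default_tokens` used during decoding
}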
We’ve already exposed a tauri command fn audio_chunk() in audio-instruct/src-tauri/src/commands.rs; let’s modify it to actually send the chunks to the instance of our struct Instruct over the MPSC channel we defined earlier.
/// This tauri command would receive a Vec<f32> which represents a chunk of audio being recorded
/// The chunk will be forwarded through the MPSC channel
#[tauri::command]
pub fn audio_chunk(app: tauri::State<'_, Arc<Instruct>>, req: ipc::Request<'_>) -> Result<(), &'static str> {
    if let tauri::ipc::InvokeBody::Raw(data) = req.body() {
        let chunk = bytes_to_f32(&data[..]);
        if let Err(e) = app.send(chunk) {
            error!("audio_chunk: error: {e:?}");
            return Err("invalid chunk");
        }
    } else {
        return Err("invalid chunk");
    }

    Ok(())
}
Let’s also change the listen() method in audio-instruct/src-tauri/src/instruct.rs to actually do some work.
impl Instruct {
    // .. code omitted ..

    /// Exposes an API to send data into our MPSC channel
    pub fn send(&self, data: Vec<f32>) -> Result<()> {
        self.send.send(data)?;
        Ok(())
    }

    // This method just forwards the incoming chunks to a method exposed by our `struct WhisperWrap`.
    // The client doesn't need to wait for this to happen
    fn listen(app: Arc<Instruct>, recv: Receiver<Vec<f32>>) {
        while let Ok(next) = recv.recv() {
            app.whisper.chunk(next);
        }
    }

    // .. code omitted ..
}
Now, we’ll expose a method on struct WhisperWrap to accept the incoming chunks and update its data field.
impl WhisperWrap {
    // ... code omitted ...

    /// Accepts an incoming chunk of data and appends it to the `data` field of the struct
    pub fn chunk(&self, chunk: Vec<f32>) {
        let mut c = chunk;
        let mut chunk = self.data.lock().unwrap();
        // while appending, this also `drains` the incoming Vec<>, saving some space
        chunk.append(&mut c);
    }

    // ... code omitted ...
}
That should do the trick: every time a new audio chunk is received, this method simply appends it to the data field.
Once the audio recording stops, our tauri command ask() is called by the frontend, but this time requesting an audio inference instead of a text inference. We’ll work on the frontend side of this flow later; for now, let’s stay scoped to the backend. Our tauri::command ask() already derives the Mode of the incoming command and forwards it to the Instruct method text() or audio(). Let’s modify our audio() method to trigger the inference.
audio-instruct/src-tauri/src/instruct.rs
/// Public API to trigger audio inference
pub fn audio(&self) -> Result<Response> {
    let (transcript, n_tokens, elapsed) = self.whisper.infer()?;

    // More when we work on the `Pipeline` part of our inference
    Ok(Response::new(&transcript, &transcript, n_tokens, 0))
}
Let’s start with the actual transcription now. Whisper is an encoder-decoder model, unlike LLaMA3, which is a decoder-only model.
In encoder-decoder models the input data is first passed through the encoder part of the model; the output of the encoder is then fed into the decoder part of the model to generate tokens.
Note
A quick read about transformers and the various modules that are involved.
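To make that concrete with the calls we’ll actually use below (candle’s Whisper model exposes the two halves separately), the flow looks roughly like this. It is a conceptual sketch with illustrative variable names, not new API; the real calls appear in decode() later in this post.

// 1. encoder: a mel spectrogram segment -> audio `features`
let features = model.encoder.forward(&mel_segment, true)?;
// 2. decoder: `features` + the tokens generated so far -> hidden states,
//    which `final_linear()` then turns into next-token logits
let dec = model.decoder.forward(&tokens_tensor, &features, true)?;
let logits = model.decoder.final_linear(&dec)?;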
We’ve already discussed the Mel Spectrogram and our mel_filters before; we’ll put them to use now.
Preprocessing involves the candle API pcm_to_mel() (here, pcm stands for the pulse-code-modulation representation of the audio data, and mel is the Mel Spectrogram representation).
impl WhisperWrap {
    // ... code omitted ..

    // 1. Checks if we have valid data
    // 2. Creates the `Mel Spectrogram` representation of our audio data
    // 3. Creates and returns a `Tensor` from the given data
    fn preproc(&self) -> Result<Tensor> {
        let data = match self.data.lock() {
            Ok(mut d) => {
                if d.len() < 4096 * 4 {
                    anyhow::bail!("Not enough audio data in buffer!");
                }
                let d = d.drain(..).collect::<Vec<_>>();
                d
            }
            Err(e) => {
                error!("error acquiring data lock: {e:?}");
                anyhow::bail!("Not enough audio data in buffer!");
            }
        };

        let mel = pcm_to_mel(&self.config, &data[..], &self.mel_filters[..]);
        let mel_len = mel.len();
        let mel = Tensor::from_vec(
            mel,
            (1, self.config.num_mel_bins, mel_len / self.config.num_mel_bins),
            &self.device,
        )?;

        Ok(mel)
    }

    // ... code omitted ..

    /// Runs transcription
    pub fn infer(&self) -> Result<(String, u32, std::time::Duration)> {
        // generates `mel`
        let mels = self.preproc()?;

        let mut model = match self.model.lock() {
            Ok(m) => m,
            Err(e) => {
                error!("infer: error acquiring model lock: {e:?}");
                anyhow::bail!("error during inference");
            }
        };

        let (_, _, content_frames) = mels.dims3()?;
        let mut seek = 0;
        let mut segments = vec![];

        // newline tokens to insert after each segment
        let nltokens = self.tokenizer.encode("\n", false).unwrap().get_ids().to_vec();

        let mut total_dur = Duration::from_millis(0);
        let mut total_tokens = 0;

        // seek through the generated `mels` and call the `decode_segment` method on a chunk
        while seek < content_frames {
            let start = std::time::Instant::now();
            let segment_size = usize::min(content_frames - seek, N_FRAMES);
            let mel_segment = mels.narrow(2, seek, segment_size)?;

            let mut decoded = self.decode_segment(&mut model, &mel_segment)?;

            seek += segment_size;
            total_dur += std::time::Instant::now() - start;
            total_tokens += decoded.tokens.len();

            if decoded.no_speech_prob > NO_SPEECH_THRESHOLD && decoded.avg_logprob < LOGPROB_THRESHOLD {
                println!("no speech detected, skipping {seek} {decoded:?}");
                continue;
            }

            segments.append(&mut decoded.tokens);

            // adding newline tokens after each segment
            nltokens.iter().for_each(|&t| {
                segments.push(t);
            });
        }

        // Let us now create the final text output
        let instruct = self
            .tokenizer
            .decode(&segments, true)
            .map_err(|_| anyhow!("error creating text from tokens"))?;

        Ok((instruct, total_tokens as u32, total_dur))
    }

    // Decodes a single segment at different values of the `temperature` hyperparameter
    // The current values of the hyperparameter TEMPERATURES = [0.0, 0.2, 0.4, 0.6, 0.8, 1.0]
    fn decode_segment(&self, model: &mut Whisper, segment: &Tensor) -> Result<DecodingResult> {
        // Decode at a particular temperature, check if we have a valid result or move on to the next temperature
        for (i, &t) in TEMPERATURES.iter().enumerate() {
            let decoded = self.decode(model, segment, t);
            if i == TEMPERATURES.len() - 1 {
                return decoded;
            }

            match decoded {
                Ok(decoded) => {
                    if decoded.avg_logprob >= LOGPROB_THRESHOLD || decoded.no_speech_prob > NO_SPEECH_THRESHOLD {
                        info!("Decoded: {decoded:?}");
                        return Ok(decoded);
                    }
                }
                Err(e) => {
                    warn!("Error decoding @ temperature: {t}: {e:?}");
                }
            }
        }

        unreachable!()
    }
}
Let’s look at what is happening above: the logical segments of the mel representation are passed on to the method decode_segment(). For each mel segment we attempt to generate an inference at various pre-set temperatures, a hyperparameter used by Whisper for decoding.
Basically, we are trying to get a valid result for each segment across the different temperature values; validity is defined by a few more hyperparameters like LOGPROB_THRESHOLD (roughly, how sure the model is) and NO_SPEECH_THRESHOLD (what are the chances of this segment being random noise?).
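For reference, these are the constants in play. The TEMPERATURES values are the ones quoted above, while the two threshold values below follow the defaults in candle’s whisper example; treat them as tunable assumptions rather than fixed requirements.

// Decoding hyperparameters (threshold values follow candle's whisper example)
const TEMPERATURES: [f64; 6] = [0.0, 0.2, 0.4, 0.6, 0.8, 1.0];
const LOGPROB_THRESHOLD: f64 = -1.0;  // below this average log-probability, retry at a higher temperature
const NO_SPEECH_THRESHOLD: f64 = 0.6; // above this, the segment is likely silence or noise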
Inside self.decode() the mel representation is passed through the encoder and then, after some processing, on to the decoder part of our model. Let’s jump into that; it’s the crux of this whole process.
// inside impl WhisperWrap
fn decode(&self, model: &mut Whisper, segment: &Tensor, temp: f64) -> Result<DecodingResult> {
    // generating some random seed
    let mut rng = rand::thread_rng();

    // the `mel` segment block is passed through the encoder here; you can think of this process
    // as generating `features` from your input audio data segment
    let features = model.encoder.forward(segment, true)?;

    // some token pre-generation, basically creating a `token` representation which tells the `decoder`
    // part of the model [start of transcript, english, task transcribe, don't generate timestamps] -
    // these special tokens kind of act like a prompt for the decoder model
    let mut tokens = self.preproc_decode();

    // now, we initialize some variables that will maintain our stats and metrics derived from this part of the inference
    // probability of this segment being some random/ background noise
    let mut no_speech_prob = f64::MAX;
    // the average probability of our prediction across all the decoding passes
    let mut sum_log_p = 0.;

    // Ok, so now we loop through a `max number of possible outputs` and `autoregressively` generate the next token
    for i in 0..self.config.max_target_positions {
        // we'll convert the `Vec<tokens>` slice to a `Tensor`
        let tensor = Tensor::new(&tokens[..], &self.device)?;
        // and send that tensor to the `decoder` to generate the next token
        let dec = model.decoder.forward(&tensor.unsqueeze(0)?, &features, i == 0)?;

        // Extract the no speech probability on the first iteration by looking at the first
        // token logits and the probability for the according token.
        if i == 0 {
            let logits = model.decoder.final_linear(&dec.i(..1)?)?.i(0)?.i(0)?;
            no_speech_prob = softmax(&logits, 0)?
                .i(self.default_tokens.no_speech as usize)?
                .to_scalar::<f32>()? as f64;
        }

        let (_, seq_len, _) = dec.dims3()?;
        let logits = model
            .decoder
            .final_linear(&dec.i((..1, seq_len - 1..))?)?
            .i(0)?
            .i(0)?;

        // a simple sampler, picked up from `https://github.com/huggingface/candle/blob/main/candle-examples/examples/whisper/main.rs`
        let next_token = if temp > 0. {
            let prs = softmax(&(&logits / temp)?, 0)?;
            let logits_v: Vec<f32> = prs.to_vec1()?;
            let distr = rand::distributions::WeightedIndex::new(&logits_v)?;
            distr.sample(&mut rng) as u32
        } else {
            let logits_v: Vec<f32> = logits.to_vec1()?;
            logits_v
                .iter()
                .enumerate()
                .max_by(|(_, u), (_, v)| u.total_cmp(v))
                .map(|(i, _)| i as u32)
                .unwrap()
        };

        // remember that this decoding is `autoregressive`, meaning the output of the current pass is
        // passed on as input to the next pass till some stop condition is reached
        tokens.push(next_token);

        // bookkeeping the `probability`; we'll calculate the average of this later and that will serve
        // as our `decisioning` benchmark compared against `LOGPROB_THRESHOLD`
        let prob = softmax(&logits, candle_core::D::Minus1)?
            .i(next_token as usize)?
            .to_scalar::<f32>()? as f64;

        // the stop condition: the end-of-transcript token or the max target length has been reached
        if next_token == self.default_tokens.eot || tokens.len() > self.config.max_target_positions {
            break;
        }

        sum_log_p += prob.ln();
    }

    // Finally create a struct to hold our output and metadata for this decoding pass
    Ok(DecodingResult {
        text: self
            .tokenizer
            .decode(&tokens, true)
            .map_err(|_| anyhow!("error creating text from tokens"))?,
        avg_logprob: sum_log_p / tokens.len() as f64,
        tokens,
        no_speech_prob,
        temperature: temp,
    })
}
The code snippet above (please read through the comments for details) can be summarized into the following broad steps:
We pass our mel through the encoder of our model to generate features - the input for the decoder
We generate a set of starter tokens (language, task etc.)
We go into an autoregressive loop using our tokens and mel features - each iteration produces a new token, which we append to our tokens
The loop breaks when we hit certain conditions, like the end-of-transcript token
We also maintain some metrics that we use to decide whether the current segment is speech or noise
That’s it! We have our transcript ready. To test this out we could write a test case that reads an audio file, converts it to pcm, then to a mel spectrogram, and so on, but that’s rather time consuming, and since we are not going to read from a file in our final flow, I’d rather finish off the frontend to test this out.
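That said, if you ever want a quick file-based sanity check, a rough sketch could look like the following. It assumes the hound crate for WAV decoding and a hypothetical 16kHz mono fixture file; constructing the WhisperWrap itself (covered in Part I) is left out.

// Sketch only: `hound` and the fixture path are assumptions, adapt to your setup
#[cfg(test)]
fn transcribe_fixture(whisper: &WhisperWrap) -> anyhow::Result<String> {
    // read a 16kHz mono WAV fixture and normalize the i16 samples to f32
    let reader = hound::WavReader::open("tests/fixtures/hello_16k_mono.wav")?;
    let pcm: Vec<f32> = reader
        .into_samples::<i16>()
        .map(|s| s.map(|v| v as f32 / i16::MAX as f32))
        .collect::<Result<_, _>>()?;

    // feed the samples through the same path the frontend uses
    whisper.chunk(pcm);
    let (transcript, _n_tokens, _elapsed) = whisper.infer()?;
    Ok(transcript)
}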
As we did last time, we already have a Svelte frontend ready, and not a lot has changed from the previous setup. Let’s move on to the unique and juicy bits of the current flow.
Let’s take a moment to figure out what needs to change on the client side to accommodate audio instructions on top of our existing text-only inference!
This is a pre-requisite and not strictly a part of the flow. We’ll need to set up the right permissions so that our client side can access and find the microphones.
a. [TODO] figure this out for Tauri 2.0 Beta
First, of course, we need a way of capturing the audio; since our Rust backend expects pcm-encoded Vec<f32> input, we’ll also need a way of converting the audio waveform to Vec<f32>
We need a way of sending the converted Vec<f32> audio in chunks
We need a way of stopping the recording; once stopped, our client should send the ask() command to the backend to process the audio instruction, which means we’ll end up modifying our call to ask() to support both use cases: text and audio
We’ll add a button alongside our text input; clicking it toggles the record / stop-record functionality.
<script lang="ts">
// Adapted from https://github.com/kgullion/vite-typescript-audio-worklet-example/blob/main/src/main.ts
import audioProcUrl from "$lib/audio-proc/audio-processor?url";
import { invoke } from '@tauri-apps/api/core';
import type { Inference, QuestionAnswer } from "$lib/types";
import Qa from "QA.svelte";

// defining a constant buffer size for chunking audio
const BUFFER_SIZE = 4096;
// sampling rate for whisper 16KHz
const SAMPLE_RATE = 16000; // Whisper typically expects 16kHz audio

// ... code omitted ...

// a variable to hold the media stream
let stream: MediaStream | null = null;
// an audio context
let audioContext: AudioContext | null = null;
let source: MediaStreamAudioSourceNode | null = null;
let workletNode: AudioWorkletNode | null = null;
</script>
Ok, some definitions and explanations
let stream: MediaStream
The MediaStream interface of the Media Capture and Streams API represents a stream of media content. A stream consists of several tracks, such as video or audio tracks.
let audioContext: AudioContext
The AudioContext interface represents an audio-processing graph built from audio modules linked together, each represented by an AudioNode.
An audio context controls both the creation of the nodes it contains and the execution of the audio processing, or decoding. You need to create an AudioContext before you do anything else, as everything happens inside a context. It’s recommended to create one AudioContext and reuse it instead of initializing a new one each time, and it’s OK to use a single AudioContext for several different audio sources and pipeline concurrently.
Simply put, one or more audio devices or streams (sources and destinations) are connected via conceptual AudioNodes, and they are all logically bound together by this AudioContext interface.
let workletNode: AudioWorkletNode
This interface represents the base class for a user-defined AudioNode. It is backed by an AudioWorkletProcessor, where the actual processing happens, but on the browser’s Web Audio rendering thread (separate from the main thread, if I get this right), which makes it pretty efficient and cool.
Ok, now that we understand what those heavy-hitting words are for, let’s move on. Below we define the functions record, stopRecord and toggleRecord. Read the comments for more details.
<script lang="ts">
// audio worklet defined and exported from here
import audioProcUrl from "$lib/audio-proc/audio-processor?url";

// ... code omitted ...

// begins recording of a new stream
const record = async () => {
    if (stream) {
        console.error("Duplicate record??");
        return;
    }

    // requests the `audio` user media using the navigator API. On the first run, this will ask for
    // permission to grant access to the microphone
    stream = await navigator.mediaDevices.getUserMedia({ audio: true });

    // Create AudioContext with our 16KHz sample rate
    audioContext = new AudioContext({ sampleRate: SAMPLE_RATE });

    // Load and register the audio worklet
    // Worker file loaded as a module
    await audioContext.audioWorklet.addModule(audioProcUrl);

    // Create MediaStreamSource
    source = audioContext.createMediaStreamSource(stream);

    // Create AudioWorkletNode - the `audioProcUrl` content is attached to the execution context
    // Like a pipe, the audio stream passes through this transformation
    workletNode = new AudioWorkletNode(audioContext, 'audio-processor', {
        outputChannelCount: [1],
        processorOptions: { bufferSize: BUFFER_SIZE }
    });

    // Connect the nodes
    source.connect(workletNode);
    workletNode.connect(audioContext.destination);

    // Set up message handling from the audio worklet
    workletNode.port.onmessage = handleAudioData;
}

// the output chunk of the AudioWorkletNode is passed on to this function
// and this function `emits` the audio_chunk to the `backend` using `tauri::command audio_chunk()`
const handleAudioData = async (event: MessageEvent): Promise<void> => {
    const float32Array = event.data as Float32Array;
    invoke("audio_chunk", float32Array);
}

// stops the recording, invokes the `ask()` tauri command with an indication that we are going to process audio
// then cleans up all the audio related instances and objects
const stopRecord = async () => {
    goAskAudio();
    if (workletNode) {
        workletNode.disconnect();
        workletNode = null;
    }
    if (source) {
        source.disconnect();
        source = null;
    }
    if (audioContext) {
        audioContext.close();
        audioContext = null;
    }
    if (stream) {
        stream.getTracks().forEach(t => t.stop());
        stream = null;
    }
    recordstart = null;
}

// toggle start/ stop recording
const toggleRecord = async () => {
    isrecording = !isrecording;
    if (isrecording) {
        recordstart = new Date();
        record();
    } else {
        stopRecord();
    }
}

// ... code omitted ...

// prepares an audio inference request and calls the `tauri ask()` command with `audio: true` and `text: undefined`
const goAskAudio = async () => {
    asking = true;
    // We are just using a simple keyword to mark the answer as pending
    qas.push({ q: "..", a: "__asking__", ts: new Date() });
    question = "";
    qas = [...qas];
    // The inference generation is extremely resource intensive, give our UI time to update before the call
    setTimeout(() => { command(undefined, true) }, 100);
}
</script>
Now, let’s look at the audio worklet file we have been talking about.
// Adapted from https://github.com/kgullion/vite-typescript-audio-worklet-example/blob/main/src/main.ts
class AudioProcessor extends AudioWorkletProcessor {
    private bufferSize: number;
    private buffer: Float32Array;
    private bufferIndex: number;

    constructor(options?: AudioWorkletNodeOptions) {
        super();
        this.bufferSize = options?.processorOptions.bufferSize || 4096;
        this.buffer = new Float32Array(this.bufferSize);
        this.bufferIndex = 0;
    }

    // this method takes the input buffer and produces output chunks of a specific size
    process(inputs: Float32Array[][], outputs: Float32Array[][], parameters: Record<string, Float32Array>): boolean {
        const input = inputs[0];
        const channel = input[0];
        if (channel) {
            for (let i = 0; i < channel.length; i++) {
                this.buffer[this.bufferIndex++] = channel[i];
                if (this.bufferIndex === this.bufferSize) {
                    this.port.postMessage(this.buffer);
                    this.buffer = new Float32Array(this.bufferSize);
                    this.bufferIndex = 0;
                }
            }
        }
        return true;
    }
}

registerProcessor('audio-processor', AudioProcessor);
Ok, that should do the trick; it’s time to try this out!
RUST_LOG=info npm run tauri dev --release -- --features metal
And in a few seconds, we should see …
There you go, if you play the video you’ll hear my awkward voice, but you’ll also see the transcript. Our audio workflow WORKS 🎉🎉🎉!
So far, we have our text inference up and running and our audio transcription doing its job. Let’s tie them together in this Pipeline.
We are simply going to pass the transcript on to our text inference as the instruction in the prompt.
impl Instruct {
    // .. code omitted ..

    /// Public API to trigger audio inference
    pub fn audio(&self) -> Result<Response> {
        let (transcript, n_tokens, elapsed) = self.whisper.infer()?;
        let (generated, n_txt_tok, txt_elapsed) = self.llama.infer(&transcript)?;

        Ok(Response::new(
            &transcript,
            &generated,
            (n_tokens + n_txt_tok) as u32,
            (elapsed + txt_elapsed).as_secs(),
        ))
    }
}
And that’s it. Your Personal Voice Assistant is ready for your command.
That was a lot of information! But if you have reached this point you have done GREAT, and you have my thanks and congratulations. You’ve captured audio from a WebView frontend, pcm-encoded the chunks of the waveform, emitted them to the backend, run mel transformations on them, and made predictions with 2 models, all on your own computer.
What next from here you ask? Here are some ideas …
We hardcoded our audio model to work with English only; try a multi-lingual model. That would involve a language-detection phase
Create commands out of your instructions - open notes could open the note taking or text edit app
Implement push-to-talk - global listeners of your app could launch this app and always be there for you
If you have found this post helpful consider spreading the word, it would act as a strong motivator for me to create more.
If you found an issue or hit a snag, reach out to me @beingAnubhab.