In the spirit of AI for everyone, we started building a quick desktop application to interface with the very capable LLaMA3 8B model by Meta AI in our previous post.
The series:
Goals
We completed the basic setup in our previous post. In this part of the series, we’ll load the model and run our first AI inference.
At the end of the series, we should have an application that looks like this:
Who is this for?
You’ll feel right at home if you are a programmer, have some exposure to Rust, and a bit of experience working with Svelte, React or any other modern client-side framework.
Tools and Libraries
- Rust - Install Rust
- Tauri - A cross-platform desktop app toolkit built on Rust
- SvelteKit - For the quick and simple UI
- Llama.cpp - another micro-revolution in democratizing AI models, spearheaded by Georgi Gerganov
- llama_cpp-rs - a Rust library that provides simple, high-level bindings over llama.cpp
TL;DR
Backend
Let’s Plan
- We’ll need a `struct` to instantiate the app. This will initialize the `model` and run `inference` on the incoming instructions.
- We’ll require a way of communicating our `instructions` and `inference` between the frontend and the backend.
- A way of downloading the model and storing it somewhere during the first launch.
- Some logging and handling errors without creating a mess. For logging my default go-to crates are a combination of `pretty_env_logger` and `log`; for errors I doubt there’s anything simpler than the awesome `anyhow` crate by the master David Tolnay.
Add these crates to our `instruct/src-tauri/Cargo.toml`.
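A minimal sketch of the additions (the version numbers are assumptions; pin them as you prefer):

```toml
# instruct/src-tauri/Cargo.toml
[dependencies]
# logging, driven by the RUST_LOG environment variable
log = "0.4"
pretty_env_logger = "0.5"
# ergonomic, catch-all error handling
anyhow = "1"
```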
Instantiation
Let’s create a new file `app.rs` inside the dir `instruct/src-tauri/src` and stub out an empty `struct Instruct`.
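A minimal sketch of the stub (the `new()` constructor and the `anyhow::Result` return type are assumptions at this stage):

```rust
// instruct/src-tauri/src/app.rs
use anyhow::Result;

/// Application state: this will eventually hold the model and run inference.
pub struct Instruct;

impl Instruct {
    /// Instantiate the app state; later this will load (or download) the model.
    pub fn new() -> Result<Self> {
        Ok(Self)
    }
}
```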
It’s obvious that our `struct Instruct` doesn’t do anything yet, but stubbing out the basic building blocks often helps me conceptualize and build faster. We’ll see how this unfolds.
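A rough sketch of wiring this into `main.rs`, assuming a plain `pretty_env_logger::init()` plus the `mod app;` declaration:

```rust
// instruct/src-tauri/src/main.rs
mod app;

use log::info;

fn main() {
    // Honors the RUST_LOG environment variable, e.g. RUST_LOG=info
    pretty_env_logger::init();
    info!("starting instruct");

    tauri::Builder::default()
        .run(tauri::generate_context!())
        .expect("error while running tauri application");
}
```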
Let’s run the app once more to ensure everything is fine so far, and now that we have basic logging set up, let’s run it with the prefix `RUST_LOG=info`:
RUST_LOG=info cargo tauri dev
Our app should compile and pull up an app window once more with the default Welcome message from SvelteKit.
While outlining our plan we had decided that we’ll load (possibly download) the model when we instantiate our `struct Instruct`, but so far it is just an empty shell. Let’s change that.
As we scoped out earlier in this post, we are going to be using the Rust library llama_cpp-rs to interface with the `gguf` model for inference. Let’s add it to our `instruct/src-tauri/Cargo.toml`.
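A sketch of the dependency line; the version and the `metal` feature are assumptions, pick the backend feature that matches your hardware (see the hint below):

```toml
# instruct/src-tauri/Cargo.toml
[dependencies]
# high-level bindings over llama.cpp; enable the GPU backend you need
llama_cpp = { version = "0.3", features = ["metal"] }
```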
Hint: Check the `llama_cpp` crate features for available backends.
A quick introduction to GGML/ GGUF
GGML (Georgi Gerganov Machine Learning) is a fast, web-friendly and powerful tensor library for machine learning without any external dependencies. It enables large models and high performance on regular computers, with features like:
- Efficient computation
- Reduced memory usage
- Easy model training
- Optimizations for various architectures
Quoting the creators on GGUF:
GGUF is a file format for storing models for inference with GGML and executors based on GGML. GGUF is a binary format that is designed for fast loading and saving of models, and for ease of reading. Models are traditionally developed using PyTorch or another framework, and then converted to GGUF for use in GGML.
To summarize, the GGML tensor library and the GGUF file format empower us to run ML model inference in resource-constrained environments with very little additional setup.
Resources: A few resources to dig deeper into GGML/GGUF:
The Model
To load a model for inference we’ll need the following steps:
Step 1. Defining a directory to hold the model data: let’s call it `app_data_dir`
Tauri provides an API `tauri::api::path::data_dir` which resolves to the conventional directory where apps can offload their data. To further distinguish a data directory specific to our app we’ll simply give our app a `scope`; basically, suffix the default data directory with a namespace unique to our app. Let’s write a few utility functions for this and use them during our launch.
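A sketch of such a utility, assuming the scope `portal.llm` (the namespace that shows up in the model path later in this post):

```rust
// instruct/src-tauri/src/app.rs
use std::{fs, path::PathBuf};

use anyhow::{anyhow, Result};

/// Namespace used to scope our app's data directory.
const APP_SCOPE: &str = "portal.llm";

/// Resolve `<data_dir>/portal.llm`, creating it on first launch.
pub fn app_data_dir() -> Result<PathBuf> {
    let base = tauri::api::path::data_dir()
        .ok_or_else(|| anyhow!("could not resolve the platform data directory"))?;
    let dir = base.join(APP_SCOPE);
    if !dir.exists() {
        fs::create_dir_all(&dir)?;
    }
    Ok(dir)
}
```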
And we modify our `main()` function to call this at launch.
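And a sketch of `main()` calling it at launch:

```rust
fn main() {
    pretty_env_logger::init();

    // Resolve (and create) our scoped data directory before the app starts.
    let data_dir = app::app_data_dir().expect("failed to prepare the app data dir");
    log::info!("app data dir: {}", data_dir.display());

    tauri::Builder::default()
        .run(tauri::generate_context!())
        .expect("error while running tauri application");
}
```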
Step 2. Check if our model file exists in our `app_data_dir` or download it
We define the HuggingFace🤗 model `repo` and model `file` as constants and call a function that checks if the file exists, or downloads it.
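A sketch of the constants; the file name matches the path shown below, while the repo id is an assumption based on the QuantFactory GGUF files credited at the end of this post:

```rust
// instruct/src-tauri/src/app.rs
/// HuggingFace🤗 repo hosting the quantized GGUF files.
const MODEL_REPO: &str = "QuantFactory/Meta-Llama-3-8B-Instruct-GGUF";
/// The q8 quantized model file we want.
const MODEL_FILE: &str = "Meta-Llama-3-8B-Instruct.Q8_0.gguf";
```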
Let’s define a function that checks if the model file exists, or downloads it for us. We’ll use the HuggingFace🤗 `hf-hub` crate to do the downloading. We’ll add this crate to our `Cargo.toml` and use it in our `download_model()` function.
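A sketch of `download_model()`, assuming `hf-hub`'s blocking API (something like `hf-hub = "0.3"` in `Cargo.toml`); copying the file out of hf-hub's own cache into our `app_data_dir` is also an assumption:

```rust
// instruct/src-tauri/src/app.rs
use std::{
    fs,
    path::{Path, PathBuf},
};

use anyhow::Result;
use hf_hub::api::sync::ApiBuilder;
use log::info;

/// Return the local path to the model file, downloading it on first launch.
pub fn download_model(app_data_dir: &Path) -> Result<PathBuf> {
    let target = app_data_dir.join(MODEL_FILE);
    if target.exists() {
        info!("model already present at {}", target.display());
        return Ok(target);
    }

    info!("downloading {MODEL_FILE} from {MODEL_REPO}, this may take a while …");
    let api = ApiBuilder::new().with_progress(true).build()?;
    let downloaded = api.model(MODEL_REPO.to_string()).get(MODEL_FILE)?;

    // hf-hub stores files in its own cache layout; copy into our scoped dir.
    fs::copy(&downloaded, &target)?;
    Ok(target)
}
```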
Let’s run the app once more to verify if everything has gone as per plan.
RUST_LOG=info cargo tauri dev
The model file lands at `<app_data_dir>/portal.llm/Meta-Llama-3-8B-Instruct.Q8_0.gguf`.
Note: The `q8` variant of this model is ~8GB; depending on your internet speed, this might take a while.
Step 3: Loading the model
Loading the model is simple using the `llama_cpp` crate as it does a lot of the heavy lifting for us. We simply use the `llama_cpp::LlamaModel::load_from_file()` API with the path to our downloaded model and the default params `llama_cpp::LlamaParams::default()`.
Hint: The `llama_cpp::LlamaParams` struct allows us to configure (among other things) the number of transformer layers we want to load into GPU memory. In a resource-constrained environment, we may choose to offload some transformer layers from the model to the CPU. CPU offloading is great because it enables us to use models larger than our VRAM or GPU memory, but it comes at the cost of inference speed. In our current project I’m working on an Apple M1 Pro with 16GB of unified memory, so a `q8` model should load just fine.
We’ll also modify our `struct Instruct` to hold the instance of the `LlamaModel`. Our initialization code will start to look something like the following:
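A sketch of the updated initialization (error types are folded into `anyhow::Result` for brevity):

```rust
// instruct/src-tauri/src/app.rs
use anyhow::Result;
use llama_cpp::{LlamaModel, LlamaParams};
use log::info;

pub struct Instruct {
    /// The loaded LLaMA3 model.
    model: LlamaModel,
}

impl Instruct {
    pub fn new() -> Result<Self> {
        let data_dir = app_data_dir()?;
        let model_path = download_model(&data_dir)?;

        info!("loading model from {}", model_path.display());
        let model = LlamaModel::load_from_file(&model_path, LlamaParams::default())?;
        info!("model loaded");

        Ok(Self { model })
    }
}
```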
Now, NOW, NOW .. let’s run the app once more.
RUST_LOG=info cargo tauri dev
… drumrolls … THE MODEL LOADS
If you have come this far, congratulate yourself, most of the work is done and you have done great!
Handlers
We need some mechanism of communication between the front-end interface and the backend: the front-end will capture the user’s question, transmit it to the backend for LLM inference, receive the response and display it. This is starting to look like the classic and familiar `server <> client` architecture, isn’t it?
If your answer is YES, that’s because it’s exactly that - a `server <> client` architecture. Tauri achieves this communication with something called Commands, which enables the `client` to invoke Rust functions in the backend. I often imagine `commands` to be like RPC calls to wrap my head around the concept (technically they are probably not the same).
Let’s create a command handler; we’ll use it to pass our question/ instruction from the `client` interface to the `backend`.
We’ll create a new file `instruct/src-tauri/src/commands.rs` and of course declare it as `mod commands;` in our `main.rs`.
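A sketch of the declarations in `main.rs`:

```rust
// instruct/src-tauri/src/main.rs
mod app;
mod commands;
```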
Now, let’s add the command handler in the new `commands.rs` file. Notice the `tauri::command` macro: it marks the function as a command handler and wraps it with the necessary glue code. An important note: the data types used in the inputs and return value of a `tauri::command`-annotated function MUST be serializable.
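A sketch of the handler; the `ask` name and the `prompt` argument follow the naming used later in this post, and for now it only echoes the instruction back:

```rust
// instruct/src-tauri/src/commands.rs
use log::info;
use tauri::State;

use crate::app::Instruct;

/// Inputs and the return type of a #[tauri::command] function must be serializable.
#[tauri::command]
pub fn ask(state: State<'_, Instruct>, prompt: String) -> Result<String, String> {
    info!("received instruction: {prompt}");
    // The managed app state is available here; inference gets wired in below.
    let _instruct: &Instruct = state.inner();
    Ok(format!("you asked: {prompt}"))
}
```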
Note: Notice how we are passing the app state `Instruct` through the state argument of the function; `tauri` does the heavy lifting for this magic, and you’ll find this pattern in most Rust web frameworks - axum and others.
We’ll update the `tauri` initialization in the `main()` function to account for this command handler.
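A sketch of the updated builder, using `manage()` for the app state and `invoke_handler()` for the command:

```rust
// instruct/src-tauri/src/main.rs
fn main() {
    pretty_env_logger::init();

    let instruct = app::Instruct::new().expect("failed to initialize app state");

    tauri::Builder::default()
        // make our app state available to command handlers
        .manage(instruct)
        // register our command handler(s)
        .invoke_handler(tauri::generate_handler![commands::ask])
        .run(tauri::generate_context!())
        .expect("error while running tauri application");
}
```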
Inference
By now, we have most of the blocks ready for the backend to work. We’ll focus on the processing logic and inference for the time being.
The flow will work as follows:
Step 0. [CLIENT, LATER] The user keys in a Question/ Instruction
Step 1. The handler `ask()` receives this `command`
Step 2. The `ask()` function must delegate the inference to some method of `struct Instruct` because that struct is the custodian of the actual model
Step 3. `ask()` function receives the inference or generated text as a `return` statement and `returns` it to the client
Step 4. [CLIENT, LATER] The user sees the response to their Question/ Instruction
We’ll be working on Step 1, Step 2 and Step 3. Let’s begin by creating a method `infer()` for the `struct Instruct` which will do the real inference for us.
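A sketch of `infer()`; the exact `llama_cpp` signatures shift a little between versions, so treat this as an outline rather than a drop-in:

```rust
// instruct/src-tauri/src/app.rs
use anyhow::Result;
use llama_cpp::{standard_sampler::StandardSampler, SessionParams};

impl Instruct {
    pub fn infer(&self, prompt: &str) -> Result<()> {
        // A fresh session (inference context) for each incoming command.
        let mut session = self.model.create_session(SessionParams::default())?;

        // Feed the incoming prompt into the context.
        session.advance_context(prompt)?;

        // Cap generation and keep track of how many tokens we have decoded.
        let max_tokens = 1024;
        let mut decoded_tokens = 0;

        // Generation loop with the default sampler, printing pieces as they arrive.
        let completions = session
            .start_completing_with(StandardSampler::default(), max_tokens)?
            .into_strings();

        for piece in completions {
            print!("{piece}");
            decoded_tokens += 1;
            if decoded_tokens > max_tokens {
                break;
            }
        }

        Ok(())
    }
}
```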
We start by creating an `inference context` or `session` for an incoming `command` and feed it the incoming `prompt`. Then we set the max number of tokens to generate and keep track of the decoded tokens. Finally, we trigger the generation loop and provide it with a default `Sampler`.
Note about Sampling: Auto-regressive models (like GPT, LLaMA …) generate a probability for every token in the vocabulary as their output. Sampling is the method used to choose the next token from that distribution. E.g. top-p sampling uses a cutoff to select a set of tokens in the range of top probability and picks one of the valid candidates, top-k simply selects k tokens from a list of tokens sorted by probability and picks one from that set, and so on. Today, sampling is often done as a chain of operations instead of just one technique. Read more about sampling techniques and how they work.
Exercise: Ditch the default Sampler and experiment with various Sampler chains. You can also have a lot of fun figuring out how sampling techniques impact the generation.
You may notice that we are not really returning any real information from our `infer()` method; we’ll change that. But before going there, let’s write a `test` to see if our `inference` flow works.
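A sketch of the test, reusing the question quoted later in this post:

```rust
// instruct/src-tauri/src/app.rs
#[cfg(test)]
mod tests {
    use super::*;

    #[test]
    fn test_infer() -> anyhow::Result<()> {
        let app = Instruct::new()?;
        app.infer("What is the book 'A Hitchhiker's guide to the galaxy' all about?")?;
        Ok(())
    }
}
```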
It’s a simple `test`: we just initialize our `struct Instruct` and call the `infer` method with a `prompt`. Let’s run it.
cd src-tauri
RUST_LOG=info cargo test --release -- --nocapture
cd ..
And there you go, your first desktop inference.
If you follow the text, you’ll realize that though it’s coherent with our question, it’s not quite what we expect it to be. The issue here is that we are using an instruction-tuned model, not the base text-completion one. Instruction-tuned models expect the input prompt in a certain format. Let’s try and fix this.
Let’s write a helper function to create an instruct template from our incoming `command`.
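A sketch of the helper; the system prompt wording is an assumption:

```rust
/// Wrap an incoming instruction in the LLaMA3 instruct template (broken down below).
fn instruct_template(instruction: &str) -> String {
    format!(
        "<|begin_of_text|>\
         <|start_header_id|>system<|end_header_id|>\n\n\
         You are a helpful assistant.<|eot_id|>\
         <|start_header_id|>user<|end_header_id|>\n\n\
         {instruction}<|eot_id|>\
         <|start_header_id|>assistant<|end_header_id|>\n\n"
    )
}
```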
This will convert incoming text to a structure that LLaMA3 is trained to understand and follow; let’s break it down:
<|begin_of_text|> -- a special token that tells the model that our text starts here
<|start_header_id|>role: (`user` | `system` | `assistant`)<|end_header_id|> - this part defines who is providing the instruction or text. The `assistant` role is reserved for the model's own output and sort of serves as a trigger for the model to fill in its response.
.. text associated with the role ..
<|eot_id|> - the `end of turn` special token which tells the model that one `turn` of input has ended. This is especially useful for `multi-turn` conversations and also serves as one of our `stop` or `break` tokens
Now, we’ll use this helper function in our `infer()` method to convert our incoming text command to a templated prompt. We’ll also modify the `infer()` method to break on certain tokens.
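A sketch of the reworked `infer()`; the token handling (`Token`, `token_to_piece`) is an assumption and may differ by `llama_cpp` version:

```rust
// instruct/src-tauri/src/app.rs
use anyhow::Result;
use llama_cpp::{standard_sampler::StandardSampler, SessionParams, Token};

/// LLaMA3's `<|eot_id|>` token, hardcoded (see the note below).
const EOT_TOKEN_ID: i32 = 128009;

impl Instruct {
    pub fn infer(&self, instruction: &str) -> Result<()> {
        // Convert the raw instruction into the templated prompt.
        let prompt = instruct_template(instruction);

        let mut session = self.model.create_session(SessionParams::default())?;
        session.advance_context(prompt)?;

        let max_tokens = 1024;
        let mut decoded_tokens = 0;

        let completions =
            session.start_completing_with(StandardSampler::default(), max_tokens)?;

        for token in completions {
            // Stop once the model signals the end of its turn.
            if token == Token(EOT_TOKEN_ID) {
                break;
            }
            print!("{}", self.model.token_to_piece(token));

            decoded_tokens += 1;
            if decoded_tokens > max_tokens {
                break;
            }
        }

        Ok(())
    }
}
```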
Let’s test this again ..
cd src-tauri
RUST_LOG=info cargo test --release -- --nocapture
cd ..
And … VOILÀ … it works - a smart summary for our test question `What is the book 'A Hitchhiker's guide to the galaxy' all about?`
Note: If you look at the code above, you’ll realize that I’ve hardcoded a specific `eot_id` to 128009. That’s because `self.model.eot_id()` is not giving us the correct token id, not sure why!
It’s pretty straightforward from here on to call the `infer()` method from our `ask()` handler and get the result string.
We add a few more details to our `struct Response` and modify our `ask()` handler:
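A sketch of `Response`, `Meta` and the updated handler; the field names and types follow the debug output shown at the end of this section:

```rust
// instruct/src-tauri/src/commands.rs
use serde::Serialize;
use tauri::State;

use crate::app::Instruct;

/// Metadata about a single generation.
#[derive(Debug, Serialize)]
pub struct Meta {
    pub n_tokens: usize,
    pub n_secs: u64,
}

/// What the frontend receives for every instruction.
#[derive(Debug, Serialize)]
pub struct Response {
    pub text: String,
    pub meta: Meta,
}

#[tauri::command]
pub fn ask(state: State<'_, Instruct>, prompt: String) -> Result<Response, String> {
    state.infer(&prompt).map_err(|e| e.to_string())
}
```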
And we update our `Instruct::infer()` method to compose the `Response` object instead of printing out the text.
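A sketch of the final `infer()`, accumulating the pieces and composing the `Response` (same caveats on the `llama_cpp` accessors as above):

```rust
// instruct/src-tauri/src/app.rs
use std::time::Instant;

use anyhow::Result;
use llama_cpp::{standard_sampler::StandardSampler, SessionParams, Token};

use crate::commands::{Meta, Response};

impl Instruct {
    pub fn infer(&self, instruction: &str) -> Result<Response> {
        let started = Instant::now();
        let prompt = instruct_template(instruction);

        let mut session = self.model.create_session(SessionParams::default())?;
        session.advance_context(prompt)?;

        let max_tokens = 1024;
        let mut decoded_tokens: usize = 0;
        let mut text = String::new();

        let completions =
            session.start_completing_with(StandardSampler::default(), max_tokens)?;

        for token in completions {
            if token == Token(EOT_TOKEN_ID) {
                break;
            }
            // Accumulate the decoded pieces instead of printing them.
            text.push_str(&self.model.token_to_piece(token));

            decoded_tokens += 1;
            if decoded_tokens > max_tokens {
                break;
            }
        }

        Ok(Response {
            text: text.trim().to_string(),
            meta: Meta {
                n_tokens: decoded_tokens,
                n_secs: started.elapsed().as_secs(),
            },
        })
    }
}
```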
Add an `info!()` to the `infer()` call of our test case and run:
cd src-tauri
RUST_LOG=info cargo test --release -- --nocapture
cd ..
and we should now get something like
Response {
text: "A comedic science fiction series by Douglas Adams that follows the misadventures of an unwitting human named Arthur Dent as he travels through space with his friend Ford Prefect, an alien who is researching Earth for a travel guide. The story explores themes of identity, technology, and humanity's place in the universe.",
meta: Meta {
n_tokens: 63,
n_secs: 3,
},
}
Wrapping up and next steps
In this post we got the backend of our desktop application to a working state: it loads a model and can follow our instructions to generate some output. In the final Part III of this series we’ll create the client-side interface and get the app ready for end-to-end inference.
Before we close today ..
If you have found this post helpful, consider spreading the word; it would act as a strong motivator for me to create more. In case you chance upon an issue or hit a snag, reach out to me @beingAnubhab.
Acknowledgements & reference
This project is built on the shoulders of stalwarts; a huge shout-out to all of them:
- Rust maintainers for this awesome language
- The tauri app and its creators and maintainers
- Meta for creating the LLaMA family of models and giving open-source AI a fair shot
- HuggingFace🤗 for everything they do
- Georgi Gerganov for creating the GGML/GGUF movement
- llama.cpp maintainers for moving at breakneck speed
- llama_cpp-rs maintainers for the awesome yet simple crate
- Quant Factory Team for the GGUF model files
And many, many more …