In the spirit of AI for everyone, we started building a quick desktop application to interface with the very capable LLaMA3 8B model by Meta AI in our previous post.
The series:
Goals
We completed the basic setup in our previous post. In this part of the series, we’ll load the model and run our first AI inference.
At the end of the series, we should have an application that looks like this:
Who is this for?
You’ll feel right at home if you are a programmer, have some exposure to Rust, and a bit of experience working with Svelte, React or any other modern client-side framework.
Tools and Libraries
- Rust - Install Rust
- Tauri - A cross-platform desktop app toolkit built on Rust
- SvelteKit - For the quick and simple UI
- Llama.cpp - another micro-revolution in democratizing AI models, spearheaded by Georgi Gerganov
- llama_cpp-rs - a Rust library that provides simple, high-level bindings over llama.cpp
TL;DR
Backend
Let’s Plan
- We’ll need a `struct` to instantiate the app. This will initialize the `model` and run `inference` on the incoming instructions.
- We’ll require a way of communicating our `instructions` and `inference` between the frontend and the backend.
- A way of downloading the model and storing it somewhere during the first launch.
- Some logging and handling errors without creating a mess. For logging my default go-to crates are a combination of `pretty_env_logger` and `log`; for errors I doubt there’s anything simpler than the awesome `anyhow` crate by the master David Tolnay.
Add these crates to our `instruct/src-tauri/Cargo.toml`.
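A minimal sketch of the additions (the version numbers are assumptions; pin them as you prefer):

```toml
# instruct/src-tauri/Cargo.toml
[dependencies]
# logging, driven by the RUST_LOG environment variable
log = "0.4"
pretty_env_logger = "0.5"
# ergonomic, catch-all error handling
anyhow = "1"
```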
Instantiation
Let’s create a new file `app.rs` inside the dir `instruct/src-tauri/src` and stub out an empty `struct Instruct`.
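A minimal sketch of the stub (the `new()` constructor and the `anyhow::Result` return type are assumptions at this stage):

```rust
// instruct/src-tauri/src/app.rs
use anyhow::Result;

/// Application state: this will eventually hold the model and run inference.
pub struct Instruct;

impl Instruct {
    /// Instantiate the app state; later this will load (or download) the model.
    pub fn new() -> Result<Self> {
        Ok(Self)
    }
}
```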
It’s obvious that our `struct Instruct` doesn’t do anything yet, but stubbing out the basic building blocks often helps me conceptualize and build faster. We’ll see how this unfolds.
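A rough sketch of wiring this into `main.rs`, assuming a plain `pretty_env_logger::init()` plus the `mod app;` declaration:

```rust
// instruct/src-tauri/src/main.rs
mod app;

use log::info;

fn main() {
    // Honors the RUST_LOG environment variable, e.g. RUST_LOG=info
    pretty_env_logger::init();
    info!("starting instruct");

    tauri::Builder::default()
        .run(tauri::generate_context!())
        .expect("error while running tauri application");
}
```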
Let’s run the app once more to ensure everything is fine so far, and now that we have basic logging set up, let’s run it with the prefix `RUST_LOG=info`:
RUST_LOG=info cargo tauri dev
Our app should compile and pull up an app window once more with the default Welcome message from SvelteKit.
While outlining our plan we had decided that we’ll load (possibly download) the model when we instantiate our `struct Instruct`, but so far it is just an empty shell. Let’s change that.
As we scoped out earlier in this post, we are going to be using the Rust library llama_cpp-rs to interface with the `gguf` model for inference. Let’s add it to our `instruct/src-tauri/Cargo.toml`.
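A sketch of the dependency line; the version and the `metal` feature are assumptions, pick the backend feature that matches your hardware (see the hint below):

```toml
# instruct/src-tauri/Cargo.toml
[dependencies]
# high-level bindings over llama.cpp; enable the GPU backend you need
llama_cpp = { version = "0.3", features = ["metal"] }
```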
Hint: Check the `llama_cpp` crate features for available backends.
A quick introduction to GGML/ GGUF
GGML (Georgi Gerganov Machine Learning) is a fast, web-friendly and powerful tensor library for machine learning without any external dependencies. It enables large models and high performance on regular computers, with features like:
- Efficient computation
- Reduced memory usage
- Easy model training
- Optimizations for various architectures
Quoting the creators on GGUF:
GGUF is a file format for storing models for inference with GGML and executors based on GGML. GGUF is a binary format that is designed for fast loading and saving of models, and for ease of reading. Models are traditionally developed using PyTorch or another framework, and then converted to GGUF for use in GGML.
To summarize, the GGML tensor library and the GGUF file format empower us to run ML model inference in resource-constrained environments with very little additional setup.
Resources: A few resources to dig deeper into GGML/GGUF:
The Model
To load a model for inference we’ll need the following steps:
Step 1. Defining a directory to hold the model data: let’s call it `app_data_dir`
Tauri provides an API `tauri::api::path::data_dir` which resolves to the conventional directory where apps can offload their data. To further distinguish a data directory specific to our app we’ll simply give our app a `scope`; basically, suffix the default data directory with a namespace unique to our app. Let’s write a few utility functions for this and use them during our launch.
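A sketch of such a utility, assuming the scope `portal.llm` (the namespace that shows up in the model path later in this post):

```rust
// instruct/src-tauri/src/app.rs
use std::{fs, path::PathBuf};

use anyhow::{anyhow, Result};

/// Namespace used to scope our app's data directory.
const APP_SCOPE: &str = "portal.llm";

/// Resolve `<data_dir>/portal.llm`, creating it on first launch.
pub fn app_data_dir() -> Result<PathBuf> {
    let base = tauri::api::path::data_dir()
        .ok_or_else(|| anyhow!("could not resolve the platform data directory"))?;
    let dir = base.join(APP_SCOPE);
    if !dir.exists() {
        fs::create_dir_all(&dir)?;
    }
    Ok(dir)
}
```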
And we modify our `main()` function to call this at launch.
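And a sketch of `main()` calling it at launch:

```rust
fn main() {
    pretty_env_logger::init();

    // Resolve (and create) our scoped data directory before the app starts.
    let data_dir = app::app_data_dir().expect("failed to prepare the app data dir");
    log::info!("app data dir: {}", data_dir.display());

    tauri::Builder::default()
        .run(tauri::generate_context!())
        .expect("error while running tauri application");
}
```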
Step 2. Check if our model file exists in our `app_data_dir` or download it
We define the HuggingFace🤗 model `repo` and model `file` as constants and call a function that checks if the file exists, or downloads it.
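A sketch of the constants; the file name matches the path shown below, while the repo id is an assumption based on the QuantFactory GGUF files credited at the end of this post:

```rust
// instruct/src-tauri/src/app.rs
/// HuggingFace🤗 repo hosting the quantized GGUF files.
const MODEL_REPO: &str = "QuantFactory/Meta-Llama-3-8B-Instruct-GGUF";
/// The q8 quantized model file we want.
const MODEL_FILE: &str = "Meta-Llama-3-8B-Instruct.Q8_0.gguf";
```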
Let’s define a function that checks if the model file exists, or downloads it for us. We’ll use the HuggingFace🤗 `hf-hub` crate to do the downloading. We’ll add this crate to our `Cargo.toml` and use it in our `download_model()` function.
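A sketch of `download_model()`, assuming `hf-hub`'s blocking API (something like `hf-hub = "0.3"` in `Cargo.toml`); copying the file out of hf-hub's own cache into our `app_data_dir` is also an assumption:

```rust
// instruct/src-tauri/src/app.rs
use std::{
    fs,
    path::{Path, PathBuf},
};

use anyhow::Result;
use hf_hub::api::sync::ApiBuilder;
use log::info;

/// Return the local path to the model file, downloading it on first launch.
pub fn download_model(app_data_dir: &Path) -> Result<PathBuf> {
    let target = app_data_dir.join(MODEL_FILE);
    if target.exists() {
        info!("model already present at {}", target.display());
        return Ok(target);
    }

    info!("downloading {MODEL_FILE} from {MODEL_REPO}, this may take a while …");
    let api = ApiBuilder::new().with_progress(true).build()?;
    let downloaded = api.model(MODEL_REPO.to_string()).get(MODEL_FILE)?;

    // hf-hub stores files in its own cache layout; copy into our scoped dir.
    fs::copy(&downloaded, &target)?;
    Ok(target)
}
```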
Let’s run the app once more to verify if everything has gone as per plan.
RUST_LOG=info cargo tauri dev
The model file lands at `<app_data_dir>/portal.llm/Meta-Llama-3-8B-Instruct.Q8_0.gguf`.
Note: The `q8` variant of this model is ~8GB; depending on your internet speed, this might take a while.
Step 3: Loading the model
Loading the model is simple using the `llama_cpp` crate as it does a lot of the heavy lifting for us. We simply use the `llama_cpp::LlamaModel::load_from_file()` API with the path to our downloaded model and the default params `llama_cpp::LlamaParams::default()`.
Hint: The `llama_cpp::LlamaParams` struct allows us to configure (among other things) the number of transformer layers we want to load into GPU memory. In a resource-constrained environment, we may choose to offload some transformer layers from the model to the CPU. CPU offloading is great because it enables us to use models larger than our VRAM or GPU memory, but it comes at the cost of inference speed. In our current project I’m working on an Apple M1 Pro with 16GB of unified memory, so a `q8` model should load just fine.
We’ll also modify our `struct Instruct` to hold the instance of the `LlamaModel`. Our initialization code will start to look something like the following:
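A sketch of the updated initialization (error types are folded into `anyhow::Result` for brevity):

```rust
// instruct/src-tauri/src/app.rs
use anyhow::Result;
use llama_cpp::{LlamaModel, LlamaParams};
use log::info;

pub struct Instruct {
    /// The loaded LLaMA3 model.
    model: LlamaModel,
}

impl Instruct {
    pub fn new() -> Result<Self> {
        let data_dir = app_data_dir()?;
        let model_path = download_model(&data_dir)?;

        info!("loading model from {}", model_path.display());
        let model = LlamaModel::load_from_file(&model_path, LlamaParams::default())?;
        info!("model loaded");

        Ok(Self { model })
    }
}
```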
Now, NOW, NOW .. let’s run the app once more.
RUST_LOG=info cargo tauri dev
… drumrolls … THE MODEL LOADS
If you have come this far, congratulate yourself, most of the work is done and you have done great!
Handlers
We need some mechanism of communication between the front-end interface and the backend: the front-end will capture the user’s question, transmit it to the backend for LLM inference, receive the response and display it. This is starting to look like the classic and familiar `server <> client` architecture, isn’t it?
If your answer is YES, that’s because it’s exactly that - a `server <> client` architecture. Tauri achieves this communication with something called Commands, which enables the `client` to invoke Rust functions in the backend. I often imagine `commands` to be like RPC calls to wrap my head around the concept (technically they are probably not the same).
Let’s create a command handler; we’ll use it to pass our question/ instruction from the `client` interface to the `backend`.
We’ll create a new file `instruct/src-tauri/src/commands.rs` and of course declare it as `mod commands;` in our `main.rs`.
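A sketch of the declarations in `main.rs`:

```rust
// instruct/src-tauri/src/main.rs
mod app;
mod commands;
```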
Now, let’s add the command handler in the new `commands.rs` file. Notice the `tauri::command` macro: it marks the function as a command handler and wraps it with the necessary glue code. An important note: the data types used in the inputs and return value of a `tauri::command`-annotated function MUST be serializable.
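A sketch of the handler; the `ask` name and the `prompt` argument follow the naming used later in this post, and for now it only echoes the instruction back:

```rust
// instruct/src-tauri/src/commands.rs
use log::info;
use tauri::State;

use crate::app::Instruct;

/// Inputs and the return type of a #[tauri::command] function must be serializable.
#[tauri::command]
pub fn ask(state: State<'_, Instruct>, prompt: String) -> Result<String, String> {
    info!("received instruction: {prompt}");
    // The managed app state is available here; inference gets wired in below.
    let _instruct: &Instruct = state.inner();
    Ok(format!("you asked: {prompt}"))
}
```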
Note: Notice how we are passing the app state `Instruct` through the state argument of the function; `tauri` does the heavy lifting for this magic, and you’ll find this pattern in most Rust web frameworks - axum and others.
We’ll update the `tauri` initialization in the `main()` function to account for this command handler.
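A sketch of the updated builder, using `manage()` for the app state and `invoke_handler()` for the command:

```rust
// instruct/src-tauri/src/main.rs
fn main() {
    pretty_env_logger::init();

    let instruct = app::Instruct::new().expect("failed to initialize app state");

    tauri::Builder::default()
        // make our app state available to command handlers
        .manage(instruct)
        // register our command handler(s)
        .invoke_handler(tauri::generate_handler![commands::ask])
        .run(tauri::generate_context!())
        .expect("error while running tauri application");
}
```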
Inference
By now, we have most of the blocks ready for the backend to work. We’ll focus on the processing logic and inference for the time being.
The flow will work as follows:
Step 0. [CLIENT, LATER] The user keys in a Question/ Instruction
Step 1. The handler `ask()` receives this `command`
Step 2. The `ask()` function must delegate the inference to some method of `struct Instruct` because that struct is the custodian of the actual model
Step 3. `ask()` function receives the inference or generated text as a `return` statement and `returns` it to the client
Step 4. [CLIENT, LATER] The user sees the response to their Question/ Instruction
We’ll be working on Step 1, Step 2 and Step 3. Let’s begin by creating a method `infer()` for the `struct Instruct` which will do the real inference for us.
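A sketch of `infer()`; the exact `llama_cpp` signatures shift a little between versions, so treat this as an outline rather than a drop-in:

```rust
// instruct/src-tauri/src/app.rs
use anyhow::Result;
use llama_cpp::{standard_sampler::StandardSampler, SessionParams};

impl Instruct {
    pub fn infer(&self, prompt: &str) -> Result<()> {
        // A fresh session (inference context) for each incoming command.
        let mut session = self.model.create_session(SessionParams::default())?;

        // Feed the incoming prompt into the context.
        session.advance_context(prompt)?;

        // Cap generation and keep track of how many tokens we have decoded.
        let max_tokens = 1024;
        let mut decoded_tokens = 0;

        // Generation loop with the default sampler, printing pieces as they arrive.
        let completions = session
            .start_completing_with(StandardSampler::default(), max_tokens)?
            .into_strings();

        for piece in completions {
            print!("{piece}");
            decoded_tokens += 1;
            if decoded_tokens > max_tokens {
                break;
            }
        }

        Ok(())
    }
}
```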
We start by creating an `inference context` or `session` for an incoming `command` and feed it the incoming `prompt`. Then we set the max number of tokens to generate and keep track of the decoded tokens. Finally, we trigger the generation loop and provide it with a default `Sampler`.
Note about Sampling: Auto-regressive models (like GPT, LLaMA …) generate a probability for every token in the vocabulary as their output. Sampling is the method used to choose the next token from that distribution. E.g. top-p sampling uses a cutoff to select a set of tokens in the range of top probability and picks one of the valid candidates, top-k simply selects k tokens from a list of tokens sorted by probability and picks one from that set, and so on. Today, sampling is often done as a chain of operations instead of just one technique. Read more about sampling techniques and how they work.
Exercise: Ditch the default Sampler and experiment with various Sampler chains. You can also have a lot of fun figuring out how sampling techniques impact the generation.
You may notice that we are not really returning any real information from our `infer()` method; we’ll change that. But before going there, let’s write a `test` to see if our `inference` flow works.
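A sketch of the test, reusing the question quoted later in this post:

```rust
// instruct/src-tauri/src/app.rs
#[cfg(test)]
mod tests {
    use super::*;

    #[test]
    fn test_infer() -> anyhow::Result<()> {
        let app = Instruct::new()?;
        app.infer("What is the book 'A Hitchhiker's guide to the galaxy' all about?")?;
        Ok(())
    }
}
```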
It’s a simple `test`: we just initialize our `struct Instruct` and call the `infer` method with a `prompt`. Let’s run it.
cd src-tauri
RUST_LOG=info cargo test --release -- --nocapture
cd ..
And there you go, your first desktop inference.
If you follow the text, you’ll realize that though it’s coherent with our question, it’s not quite what we expect it to be. The issue here is that we are using an instruction-tuned model, not the base text-completion one. Instruction-tuned models expect the input prompt in a certain format. Let’s try and fix this.
Let’s write a helper function to create an instruct template from our incoming `command`.
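A sketch of the helper; the system prompt wording is an assumption:

```rust
/// Wrap an incoming instruction in the LLaMA3 instruct template (broken down below).
fn instruct_template(instruction: &str) -> String {
    format!(
        "<|begin_of_text|>\
         <|start_header_id|>system<|end_header_id|>\n\n\
         You are a helpful assistant.<|eot_id|>\
         <|start_header_id|>user<|end_header_id|>\n\n\
         {instruction}<|eot_id|>\
         <|start_header_id|>assistant<|end_header_id|>\n\n"
    )
}
```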
This will convert incoming text to a structure that LLaMA3 is trained to understand and follow; let’s break it down:
<|begin_of_text|> -- a special token that tells the model that our text starts here
<|start_header_id|>role: (`user` | `system` | `assistant`)<|end_header_id|> - this part defines who is providing the instruction or text. The `assistant` role is reserved for the model's own output and sort of serves as a trigger for the model to fill in its response.
.. text associated with the role ..
<|eot_id|> - the `end of turn` special token which tells the model that one `turn` of input has ended. This is especially useful for `multi-turn` conversations and also serves as one of our `stop` or `break` tokens
Now, we’ll use this helper function in our `infer()` method to convert our incoming text command to a templated prompt. We’ll also modify the `infer()` method to break on certain tokens.
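A sketch of the reworked `infer()`; the token handling (`Token`, `token_to_piece`) is an assumption and may differ by `llama_cpp` version:

```rust
// instruct/src-tauri/src/app.rs
use anyhow::Result;
use llama_cpp::{standard_sampler::StandardSampler, SessionParams, Token};

/// LLaMA3's `<|eot_id|>` token, hardcoded (see the note below).
const EOT_TOKEN_ID: i32 = 128009;

impl Instruct {
    pub fn infer(&self, instruction: &str) -> Result<()> {
        // Convert the raw instruction into the templated prompt.
        let prompt = instruct_template(instruction);

        let mut session = self.model.create_session(SessionParams::default())?;
        session.advance_context(prompt)?;

        let max_tokens = 1024;
        let mut decoded_tokens = 0;

        let completions =
            session.start_completing_with(StandardSampler::default(), max_tokens)?;

        for token in completions {
            // Stop once the model signals the end of its turn.
            if token == Token(EOT_TOKEN_ID) {
                break;
            }
            print!("{}", self.model.token_to_piece(token));

            decoded_tokens += 1;
            if decoded_tokens > max_tokens {
                break;
            }
        }

        Ok(())
    }
}
```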
Let’s test this again ..
cd src-tauri
RUST_LOG=info cargo test --release -- --nocapture
cd ..
And … VOILÀ … it works - a smart summary for our test question `What is the book 'A Hitchhiker's guide to the galaxy' all about?`
Note: If you look at the code above, you’ll realize that I’ve hardcoded a specific `eot_id` to 128009. That’s because `self.model.eot_id()` is not giving us the correct token id, not sure why!
It’s pretty straightforward from here on to call the `infer()` method from our `ask()` handler and get the result string.
We add a few more details to our `struct Response` and modify our `ask()` handler:
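A sketch of `Response`, `Meta` and the updated handler; the field names and types follow the debug output shown at the end of this section:

```rust
// instruct/src-tauri/src/commands.rs
use serde::Serialize;
use tauri::State;

use crate::app::Instruct;

/// Metadata about a single generation.
#[derive(Debug, Serialize)]
pub struct Meta {
    pub n_tokens: usize,
    pub n_secs: u64,
}

/// What the frontend receives for every instruction.
#[derive(Debug, Serialize)]
pub struct Response {
    pub text: String,
    pub meta: Meta,
}

#[tauri::command]
pub fn ask(state: State<'_, Instruct>, prompt: String) -> Result<Response, String> {
    state.infer(&prompt).map_err(|e| e.to_string())
}
```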
And we update our `Instruct::infer()` method to compose the `Response` object instead of printing out the text.
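A sketch of the final `infer()`, accumulating the pieces and composing the `Response` (same caveats on the `llama_cpp` accessors as above):

```rust
// instruct/src-tauri/src/app.rs
use std::time::Instant;

use anyhow::Result;
use llama_cpp::{standard_sampler::StandardSampler, SessionParams, Token};

use crate::commands::{Meta, Response};

impl Instruct {
    pub fn infer(&self, instruction: &str) -> Result<Response> {
        let started = Instant::now();
        let prompt = instruct_template(instruction);

        let mut session = self.model.create_session(SessionParams::default())?;
        session.advance_context(prompt)?;

        let max_tokens = 1024;
        let mut decoded_tokens: usize = 0;
        let mut text = String::new();

        let completions =
            session.start_completing_with(StandardSampler::default(), max_tokens)?;

        for token in completions {
            if token == Token(EOT_TOKEN_ID) {
                break;
            }
            // Accumulate the decoded pieces instead of printing them.
            text.push_str(&self.model.token_to_piece(token));

            decoded_tokens += 1;
            if decoded_tokens > max_tokens {
                break;
            }
        }

        Ok(Response {
            text: text.trim().to_string(),
            meta: Meta {
                n_tokens: decoded_tokens,
                n_secs: started.elapsed().as_secs(),
            },
        })
    }
}
```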
Add an `info!()` to the `infer()` call of our test case and run:
cd src-tauri
RUST_LOG=info cargo test --release -- --nocapture
cd ..
and we should now get something like
Response {
text: "A comedic science fiction series by Douglas Adams that follows the misadventures of an unwitting human named Arthur Dent as he travels through space with his friend Ford Prefect, an alien who is researching Earth for a travel guide. The story explores themes of identity, technology, and humanity's place in the universe.",
meta: Meta {
n_tokens: 63,
n_secs: 3,
},
}
Wrapping up and next steps
In this post we got the backend of our desktop application to a working state: it loads a model and can follow our instructions to generate some output. In the final Part III of this series we’ll create the client-side interface and get the app ready for end-to-end inference.
Before we close today ..
If you have found this post helpful, consider spreading the word; it would act as a strong motivator for me to create more. In case you chance upon an issue or hit a snag, reach out to me @beingAnubhab.
Acknowledgements & reference
This project is built on the shoulders of stalwarts; a huge shout-out to all of them:
- Rust maintainers for this awesome language
- The tauri app and its creators and maintainers
- Meta for creating the LLaMA family of models and giving open-source AI a fair shot
- HuggingFace🤗 for everything they do
- Georgi Gerganov for creating the GGML/GGUF movement
- llama.cpp maintainers for moving at breakneck speed
- llama_cpp-rs maintainers for the awesome yet simple crate
- Quant Factory Team for the GGUF model files
And many, many more …