Retrieval Augmented Generation (RAG) is an approach in natural language processing that combines the power of large language models with traditional knowledge retrieval. This technique enhances the capabilities of AI systems by allowing them to access and incorporate relevant information from databases or documents during the generation process. RAG addresses some limitations of traditional language models by providing up-to-date, authoritative information and reducing hallucinations.

In this tutorial, we’ll explore how to build a local RAG system - our own Document QA.

Series Snapshot
  • Part 1 (this): we implement a workflow to generate embeddings from text data using Stella_en_1.5B_v5, along with some context-aware text splitting using the text-splitter crate.
  • Part 2: we’ll build our own mini Vector Store inspired by Spotify’s ANNOY.
  • Part 3: we create the workflow to analyze and extract text from .pdf files and integrate our LLaMA Model.
  • Part 4: we work on the retrieve-and-answer flow from our corpus.
  • Part 5: we implement and evaluate some techniques for a better RAG.

TL;DR

Github

Output


Setup

Quickstart
  • Rust and Cargo toolchain: install
  • Tauri 2.0 cross-platform app development framework: install
  • HuggingFace Candle: install

The Choice of Embedding model

In machine learning, embeddings represent complex data, such as words, images, or users, as dense vectors in a continuous space (like a numerical map). These vectors (visualize them as points in a graph) are arranged in a way that similar things are close together, and dissimilar things are far apart. This captures the semantic meaning and relationships between the data points.

E.g. Word embeddings, like Word2Vec or GloVe, represent words as vectors in a way that captures their meaning and context. For instance, the vector representations for “dog” and “cat” would be closer together than the vectors for “dog” and “car”.
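
To make “close together” and “far apart” concrete: similarity between embedding vectors is usually measured with cosine similarity. Here is a tiny, self-contained Rust sketch with made-up 3-dimensional vectors (real embeddings have hundreds or thousands of dimensions; the numbers below are purely illustrative):

// Toy illustration of cosine similarity between made-up "embeddings".
// The vectors and their values are purely illustrative, not from any real model.
fn cosine_similarity(a: &[f32], b: &[f32]) -> f32 {
    let dot: f32 = a.iter().zip(b).map(|(x, y)| x * y).sum();
    let norm_a: f32 = a.iter().map(|x| x * x).sum::<f32>().sqrt();
    let norm_b: f32 = b.iter().map(|x| x * x).sum::<f32>().sqrt();
    dot / (norm_a * norm_b)
}

fn main() {
    let dog = [0.8, 0.1, 0.2];
    let cat = [0.7, 0.2, 0.2];
    let car = [0.1, 0.9, 0.4];

    // "dog" vs "cat" should score higher than "dog" vs "car"
    println!("dog~cat: {:.3}", cosine_similarity(&dog, &cat));
    println!("dog~car: {:.3}", cosine_similarity(&dog, &car));
}

The “dog”/“cat” pair scores higher than “dog”/“car”, which is exactly the property a good embedding model gives us for real text.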

Considerations for Choosing an Embedding Model

ML models tasked with generating Embeddings from objects are powerful systems but not without their limitations. Some models will perform better for a particular domain than others, so we should take some time (and research) to decide on the right model for our use-case.

I consider the following key points when I go about zeroing in on an embedding model:

  1. Modality: This one is obvious: if our use-case requires images as the input modality, we’ll need an embedding model that can create image embeddings. While working with text (which is more common) we’ll need a model that can represent text as embeddings. Even with text there are other considerations. E.g. if you intend to search by words, GloVe or Word2Vec will probably yield better results, but if you are trying to generate embeddings for sentences and paragraphs, you’ll need models specialized for that.

  2. Size: With infinite resources, a transformer-based model with tens of billions of parameters is likely to yield better results. But in the real world we’ll often compromise on the quality of embeddings in favor of the resources at hand. E.g. in our current use-case we plan on running an embedding model along with a vector store and a LLaMA 3.x LLM, all on a 16GB machine. We MUST choose a small embedding model for this to even work.

  3. Efficiency: If how fast you serve results outweighs how accurate they are, you’ll probably end up choosing a model that can generate embeddings quickly. As a rule of thumb (especially for Transformer-based models), smaller models tend to be faster.

  4. Output and Input Dimensions: If we are working with very long-form text and breaking it apart into sentences or paragraphs won’t make sense, we’ll need a model with a large input context. But if we can work with smaller paragraphs or phrases (e.g. user reviews or queries), a large input context would just be a waste of resources, and we’d probably get worse results because each embedding would have to represent too much loosely related information. Likewise, output dimensionality should match how much information needs to be represented: very large inputs (say 4096 words) squeezed into a 128-dimensional output could mean loss of essential contextual data. Output dimensionality also drives the size of your embedding storage; very large output vectors cost more in terms of vector storage and distance computation (see the quick estimate after this list).

  5. Domain: Generic text models may not perform well for specialized domains like law, medicine etc. We’ll need to evaluate models trained specifically on datasets related to our domain.
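
To put the storage point from (4) into numbers, here is a back-of-the-envelope estimate (hypothetical corpus of one million chunks, stored as raw f32 vectors):

fn main() {
    // Hypothetical corpus: 1,000,000 chunks to embed
    let num_vectors: u64 = 1_000_000;
    let bytes_per_f32: u64 = 4;

    for dims in [128u64, 1024, 4096] {
        let bytes = num_vectors * dims * bytes_per_f32;
        // Raw vector data only; index structures and metadata add more on top
        println!("{dims:>4} dims -> {:.1} GB", bytes as f64 / 1e9);
    }
}

At 1024 dimensions (the head size we’ll configure for Stella later), a million chunks already amount to roughly 4 GB of raw vectors before any index overhead; at 4096 dimensions it’s over 16 GB.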

Pro tip!
Experiment with multiple models on your data and benchmark results quantitatively and qualitatively to narrow in on what works.

Choosing the Embedding Model

We are spoilt for choice 🤩🧐 (thanks to HuggingFace), with SOTA models popping up every day. Every time I start on a new project involving embeddings, I end up in a state of analysis paralysis, spending an undue amount of time deciding on the model to use. To liberate myself, here’s my attempt at formulating a set of steps to settle on my choice of embedding model:

  1. Start with the MTEB Leaderboard and navigate to the Retrieval task.
  2. Look at the top model (by Average) that fits my resource budget.
  3. Go to the model card for that model and check the following:
    • Does this model require some sort of special licensing?
    • Is the model available in the framework I’m using? HuggingFace Transformers or Candle or vanilla PyTorch etc.
    • If not, can this model be quickly implemented?
  4. If the model fails any of these checks, I go back to the leaderboard and try the next best model that fits my bill!

mteb leaderboard

That’s how I ended up with dunzhang/stella_en_1.5B_v5. When I started working on this project, the Stella family of models was not available in candle-transformers, so I decided to code it up [PR for Stella_en_1.5B]. Later, along with GitHub user iskng, we added support for the smaller Stella_en_400M [PR] to Candle.

We’ll use the Stella_en_1.5B_v5 model in this project.

The ONNX Way

The easiest way of using a model without having access to the code is with ONNX; in fact, we’ll use a Detectron2 ONNX model in the 3rd installment of this series for Document Layout Analysis.

To find out if a model has been exported to ONNX, check the Files and versions tab of the model on 🤗 HuggingFace for .onnx file(s). Unfortunately, sometimes that is not an option. You can try loading the model with the transformers library and exporting it to .onnx yourself, but ONNX runtime compatibility issues often make this a non-trivial task.

Embeddings

Splitting Text

We briefly talked about input and output dimensions as one of the factors to consider while choosing our embedding model. Let’s zoom into the input dimensions now. Say we are trying to embed a text block of 9000 words (let’s assume that translates into 8192 tokens), but our embedding model is trained on inputs of 512 tokens; we’ll need to figure out a way of splitting those 9000 words into ~512-token chunks.

A brute-force approach would be to simply take consecutive chunks of 512 tokens from the tokenized text: 8192 / 512 = 16 chunks, each of which fits our model’s input. But in doing so we are most likely going to lose valuable contextual information. We might end up with a chunk that looks like:

... and with that we can compute the outer dimensionality. Then, we

In the example above the text is chopped off abruptly at “we”, which is meaningless to the model.
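
For reference, the brute-force split is essentially the following (a hypothetical sketch using the tokenizers crate; chunk boundaries fall wherever token 512, 1024, … happens to land, regardless of meaning):

use tokenizers::Tokenizer;

// Naive fixed-size chunking: split purely by token count, ignoring meaning.
// Hypothetical helper, not part of our final implementation.
fn naive_chunks(tokenizer: &Tokenizer, text: &str, chunk_size: usize) -> anyhow::Result<Vec<String>> {
    let encoding = tokenizer.encode(text, false).map_err(|e| anyhow::anyhow!(e))?;
    encoding
        .get_ids()
        .chunks(chunk_size)
        .map(|ids| tokenizer.decode(ids, true).map_err(|e| anyhow::anyhow!(e)))
        .collect()
}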

We could split the text at sentence boundaries (basically splitting at ., ? etc.), but then a model capable of handling 512 input tokens would receive inputs that are mostly much smaller than optimal, resulting in inefficient usage of the embedding context.

What’s the ideal, then? Ideally, the splitter would chunk the input text along contextual boundaries while packing each chunk as close as possible to the size that should be fed to the embedding model.

Thankfully, we have the crate text-splitter to achieve this.

This crate provides methods for splitting longer pieces of text into smaller chunks, aiming to maximize a desired chunk size, but still splitting at semantically sensible boundaries whenever possible.

For now, let’s add this to our Cargo.toml; we’ll detail the functionality in the next section:

src-tauri/Cargo.toml
[dependencies]
..
text-splitter = { version = "0", features = ["tokenizers"] }
..
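
As a quick preview of what the crate gives us (a minimal sketch; the tokenizer path is a placeholder), chunking with a token-based sizer looks roughly like this:

use text_splitter::{ChunkConfig, TextSplitter};
use tokenizers::Tokenizer;

fn main() -> anyhow::Result<()> {
    // Use the same tokenizer as the embedding model so chunk sizes are measured in its tokens
    let tokenizer =
        Tokenizer::from_file("models/qwen_tokenizer.json").map_err(|e| anyhow::anyhow!(e))?;

    // Aim for at most 512 tokens per chunk, splitting at semantic boundaries where possible
    let splitter = TextSplitter::new(ChunkConfig::new(512).with_sizer(tokenizer));

    let text = "A long document goes here ...";
    for chunk in splitter.chunks(text) {
        println!("{} chars: {chunk:?}", chunk.len());
    }
    Ok(())
}

Sizing chunks with the same tokenizer as the embedding model ensures the 512 limit is measured in the model’s own tokens, not in characters or words.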

Generating Embeddings

Let’s create a struct Embed for our embedding related tasks.

src-tauri/src/embed.rs
// Imports omitted

// A container for our embedding related tasks
pub struct Embed {
    device: Device,
    model: stella_en_v5::EmbeddingModel,
    tokenizer: Tokenizer,
    splitter: TextSplitter<Tokenizer>,
}

And the initializer for the struct Embed ..

src-tauri/src/embed.rs
// .. imports and declarations omitted ..

impl Embed {
    // Constants
    // The max batch size we can pass to the stella model for generating embeddings.
    // The size `8` is based on my 16GB of memory, considering the memory requirements of the application,
    // including the other models and other runtime memory requirements
    pub const STELLA_MAX_BATCH: usize = 8;
    // Split size is based on the recommended `input` tokens size
    pub const SPLIT_SIZE: usize = 512;
    // The paths to files required for the Stella_en_1.5B_v5 to run
    // Check example: https://github.com/huggingface/candle/tree/main/candle-examples/examples/stella-en-v5 for smaller model
    pub const BASE_MODEL_FILE: &'static str = "qwen2.safetensors";
    pub const HEAD_MODEL_FILE: &'static str = "embed_head1024.safetensors";
    pub const TOKENIZER_FILE: &'static str = "qwen_tokenizer.json";

    pub fn new(dir: &Path) -> Result<Self> {
        let cfg = stella_en_v5::Config::new_1_5_b_v5(stella_en_v5::EmbedDim::Dim1024);

        let device = select_device()?;

        // unsafe inherited from candle_core::safetensors
        let qwen = unsafe {
            VarBuilder::from_mmaped_safetensors(
                &[dir.join(Self::BASE_MODEL_FILE)],
                candle_core::DType::BF16,
                &device,
            )?
        };

        let head = unsafe {
            VarBuilder::from_mmaped_safetensors(
                &[dir.join(Self::HEAD_MODEL_FILE)],
                candle_core::DType::F32,
                &device,
            )?
        };

        let model = stella_en_v5::EmbeddingModel::new(&cfg, qwen, head)?;
        let mut tokenizer =
            Tokenizer::from_file(dir.join(Self::TOKENIZER_FILE)).map_err(|e| anyhow!(e))?;
        let pad_id = tokenizer.token_to_id("<|endoftext|>").unwrap();

        tokenizer.with_padding(Some(PaddingParams {
            strategy: PaddingStrategy::BatchLongest,
            direction: PaddingDirection::Left,
            pad_id,
            pad_token: "<|endoftext|>".to_string(),
            ..Default::default()
        }));

        let splitter = TextSplitter::new(
            ChunkConfig::new(Self::SPLIT_SIZE)
                .with_sizer(tokenizer.clone())
                .with_overlap(Self::SPLIT_SIZE / 4)?,
        );

        Ok(Self {
            device,
            model,
            tokenizer,
            splitter,
        })
    }
}

No magic there; a few points of relevance:

  1. The model file names are hardcoded in our implementation; it’s better to download the relevant files on the first run. Check the Stella example for some pointers on how to achieve this, or see the hedged download sketch after this list.

  2. The tokenizer initialization here is critical: the Stella 1.5B variant requires PaddingDirection::Left, while Stella 400M uses the more standard PaddingDirection::Right strategy.

  3. We initialize our TextSplitter to chunk incoming text to best fit the Stella recommended input size of 512 tokens, defined as a constant. Notice how we configure the splitter with the method with_overlap(SPLIT_SIZE / 4). This is a context-enrichment RAG technique where we enrich the embedding data by extending the context into neighboring chunks; in our case, we tell the splitter to include up to 128 tokens of overlap from neighboring chunks.
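
For completeness, here is roughly how a first-run download could look using the hf-hub crate (a hedged sketch, assuming hf-hub is added as a dependency; the file name passed to get(..) is illustrative and the actual file names on the Hub may differ from the constants above):

use std::path::PathBuf;

use anyhow::Result;
use hf_hub::api::sync::Api;

// Downloads (and caches) a single file from the Hugging Face Hub and returns its local path.
// Sketch only: adjust the repo id / file names to whatever your setup actually needs.
fn fetch_model_file(repo_id: &str, filename: &str) -> Result<PathBuf> {
    let api = Api::new()?;
    let repo = api.model(repo_id.to_string());
    Ok(repo.get(filename)?)
}

fn main() -> Result<()> {
    let tokenizer_path = fetch_model_file("dunzhang/stella_en_1.5B_v5", "tokenizer.json")?;
    println!("cached at {}", tokenizer_path.display());
    Ok(())
}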

Let’s expose an API to generate embeddings for a given text batch.

src-tauri/src/embed.rs
impl Embed {
    // Code omitted ..

    // Prepends `prompt` template and tokenizes a `query`
    pub fn query(&mut self, query_batch: &[String]) -> Result<Tensor> {
        let q = query_batch
            .par_iter()
            .map(|q| format!("Instruct: Given a web search query, retrieve relevant passages that answer the query.\nQuery: {q}"))
            .collect::<Vec<_>>();

        self.embeddings(&q)
    }

    // Tokenizes a text batch and generates embeddings via the model's forward pass
    pub fn embeddings(&mut self, batch: &[String]) -> Result<Tensor> {
        let mut token_batch = self.tokenize(batch)?;
        let mut ids = Tensor::zeros(
            (token_batch.len(), token_batch[0].get_ids().len()),
            DType::U32,
            &self.device,
        )?;
        let mut masks = Tensor::zeros(
            (token_batch.len(), token_batch[0].get_ids().len()),
            DType::U8,
            &self.device,
        )?;

        for (i, e) in token_batch.drain(..).enumerate() {
            let input_id = Tensor::from_iter(e.get_ids().to_vec(), &self.device)?.unsqueeze(0)?;
            let mask = Tensor::from_iter(e.get_attention_mask().to_vec(), &self.device)?
                .to_dtype(DType::U8)?
                .unsqueeze(0)?;

            ids = ids.slice_assign(&[i..i + 1, 0..input_id.dims2().unwrap().1], &input_id)?;
            masks = masks.slice_assign(&[i..i + 1, 0..mask.dims2().unwrap().1], &mask)?;
        }

        Ok(self.model.forward(&ids, &masks)?)
    }
    
    fn tokenize(&self, doc: &[String]) -> Result<Vec<Encoding>> {
        self.tokenizer
            .encode_batch(doc.to_vec(), true)
            .map_err(|e| anyhow!(e))
    }

    // Code omitted ..
}

Stella_en_v5 is trained on two tasks: S2P (retrieval) and S2S (semantic similarity). For our Document QA we’ll treat our flow as an S2P task. The method query(..) is responsible for wrapping the user’s input in the prompt template recommended by the authors of Stella.

The method embeddings(..) tokenizes a batch of text and runs it through the forward pass of the model to generate the embeddings for the batch.
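
To tie the two methods together, here is a hypothetical usage sketch from the caller’s side (the models directory path is an assumption):

// Hypothetical usage of the `Embed` API defined above
fn demo(models_dir: &std::path::Path) -> anyhow::Result<()> {
    let mut embed = Embed::new(models_dir)?;

    // Embed a couple of document chunks -> Tensor of shape (2, 1024) with the Dim1024 head
    let docs = [
        "Rust is a systems programming language.".to_string(),
        "Candle is a minimalist ML framework for Rust.".to_string(),
    ];
    let doc_embeddings = embed.embeddings(&docs)?;
    println!("doc embeddings: {:?}", doc_embeddings.dims());

    // Embed a user query with the S2P prompt template -> Tensor of shape (1, 1024)
    let query_embedding = embed.query(&["What is Candle?".to_string()])?;
    println!("query embedding: {:?}", query_embedding.dims());

    Ok(())
}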

We’ll also need a method that splits our text documents into the desired chunks.

src-tauri/src/embed.rs
impl Embed {
    // Code omitted ..
    
    pub fn split_text_and_encode(
        &mut self,
        doc: &str,
    ) -> Vec<(std::string::String, candle_core::Tensor)> {
        let splits = self.splitter.chunks(doc).collect::<Vec<_>>();

        splits
            .chunks(Self::STELLA_MAX_BATCH)
            .flat_map(|c| {
                let embed = self
                    .embeddings(&c.iter().map(|c| c.to_string()).collect::<Vec<_>>()[..])
                    .ok()?;
                Some(
                    c.iter()
                        .enumerate()
                        .filter_map(move |(i, &txt)| {
                            if let Ok(t) = embed.i(i) {
                                Some((txt.to_string(), t))
                            } else {
                                None
                            }
                        })
                        .collect::<Vec<_>>(),
                )
            })
            .flatten()
            .collect::<Vec<_>>()
    }
}

So that’s it: our split_text_and_encode(..) method accepts a text block, splits it into the desired chunks using the TextSplitter, iterates over the chunks in batches and generates embeddings for each batch.

Time for a simple test case to validate the behaviour of our embedding flow.

src-tauri/src/embed.rs
#[cfg(test)]
mod tests {
    use std::path::Path;

    use anyhow::Result;

    use super::*;

    #[test]
    fn split_and_encode() -> Result<()> {
        let docs = &[
"There are many effective ways to reduce stress. .. some text ..".to_string(),
"## President of China lunches with Brazilian President

Brazil: Hu Jintao, the President of the People's Republic of China had lunch today with the President of Brazil, Luiz Inácio Lula da Silva, at the Granja do Torto,
.. more text here .. lot more text here ..
A hearing started today over the death of Australian cricket coach David Hookes. Hookes died after an incident outside a hotel in Melbourne, Australia on the 19th of January.".to_string()
        ];

        let mut embed = Embed::new(Path::new("../models"))?;
        docs.iter().enumerate().for_each(|(idx, d)| {
            println!("{idx} ---------------");
            let d = embed.split_text_and_encode(d);
            d.iter().enumerate().for_each(|(chunk, (s, _))| {
                println!("Chunk {chunk} ================");
                println!("{s:?}");
            });
        });
        Ok(())
    }
}

Let’s run this test.

cd src-tauri
cargo test split_and_encode --release -- --nocapture
cd ..
Note
To demonstrate the effect of chunking and splitting, I have changed pub const SPLIT_SIZE to 256 instead of 512.

Results …

chunk and split

Looking closely at the screengrab, you’ll see that the first document is smaller than the test chunk size of 256, so there is no split there. The second document has been split into multiple chunks, but the split is not naive: the splitter has attempted to preserve some context. E.g. the first chunk of the second document is just the document heading, whereas the second and third chunks have been created from the next paragraph. The text highlighted in yellow is the overlap we spoke about earlier; the splitter has added some overlapping context between consecutive chunks, which hopefully helps us better preserve the meaning of each chunk.

Next steps

Now that our embedding generation is ready, the next logical step is to save these embeddings for search and retrieval. The class of applications used for storing and retrieving embeddings is called Vector Stores or Vector Databases. There are plenty of vector stores out there and we could pick one of them, but where’s the fun in that? This series is about demystifying the RAG pipeline, and I believe building our own Vector Store will best serve our purpose. In Part 2 of this series, we build our own vector store.
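
Until then, here is a toy sketch of what retrieval boils down to: score every (text, embedding) pair produced by split_text_and_encode(..) against a query embedding with cosine similarity and keep the best match. This linear scan is only for intuition; a real vector store (the subject of Part 2) exists precisely to avoid scanning everything.

use candle_core::{DType, Tensor};

// Cosine similarity between two 1-D embedding tensors (data is pulled to the host as f32).
fn cosine_sim(a: &Tensor, b: &Tensor) -> anyhow::Result<f32> {
    let a = a.flatten_all()?.to_dtype(DType::F32)?.to_vec1::<f32>()?;
    let b = b.flatten_all()?.to_dtype(DType::F32)?.to_vec1::<f32>()?;
    let dot: f32 = a.iter().zip(&b).map(|(x, y)| x * y).sum();
    let na: f32 = a.iter().map(|x| x * x).sum::<f32>().sqrt();
    let nb: f32 = b.iter().map(|x| x * x).sum::<f32>().sqrt();
    Ok(dot / (na * nb))
}

// Linear scan over all chunks: returns the text whose embedding is closest to the query.
fn best_chunk<'a>(query: &Tensor, chunks: &'a [(String, Tensor)]) -> anyhow::Result<Option<&'a str>> {
    let mut best: Option<(&'a str, f32)> = None;
    for (text, emb) in chunks {
        let score = cosine_sim(query, emb)?;
        if best.map_or(true, |(_, s)| score > s) {
            best = Some((text.as_str(), score));
        }
    }
    Ok(best.map(|(text, _)| text))
}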