Retrieval-Augmented Generation (RAG) is an approach in natural language processing that combines the power of large language models with traditional knowledge retrieval. This technique enhances the capabilities of AI systems by allowing them to access and incorporate relevant information from databases or documents during the generation process. RAG addresses some limitations of traditional language models by providing up-to-date, authoritative information and reducing hallucinations.
In this tutorial, we’ll explore how to build a local RAG system - our own Document QA.
Series Snapshot
- Part 1 (this): we implement a workflow to generate `Embeddings` from text data using `Stella_en_1.5B_v5` and some context-aware text-splitting techniques using the crate `text-splitter`.
- Part 2: we’ll build our own mini `Vector Store` inspired by Spotify’s ANNOY.
- Part 3: we create the workflow to analyze and extract text from our documents.
- Part 4: we work on the retrieve-and-answer flow from our corpus.
- Part 5: we implement and evaluate some techniques for a better RAG.
TL;DR
Output
Note: This video has been sped up
Setup
Quickstart
The Choice of Embedding Model
In machine learning, embeddings represent complex data, such as words, images, or users, as dense vectors in a continuous space (like a numerical map). These vectors (visualize them as points on a graph) are arranged in a way that similar things are close together, and dissimilar things are far apart. This captures the semantic meaning and relationships between the data points.
E.g. Word embeddings, like Word2Vec or GloVe, represent words as vectors in a way that captures their meaning and context. For instance, the vector representations for “dog” and “cat” would be closer together than the vectors for “dog” and “car”.
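To make “close together” measurable, similarity between embedding vectors is usually computed with cosine similarity. Here is a minimal, self-contained sketch with made-up 4-dimensional vectors (real embeddings have hundreds or thousands of dimensions):

```rust
/// Cosine similarity: dot(a, b) / (|a| * |b|).
/// Close to 1.0 means the vectors point in the same direction (similar meaning),
/// close to 0.0 means they are largely unrelated.
fn cosine_similarity(a: &[f32], b: &[f32]) -> f32 {
    let dot: f32 = a.iter().zip(b).map(|(x, y)| x * y).sum();
    let norm_a = a.iter().map(|x| x * x).sum::<f32>().sqrt();
    let norm_b = b.iter().map(|x| x * x).sum::<f32>().sqrt();
    dot / (norm_a * norm_b)
}

fn main() {
    // Toy vectors, hand-crafted so that "dog" and "cat" point roughly the same way.
    let dog = [0.80, 0.65, 0.10, 0.05];
    let cat = [0.75, 0.70, 0.15, 0.10];
    let car = [0.05, 0.10, 0.90, 0.70];

    println!("dog vs cat: {:.2}", cosine_similarity(&dog, &cat)); // high (~1.0)
    println!("dog vs car: {:.2}", cosine_similarity(&dog, &car)); // low  (~0.2)
}
```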
Considerations for Choosing an Embedding Model
ML models tasked with generating `Embeddings` from objects are powerful systems, but not without their limitations. Some models will perform better for a particular domain than others, so we should take some time (and research) to decide on the right model for our use-case.
I consider the following key points when zeroing in on an embedding model:
- Modality: This one is obvious: if our use-case requires `image` as the input modality, we’ll need an embedding model that can create image embeddings. While working with `text` (which is more common) we’ll need a model that can represent `text` as embeddings. Even with `text` there are other considerations. E.g. if you intend to search by `words`, `GloVe` or `Word2Vec` will probably yield better results, but if you are trying to generate embeddings for `sentences` and `paragraphs`, you’ll need specialists for that.
- Size: With infinite resources, a transformer-based model with tens of billions of parameters is likely to yield better results. But in the real world we’ll often compromise on the quality of embeddings in favor of the resources at hand. E.g. in our current use-case we plan on using an `Embedding` model along with a vector store and a LLaMA 3.x LLM, all running on a 16GB memory machine. We MUST choose a small embedding model for this to even work.
- Efficiency: If how fast you serve your results outweighs how accurate your results are, you’ll probably end up choosing a model that can generate embeddings fast. As a rule of thumb (especially in the case of transformer-based models), smaller models tend to be faster.
- Output and Input Dimensions: Let’s assume we are working with very long-form text and breaking it apart into sentences or paragraphs won’t make sense; we’ll need a model with a large input context. But if we can work with smaller paragraphs or phrases (e.g. user reviews or queries), a large input context would just be a waste of resources, and we’ll probably get worse results because the information in the embeddings would be too loose. Likewise, output dimensionality should be weighed against what information needs to be represented. Very large inputs (say 4096 words) squeezed into a 128-dimensional output could mean loss of essential contextual data. Output dimensionality also plays a crucial role in the size of your `Embedding Storage`; very large output vectors could end up costing us more in terms of vector storage and distance computation (see the quick storage estimate right after this list).
- Domain: Generic text models may not perform well for specialized domains like law, medicine etc. We’ll need to evaluate models trained specifically on datasets related to our domain.
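To put the storage point in perspective, here is a back-of-the-envelope sketch (the corpus size and dimensions are just illustrative numbers) of how output dimensionality drives the raw size of a vector store:

```rust
fn main() {
    // Assume a corpus of 100k chunks and f32 embeddings (4 bytes per dimension).
    let num_chunks: u64 = 100_000;
    let bytes_per_dim: u64 = 4;

    for dims in [256u64, 1024, 4096] {
        let bytes = num_chunks * dims * bytes_per_dim;
        println!("{dims:>4} dims -> {:.2} GB of raw vectors", bytes as f64 / 1e9);
    }
    // Prints roughly 0.10 GB, 0.41 GB and 1.64 GB respectively, and that's before
    // any index structures or metadata the vector store adds on top.
}
```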
Pro tip! Experiment with multiple models on your data and benchmark the results quantitatively and qualitatively to narrow in on what works.
Choosing the Embedding Model
We are spoilt for choice 🤩🧐 (thanks to HuggingFace), with SOTA models popping up every day. Every time I start on a new project involving embeddings, I end up in a state of analysis paralysis, spending an undue amount of time deciding on the model to use. To liberate myself, here’s my attempt at formulating a set of steps to freeze my choice of `Embedding` model:
- Start with the MTEB Leaderboard and navigate to the `Retrieval` task.
- Look at the top model (by Average) that fits my resource size.
- Go to the `model card` for the model and check the following:
  - Does this model require some sort of special licensing?
  - Is the model available in the framework I’m using (HuggingFace Transformers, Candle, vanilla PyTorch etc.)? If not, can this model be quickly implemented?
- If the answer to the questions above is `NO`, I’ll go back to the leaderboard and check the next best model that fits my bill!
That’s how I ended up on `dunzhang/stella_en_1.5B_v5`. When I started working on this project the `Stella` family of models was not available in `candle-transformers`, so I decided to code it up [PR for Stella_en_1.5B]. Later, along with GitHub user iskng, we added support for the smaller Stella_en_400M [PR] in `Candle`.

We’ll use the `Stella_en_1.5B_v5` model in this project.
The ONNX Way: The easiest way of using a model without having access to its code is with `ONNX`; in fact, we’ll use a Detectron2 `ONNX` model in the 3rd installment of this series for Document Layout Analysis. To find out if a model has been exported to `ONNX`, check the `files and versions` tab of the model on 🤗 HuggingFace for `.onnx` file(s). Unfortunately, sometimes that too is not an option. You can try to load the model using the `transformers` library and export it to `.onnx` yourself, but `ONNX Runtime` compatibility issues often make this a non-trivial task.
Embeddings
Splitting Text
We briefly talked about the concept of input and output dimensions as one of the factors to consider while choosing our `Embedding` model. Let’s zoom into the input dimensions now. Say we are trying to embed a text block of 9000 words (let’s assume that translates into 8192 tokens), but our `Embedding` model is trained on text chunks with an input size of 512 tokens; we’ll need to figure out a way of splitting those 9000 words into ~512-token chunks.
A brute-force approach could be to simply take chunks of 512 tokens from the tokenized text, and we’d end up with 8192 / 512 = 16 chunks which would fit into our model’s input, BUT in doing so we are most likely going to lose valuable contextual information. We might end up with a chunk that looks like:
... and with that we can compute the outer dimensionality. Then, we
In the example above the text is chopped off abruptly at “we”, which will be meaningless to the model.
We could split the text at sentence boundaries (basically splitting at `.`, `?` etc.), but this would mean that a model capable of handling 512 input tokens would have most of its inputs smaller than the optimal, resulting in inefficient usage of the embedding context.
What’s the ideal then? I guess the ideal splitting would chunk the input text along contextual boundaries and sort of figure out the best fit that should be fed to the `Embedding` model.
Thankfully, we have the crate `text-splitter` to achieve this.
This crate provides methods for splitting longer pieces of text into smaller chunks, aiming to maximize a desired chunk size, but still splitting at semantically sensible boundaries whenever possible.
For now, let’s add this to our `Cargo.toml`; we’ll detail the functionality in the next section:
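Something along these lines should do. The version number below is illustrative (pin whatever is current for your setup); the `tokenizers` feature lets `text-splitter` measure chunk sizes with the same HuggingFace tokenizer we use for the model:

```toml
# Illustrative version only; use the latest release that works for you.
# (candle-core, candle-nn, candle-transformers, tokenizers and anyhow are
#  assumed to already be in this [dependencies] section for the rest of the post.)
[dependencies]
text-splitter = { version = "0.18", features = ["tokenizers"] }
```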
Generating Embeddings
Let’s create a struct `Embed` for our embedding-related tasks.
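A minimal sketch of what the struct could look like; the exact field names, the `stella_en_v5::EmbeddingModel` type path and the `SPLIT_SIZE` constant are assumptions based on the candle Stella example and the rest of this post:

```rust
use candle_core::Device;
use candle_transformers::models::stella_en_v5::EmbeddingModel;
use text_splitter::TextSplitter;
use tokenizers::Tokenizer;

/// Chunk size (in tokens) we target for Stella's retrieval input.
pub const SPLIT_SIZE: usize = 512;

/// Everything the embedding flow needs, bundled together:
/// the device, the Stella model, its tokenizer and the text splitter.
pub struct Embed {
    device: Device,
    model: EmbeddingModel,
    tokenizer: Tokenizer,
    // Splitter sized by the same tokenizer, so chunk lengths are measured in tokens.
    splitter: TextSplitter<Tokenizer>,
}
```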
And the initializer for the struct `Embed`:
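Here is a sketch of the constructor. The weight-loading part is glossed over with a hypothetical `load_stella_1_5b` helper and a placeholder model path (the candle Stella example shows the real loading code); what matters here are the left-padding setup and the splitter configured with an overlap. The `ChunkConfig`/`with_overlap` calls follow recent `text-splitter` releases, where `with_overlap` is fallible:

```rust
// In addition to the imports shown with the struct definition above:
use std::path::Path;
use text_splitter::ChunkConfig;
use tokenizers::{PaddingDirection, PaddingParams, PaddingStrategy};

impl Embed {
    pub fn new() -> anyhow::Result<Self> {
        let device = Device::Cpu;

        // Hardcoded placeholder path; ideally the files are downloaded on first run.
        let model_dir = Path::new("../models/stella_en_1.5B_v5");

        // Hypothetical helper standing in for the safetensors/VarBuilder boilerplate
        // that builds a `stella_en_v5::EmbeddingModel`.
        let model = load_stella_1_5b(model_dir, &device)?;

        // Stella 1.5B expects LEFT padding; the 400M variant pads to the right.
        let mut tokenizer = Tokenizer::from_file(model_dir.join("tokenizer.json"))
            .map_err(anyhow::Error::msg)?;
        tokenizer.with_padding(Some(PaddingParams {
            strategy: PaddingStrategy::BatchLongest,
            direction: PaddingDirection::Left,
            ..Default::default()
        }));

        // Chunk to ~SPLIT_SIZE tokens, overlapping neighbouring chunks by a quarter
        // of that so each chunk carries a bit of its surrounding context.
        let splitter = TextSplitter::new(
            ChunkConfig::new(SPLIT_SIZE)
                .with_sizer(tokenizer.clone())
                .with_overlap(SPLIT_SIZE / 4)?,
        );

        Ok(Self { device, model, tokenizer, splitter })
    }
}
```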
No magic there, just a couple of points of relevance:

- The model files are hardcoded in our implementation; it’s always better to download the relevant files on the first run. Check the Stella example for some pointers on how to achieve this.
- The `tokenizer` initialization here is critical: the `Stella 1.5B` variant requires `PaddingDirection::Left`, while `Stella 400M` uses the more standard `PaddingDirection::Right` strategy.
- We are initializing our `TextSplitter` to chunk the incoming text to best fit the `Stella` recommended input size of 512, defined as a constant. Notice how we initialize our splitter with the method `with_overlap(SPLIT_SIZE / 4)`. This is a context-enrichment RAG technique where we enrich the embedding data by extending the context to neighboring chunks. In our case, we are telling the splitter to add 128 tokens of text from the neighboring chunks.
Let’s expose an API to generate embeddings for a given text batch.
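A sketch of what this could look like: a `query(..)` helper that applies Stella’s retrieval prompt, and an `embeddings(..)` method that runs a batch through the model. The prompt string is the `s2p` template from the Stella model card; the exact signature of the model’s `forward` call is an assumption here, so treat it as illustrative:

```rust
// In addition to the earlier imports:
use candle_core::Tensor;

impl Embed {
    /// Wraps a user query in Stella's `s2p` retrieval prompt template.
    /// Documents are embedded as-is; only queries get the instruction prefix.
    pub fn query(&self, q: &str) -> String {
        format!(
            "Instruct: Given a web search query, retrieve relevant passages that answer the query.\nQuery: {q}"
        )
    }

    /// Tokenizes a batch of texts and runs the model's forward pass,
    /// returning one embedding row per input text.
    pub fn embeddings(&mut self, batch: &[String]) -> anyhow::Result<Tensor> {
        // Batch-encode with padding (left-padded for the 1.5B variant, see `new()`).
        let encodings = self
            .tokenizer
            .encode_batch(batch.to_vec(), true)
            .map_err(anyhow::Error::msg)?;

        // Build [batch, seq_len] tensors for the token ids and the attention mask.
        let ids = Tensor::stack(
            &encodings
                .iter()
                .map(|e| Tensor::new(e.get_ids(), &self.device))
                .collect::<candle_core::Result<Vec<_>>>()?,
            0,
        )?;
        let mask = Tensor::stack(
            &encodings
                .iter()
                .map(|e| Tensor::new(e.get_attention_mask(), &self.device))
                .collect::<candle_core::Result<Vec<_>>>()?,
            0,
        )?;

        // Assumed forward signature: (input_ids, attention_mask) -> pooled embeddings.
        Ok(self.model.forward(&ids, &mask)?)
    }
}
```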
`Stella_en_v5` is trained on 2 tasks: `S2P` (retrieval) and `S2S` (semantic similarity). For our Document QA we’ll be treating our flow as an `S2P` task. The method `query(..)` is responsible for converting the user’s input to the prompt template recommended by the authors of `Stella`.
The method `embeddings(..)` tokenizes a batch of text and runs it through the forward pass of the model to generate the embeddings for the batch.
We’ll also need a method that would split our text documents into desired chunks.
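A sketch of such a method follows. `BATCH_SIZE` is an assumed constant to keep memory usage bounded, and the return type (chunk text paired with its embedding) is a design choice for this sketch rather than the project’s exact signature:

```rust
/// Assumed batch size; small enough to keep peak memory in check on a 16GB machine.
const BATCH_SIZE: usize = 8;

impl Embed {
    /// Splits a document into context-aware chunks and embeds them batch by batch,
    /// returning each chunk's text alongside its embedding.
    pub fn split_text_and_encode(&mut self, doc: &str) -> anyhow::Result<Vec<(String, Tensor)>> {
        // The splitter keeps chunks within SPLIT_SIZE tokens and applies the
        // overlap configured in the constructor.
        let chunks: Vec<String> = self.splitter.chunks(doc).map(String::from).collect();

        let mut out = Vec::with_capacity(chunks.len());
        for batch in chunks.chunks(BATCH_SIZE) {
            let embeddings = self.embeddings(batch)?;
            for (i, text) in batch.iter().enumerate() {
                // One row of the [batch, dim] output per chunk.
                out.push((text.clone(), embeddings.get(i)?));
            }
        }
        Ok(out)
    }
}
```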
So that’s it: our `split_text_and_encode(..)` method accepts a text block, splits it into the desired chunks using the `TextSplitter`, iterates over the chunks in batches and generates embeddings for them.
Time for a simple test case to validate the behaviour of our embedding flow.
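A rough sketch of what such a test could look like; the sample documents below are placeholders, and the real test in the project prints the chunks so we can inspect the splits and overlaps:

```rust
#[cfg(test)]
mod tests {
    use super::*;

    #[test]
    fn split_and_encode() -> anyhow::Result<()> {
        let mut embed = Embed::new()?;

        // One short document (should stay a single chunk) and one long one
        // (should be split into overlapping chunks).
        let short_doc = "A short document that easily fits in a single chunk.".to_string();
        let long_doc =
            "Context-aware splitting tries to break text at semantic boundaries. ".repeat(400);

        for doc in [short_doc, long_doc] {
            let chunks = embed.split_text_and_encode(&doc)?;
            println!("{} chunk(s)", chunks.len());
            for (text, emb) in &chunks {
                println!("shape {:?}:\n{}\n---", emb.shape(), text);
            }
        }

        Ok(())
    }
}
```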
Let’s run this test.
cd src-tauri
cargo test split_and_encode --release -- --nocapture
cd ..
Note: To demonstrate the effect of chunking and splitting, I have changed `pub const SPLIT_SIZE` to `256` instead of `512`.
Results …
Looking closely at the screengrab, you’ll see that the first document is smaller than the test chunk size of 256, so there is no split there. The second document has been split into multiple chunks, but the split is not naive; the splitter has attempted to preserve some context. E.g. the first chunk of the second document is just the document heading, whereas the second and third chunks have been created from the next paragraph. The text highlighted in yellow is the overlap we spoke about earlier; the splitter has added some context overlap between consecutive chunks, which hopefully helps us better preserve the meaning of each chunk.
Next steps
Now that we are ready with our embedding generation, the next logical step would be to save these embeddings for search and retrieval. The class of applications that are used for storing and retrieving embeddings are called Vector Stores or Vector Databases. There are plenty of vector stores out there and we could pick one of them, but where’s the fun in that? This series is about demystifying the RAG pipeline and I believe building our own Vector Store would best serve our purpose. In Part 2 of this series, we build our own vector store.