Part 2: Desktop App for Document QA with RAG - Vector Storage
A DIY style step-by-step guide to building your own cutting-edge GenAI-powered document QA desktop app with RAG. In Part 2 of this series we create our own mini Vector Store.
August 15, 2024 · 21 min · 4443 words
In this series we are building our own local Retrieval Augmented Generation (RAG) desktop application for Document QA.
We got our Embedding flow working along with some smart text-splitting techniques in our previous post. In this post we are building a toy (but working) Vector Database to store, search and retrieve data based on the document embeddings.
A vector store is a specialized database designed to efficiently store, search, and manage high-dimensional vectors, such as embeddings. The primary purpose of a vector store is to enable fast and accurate similarity searches, such as:
Nearest neighbor searches
Approximate nearest neighbor searches
Range searches
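To make the baseline concrete before we optimize anything, here’s a minimal brute-force sketch (a hypothetical helper, not part of our app): score every stored vector against the query and keep the top k. An ANN index exists to beat this exact but O(n·d)-per-query loop at scale.

// Hypothetical helper: exact top-k search by scoring every vector against the query.
// Assumes normalized embeddings, so the dot product works as a similarity measure.
fn top_k_brute_force(query: &[f32], vectors: &[Vec<f32>], k: usize) -> Vec<(usize, f32)> {
    let mut scored = vectors
        .iter()
        .enumerate()
        .map(|(i, v)| {
            let score = query.iter().zip(v.iter()).map(|(a, b)| a * b).sum::<f32>();
            (i, score)
        })
        .collect::<Vec<_>>();
    // Highest similarity first
    scored.sort_unstable_by(|a, b| b.1.total_cmp(&a.1));
    scored.truncate(k);
    scored
}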
Ideally, a vector store or vector database should have the following features:
High-performance indexing: Efficient indexing of high-dimensional vectors for fast search and retrieval.
Scalability: Ability to handle large volumes of vectors and scale horizontally.
Low-latency queries: Fast query response times, even with high-dimensional vectors.
Data management: Support for vector CRUD operations.
Vector stores come in many shapes and forms, from open source to fully managed, from embedded stores to distributed databases. Some well-known battle-tested databases have also started to offer vector storage functionality. I’ll just mention a few common databases I’ve worked with, evaluated, or know of. The decision will ultimately be governed by use case, resources and familiarity.
For specialized vector stores take a look at Faiss, Pinecone, Milvus, Qdrant or Chroma DB. I’ve always found the simplicity of Spotify’s Annoy fascinating. Large battle-tested databases like MongoDB Atlas or Postgres with pgvector also provide vector storage functionality. This list is non-exhaustive.
Note
You might find this article about vector databases helpful.
Now, that brings us to the next BIG decision: which one do we use? We need something small and nimble, yet fast. Since our end goal is to perform local RAG, ideally we’d like to avoid an elaborate setup, Docker containers, or a dedicated server running.
I’d say none of them; let’s build our own bare-minimum vector store. All we need is a reasonably efficient way to look up approximate nearest neighbors from the embeddings we generate, and, hopefully, the process of building one ends up demystifying the fundamentals of efficient vector search for us.
Warning
Our mini vector store is highly experimental and should not be used for production.
Alas, not quite the from-scratch implementation I was gunning for. An obligatory Google search on vector search in Rust led me to this wonderful article by the Fennel AI folks, where they provide a nice, intuitive explanation and a Rust implementation of Approximate Nearest Neighbor search based on Locality Sensitive Hashing.
Recommendation
Please give this article a read to understand the underlying concepts. Here’s the code associated with the article that we are going to be building on.
I’ll just go ahead and modify this implementation to work with Candle Tensor instead of Vec<f32>. We’ll use rayon for parallelization and the dashmap crate for fast concurrent hashmaps, so don’t forget to add them to your dependencies in src-tauri/Cargo.toml.
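For reference, the additions would look something like this (version numbers are indicative, not prescriptive; rand is also needed for the random hyperplane sampling below):

# src-tauri/Cargo.toml (excerpt)
[dependencies]
# .. existing dependencies ..
rayon = "1.10"
dashmap = "6"
rand = "0.8"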
Let’s create a new file src-tauri/src/ann.rs and declare it in our src-tauri/src/lib.rs.
We’ll start by defining the structs to represent the HyperPlane and the inner Nodes of the index.
Because we are using Candle Tensor instead of Vec<f32>, we have done away with the const generic parameter N across the implementation. Candle Tensor also comes with standard tensor ops, making custom implementations of vector add, sub, mul, etc. redundant.
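Here’s a minimal sketch of those types, reconstructed from how they are used in the ANNIndex implementation below; treat the exact shapes and visibility as assumptions:

pub struct HyperPlane {
    coefficients: Tensor,
    constant: f32,
}

impl HyperPlane {
    // Which side of the plane a point falls on is the sign of n·x relative to the
    // plane constant; broadcast_mul lets this work for (dim) and (1, dim) shaped points
    pub fn point_is_above(&self, point: &Tensor) -> candle_core::Result<bool> {
        Ok(self
            .coefficients
            .broadcast_mul(point)?
            .sum_all()?
            .to_scalar::<f32>()?
            >= self.constant)
    }
}

enum Node {
    Inner(Box<InnerNode>),
    Leaf(Box<LeafNode>),
}

struct LeafNode(Vec<usize>);

struct InnerNode {
    hyperplane: HyperPlane,
    left_node: Node,
    right_node: Node,
}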
Let’s define the struct ANNIndex and its methods for indexing and search.
pub struct ANNIndex {
    trees: Vec<Node>,
    ids: Vec<usize>,
    values: Tensor,
}

impl ANNIndex {
    fn build_hyperplane(
        indexes: &[usize],
        all_vecs: &Tensor,
    ) -> candle_core::Result<(HyperPlane, Vec<usize>, Vec<usize>)> {
        // Choosing random indices
        let sample = indexes
            .choose_multiple(&mut rand::thread_rng(), 2)
            .collect::<Vec<_>>();

        // cartesian eq for hyperplane n * (x - x_0) = 0
        // n (normal vector) is the coefs x_1 to x_n
        let (a, b) = (*sample[0], *sample[1]);
        let coefficients = (all_vecs.i(b)? - &all_vecs.i(a)?)?;

        // Calculate the midpoint
        let point_on_plane = ((all_vecs.i(b)? + &all_vecs.i(a)?)? / 2.)?;

        // And build the hyperplane
        let constant = coefficients
            .mul(&point_on_plane)?
            .sum(0)?
            .to_scalar::<f32>()?;
        let hyperplane = HyperPlane {
            coefficients,
            constant,
        };

        // Classify the vectors as being above or below
        let (mut above, mut below) = (vec![], vec![]);
        for &id in indexes.iter() {
            if hyperplane
                .point_is_above(&all_vecs.i(id)?)
                .map_or(false, |d| d)
            {
                above.push(id)
            } else {
                below.push(id)
            };
        }

        Ok((hyperplane, above, below))
    }

    fn build_a_tree(
        max_size: usize,
        indexes: &[usize],
        data: &Tensor,
    ) -> candle_core::Result<Node> {
        // If the number of elements in a tree has reached the configurable `max size of node` then move on
        if indexes.len() <= max_size {
            return Ok(Node::Leaf(Box::new(LeafNode(indexes.to_vec()))));
        }

        // Recursively keep constructing the tree
        let (plane, above, below) = Self::build_hyperplane(indexes, data)?;
        let node_above = Self::build_a_tree(max_size, &above, data)?;
        let node_below = Self::build_a_tree(max_size, &below, data)?;

        Ok(Node::Inner(Box::new(InnerNode {
            hyperplane: plane,
            left_node: node_below,
            right_node: node_above,
        })))
    }

    fn tree_result(query: &Tensor, n: usize, tree: &Node, candidates: &DashSet<usize>) -> usize {
        // take everything in node, if still needed, take from alternate subtree
        match tree {
            Node::Leaf(box_leaf) => {
                let leaf_values = &(box_leaf.0);
                let num_candidates_found = n.min(leaf_values.len());
                leaf_values
                    .iter()
                    .take(num_candidates_found)
                    .for_each(|&c| {
                        candidates.insert(c);
                    });
                num_candidates_found
            }
            Node::Inner(inner) => {
                let above = inner.hyperplane.point_is_above(query).map_or(false, |d| d);
                let (main, backup) = match above {
                    true => (&inner.right_node, &inner.left_node),
                    false => (&inner.left_node, &inner.right_node),
                };

                match Self::tree_result(query, n, main, candidates) {
                    k if k < n => k + Self::tree_result(query, n - k, backup, candidates),
                    k => k,
                }
            }
        }
    }

    /// API for building the index
    pub fn build_index(
        num_trees: usize,
        max_size: usize,
        data: &Tensor,
        vec_ids: &[usize],
    ) -> candle_core::Result<Self> {
        // Trees hold an index into the [unique_vecs] list which is not
        // necessarily its id, if duplicates existed
        let trees = (0..num_trees)
            .map(|_| Self::build_a_tree(max_size, vec_ids, data).unwrap())
            .collect::<Vec<_>>();

        Ok(Self {
            trees,
            ids: vec_ids.to_owned(),
            values: data.to_owned(),
        })
    }

    /// API for search
    pub fn search_approximate(
        &self,
        query: &Tensor,
        top_k: usize,
        cutoff: Option<f32>,
    ) -> candle_core::Result<Vec<(usize, f32)>> {
        let candidates = DashSet::new();

        // Find approximate matches
        self.trees.par_iter().for_each(|tree| {
            Self::tree_result(query, top_k, tree, &candidates);
        });

        // Create lists for return
        let mut approx = vec![];
        let mut idxs = vec![];
        candidates
            .into_iter()
            .filter_map(|i| self.values.get(i).ok().map(|t| (i, t)))
            .for_each(|(idx, t)| {
                approx.push(t);
                idxs.push(idx);
            });

        // Sort by distance: first stack the candidate vectors into a single
        // (num_candidates, dim) tensor, then score them all with one matmul
        let cutoff = cutoff.map_or(0., |c| c);
        let approx = Tensor::stack(&approx, 0)?;
        let mut res = idxs
            .into_iter()
            .zip(
                query
                    .matmul(&approx.t()?)?
                    .squeeze(0)?
                    .to_vec1::<f32>()?,
            )
            .filter_map(|(idx, d)| {
                if d >= cutoff {
                    Some((self.ids[idx], d))
                } else {
                    None
                }
            })
            .collect::<Vec<_>>();
        res.par_sort_unstable_by(|a, b| b.1.total_cmp(&a.1));

        Ok(res[0..top_k.min(res.len())].to_vec())
    }
}
We’ve introduced a new control parameter, cutoff, which, if passed, drops candidates scoring below the given value.
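To make the API concrete, here’s a hypothetical usage sketch; ann_demo, the dimensions and the tree parameters are all made up for illustration:

use candle_core::{Device, IndexOp, Tensor};

fn ann_demo() -> candle_core::Result<()> {
    // 100 random 8-dimensional vectors stand in for real embeddings
    let data = Tensor::rand(-1_f32, 1., (100, 8), &Device::Cpu)?;
    let ids = (0..100_usize).collect::<Vec<_>>();

    // 4 trees, at most 8 vectors per leaf
    let index = ANNIndex::build_index(4, 8, &data, &ids)?;

    // Queries are shaped (1, dim) since search matmuls them against the stacked candidates
    let query = data.i(42)?.unsqueeze(0)?;
    let hits = index.search_approximate(&query, 5, Some(0.))?;
    println!("{hits:?}"); // Vec<(id, score)>, best match first

    Ok(())
}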
Our storage strategy is simple: we’ll maintain everything we need to index and search the data in three files.
A .store file for the serialized struct Store; this acts like a master lookup and serves as the primary entry point for our search and indexing. I’ll use bincode for this, but it could be any serialization format.
An embed.data file would store the embeddings Tensor as safetensors. On application init, the ANNIndex would be built from this data.
A text.data file would save the chunked text corresponding to the tensors. We’ll maintain the starting position and length of each text chunk in struct Store; that way, when we want to find a particular chunk, we can seek to its start position and read length bytes to retrieve it, as sketched just after this list.
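Here’s that retrieval mechanic as a tiny standalone sketch (read_chunk is a hypothetical helper, not the app’s API): one seek to the recorded offset, one exact read of the recorded length.

use std::fs::File;
use std::io::{BufReader, Read, Seek, SeekFrom};

// Hypothetical helper illustrating the lookup described above
fn read_chunk(reader: &mut BufReader<File>, start: u64, length: usize) -> std::io::Result<String> {
    reader.seek(SeekFrom::Start(start))?;
    let mut buf = vec![0_u8; length];
    reader.read_exact(&mut buf)?;
    // The chunks were written as UTF-8 text
    String::from_utf8(buf).map_err(|e| std::io::Error::new(std::io::ErrorKind::InvalidData, e))
}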
Let’s create a new file src-tauri/src/store.rs and declare it in src-tauri/src/lib.rs.
// .. imports omitted ..

const EMBED_FILE: &str = "embed.data";
const TEXT_FILE: &str = "text.data";
const STORE_FILE: &str = ".store";
const EMBED_TENSOR_NAME: &str = "embed";

/// The store would represent data that is indexed.
/// Upon initiation it'd read (or create) a .store, text.data, embed.data file
/// In `text.data` file we'll maintain bytes of text split into embedding chunks. The start index byte, the length of the chunk and some more metadata will be maintained
/// In `embed.data` file we'll maintain byte representations of the tensor, one per each segment.
/// `.store` file will maintain a bincode serialized representation of the `struct Store`
#[derive(Serialize, Deserialize, Default)]
pub struct Store {
    data: Vec<Data>,
    dir: PathBuf,
    text_size: usize,
    #[serde(skip)]
    data_file: Option<Arc<Mutex<BufReader<File>>>>,
    #[serde(skip)]
    index: Option<ANNIndex>,
    #[serde(skip)]
    bm25: Option<SearchEngine<usize>>,
}

/// In `struct Data` we maintain the source file of the data along with the chunk of text being indexed
#[derive(Serialize, Deserialize, Debug)]
pub struct Data {
    file: FileKind,
    start: usize,
    length: usize,
    deleted: bool,
    indexed: bool,
}

impl Data {
    pub fn file(&self) -> &FileKind {
        &self.file
    }
}

#[derive(Serialize, Deserialize, Clone, Debug, PartialEq)]
pub enum FileKind {
    Html(PathBuf),
    Pdf((PathBuf, usize)),
    Text(PathBuf),
}
Ignore the bm25 field for now; we’ll revisit it when we work on the retrieval strategies.
Time to code up the APIs to load, store and create the indices.
// code omitted

impl Store {
    /// Loads data and initializes indexes if files are present in the given directory or creates them
    pub fn load_from_file(
        dir: &Path,
        num_trees: Option<usize>,
        max_size: Option<usize>,
    ) -> Result<Self> {
        let storefile = dir.join(STORE_FILE);
        let store = if storefile.is_file() {
            let mut store = fs::File::open(storefile)?;
            let mut buf = vec![];
            store.read_to_end(&mut buf)?;

            let mut store = bincode::deserialize::<Store>(&buf)?;
            store.data_file = Some(Arc::new(Mutex::new(BufReader::new(File::open(
                dir.join(TEXT_FILE),
            )?))));

            if let Ok(m) = dir.join(EMBED_FILE).metadata() {
                if m.len() > 0 {
                    store.build_index(num_trees.map_or(16, |n| n), max_size.map_or(16, |sz| sz))?;
                }
            }

            store
        } else {
            fs::File::create_new(storefile)?;
            fs::File::create_new(dir.join(EMBED_FILE))?;
            fs::File::create_new(dir.join(TEXT_FILE))?;

            let mut store = Self {
                dir: dir.to_path_buf(),
                ..Default::default()
            };
            store.data_file = Some(Arc::new(Mutex::new(BufReader::new(File::open(
                dir.join(TEXT_FILE),
            )?))));
            store.save()?;

            store
        };

        Ok(store)
    }

    /// Accepts an iterator of (text, embeddings of text, FileKind) and adds it to the indexed list
    pub fn insert(
        &mut self,
        text_file: &mut File,
        data: &mut impl Iterator<Item = (String, Tensor, FileKind)>,
    ) -> Result<()> {
        // Initialize with a large capacity to avoid frequent re-allocation
        let mut txt_data = Vec::with_capacity(4096);
        let mut tensors = Vec::with_capacity(64);

        while let Some((txt, tensor, file)) = data.by_ref().next() {
            let txt = txt.as_bytes();
            let data = Data {
                file,
                start: self.text_size,
                length: txt.len(),
                deleted: false,
                indexed: true,
            };

            tensors.push(tensor);
            // Append in place instead of re-allocating a new Vec per chunk
            txt_data.extend_from_slice(txt);
            self.data.push(data);
            self.text_size += txt.len();
        }

        Tensor::stack(&tensors, 0)?
            .save_safetensors(EMBED_TENSOR_NAME, self.dir.join(EMBED_FILE))?;
        text_file.write_all(&txt_data)?;
        text_file.sync_all()?;

        self.save()?;

        Ok(())
    }

    /// Serializes and saves the core struct
    pub fn save(&self) -> Result<()> {
        let storebytes = bincode::serialize(&self)?;
        std::fs::write(self.dir.join(STORE_FILE), &storebytes)?;

        Ok(())
    }

    // We break the index builders into separate functions so that the ANN and BM25 indexes can be built in parallel later
    fn build_index(&mut self, num_trees: usize, max_size: usize) -> Result<()> {
        let ann = Self::build_ann(
            &self.dir.join(EMBED_FILE),
            num_trees,
            max_size,
            self.data.len(),
        )?;
        self.index = Some(ann);

        Ok(())
    }

    // Builds the ANN index
    fn build_ann(
        file: &Path,
        num_trees: usize,
        max_size: usize,
        data_len: usize,
    ) -> Result<ANNIndex> {
        let tensors = unsafe {
            safetensors::MmapedSafetensors::new(file)?
                .load(EMBED_TENSOR_NAME, &candle_core::Device::Cpu)?
        };

        let ann = ANNIndex::build_index(
            num_trees,
            max_size,
            &tensors,
            &(0..data_len).collect::<Vec<_>>(),
        )?;

        Ok(ann)
    }
}
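Two small helpers that the test and search below rely on aren’t shown in the listing above: files(), which hands back writable handles to the data files, and chunk(), essentially the read_chunk sketch from earlier as a Store method. Their real definitions aren’t in this post, so treat this as an assumed reconstruction (using anyhow-style errors):

use std::fs::{File, OpenOptions};
use std::io::{Read, Seek, SeekFrom};
use anyhow::anyhow;

impl Store {
    // Assumed helper: append-mode handles to `text.data` and `embed.data`
    pub fn files(&self) -> Result<(File, File)> {
        let text_file = OpenOptions::new()
            .append(true)
            .open(self.dir.join(TEXT_FILE))?;
        let embed_file = OpenOptions::new()
            .append(true)
            .open(self.dir.join(EMBED_FILE))?;

        Ok((text_file, embed_file))
    }

    // Assumed helper: recover a chunk's text with a seek and an exact read
    pub fn chunk(&self, data: &Data) -> Result<String> {
        let mut df = self
            .data_file
            .as_ref()
            .ok_or_else(|| anyhow!("store not initialized"))?
            .lock()
            .map_err(|_| anyhow!("couldn't lock the data file"))?;
        df.seek(SeekFrom::Start(data.start as u64))?;

        let mut buf = vec![0_u8; data.length];
        df.read_exact(&mut buf)?;

        Ok(String::from_utf8(buf)?)
    }
}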
That should suffice, let’s verify that our storage behaves as expected!
#[cfg(test)]
mod tests {
    // imports omitted

    #[derive(Deserialize)]
    struct WikiNews {
        title: String,
        text: String,
    }

    #[test]
    fn storage_init() -> Result<()> {
        let data = {
            let data = std::fs::read_to_string("../test-data/wiki-smoll.json")?;
            serde_json::from_str::<Vec<WikiNews>>(&data)?
        };

        let tdir = tempdir::TempDir::new("storetest")?;
        let mut store = Store::load_from_file(tdir.path(), None, None)?;
        let mut embed = Embed::new(Path::new("../models"))?;

        let mut chunks = data
            .chunks(Embed::STELLA_MAX_BATCH)
            .take(32)
            .filter_map(|c| {
                let batch = c
                    .par_iter()
                    .map(|t| format!("## {}\n{}", t.title, t.text))
                    .collect::<Vec<_>>();

                if let Ok(e) = embed.embeddings(&batch) {
                    let data = batch
                        .par_iter()
                        .enumerate()
                        .map(|(i, t)| {
                            (
                                t.to_string(),
                                e.i(i).unwrap(),
                                FileKind::Text(
                                    Path::new("../test-data/wiki-smoll.json").to_path_buf(),
                                ),
                            )
                        })
                        .collect::<Vec<_>>();
                    Some(data)
                } else {
                    None
                }
            })
            .flatten();

        let (mut text_file, _) = store.files()?;
        store.insert(&mut text_file, &mut chunks)?;

        // Ok, now let's test we have saved it right or not
        for (i, d) in store.data.iter().enumerate() {
            let mut df = store.data_file.as_ref().unwrap().lock().unwrap();
            let mut buf = vec![0_u8; d.length];
            df.seek(std::io::SeekFrom::Start(d.start as u64))?;
            df.read_exact(&mut buf)?;

            let str = std::str::from_utf8(&buf)?;
            assert_eq!(str, format!("## {}\n{}", data[i].title, data[i].text));
        }

        Ok(())
    }
}
Note
This will take a few minutes. If you are in a hurry, tweak this test to work on a smaller dataset!
On a side note, the safetensors save (save_safetensors()) seems to be the slowest part of this block!
cd src-tauri
cargo test storage_init --release -- --nocapture
cd ..
With the fundamentals of storage functional, let’s focus our attention on exposing the method Store::search( .. ).
// code omitted

impl Store {
    // code omitted

    /// API for search into the index
    pub fn search(
        &self,
        qry: &[Tensor],
        qry_str: &[String],
        top_k: usize,
        ann_cutoff: Option<f32>,
        with_bm25: bool, // ignore this for now
    ) -> Result<Vec<(usize, &Data, String, f32)>> {
        // Giving 75% weightage to the ANN search and 25% to BM25 search
        const ALPHA: f32 = 0.75;

        // Let's get the ANN scores and BM25 scores in parallel
        let (ann, _) = rayon::join(
            || {
                let ann = DashMap::new();
                if let Some(index) = &self.index {
                    qry.par_iter().for_each(|q| {
                        let res = match index.search_approximate(q, top_k * 4, ann_cutoff) {
                            Ok(d) => d,
                            Err(e) => {
                                eprintln!("Error in search_approximate: {e}");
                                return;
                            }
                        };

                        res.iter().for_each(|(idx, score)| {
                            let idx = *idx;
                            if let Some(d) = self.data.get(idx) {
                                let txt = if let Ok(c) = self.chunk(d) {
                                    c
                                } else {
                                    return;
                                };

                                let mut e = ann.entry(idx).or_insert((d, txt, *score));
                                if e.2 < *score {
                                    e.2 = *score;
                                }
                            }
                        });
                    });
                }
                ann
            },
            || {
                // We are just scoping in for some retrieval strategy to be discussed and built upon in the future here!
                None::<()> // explicit type keeps the closure's return type inferable until BM25 lands here
            },
        );

        // Now, we have the highest ANN score for the set of queries
        // To normalize the ANN scores, let's go ahead and get the max/ min
        let mut ann_max = 0_f32;
        let mut ann_min = f32::MAX;
        ann.iter().for_each(|j| {
            ann_max = j.2.max(ann_max);
            ann_min = j.2.min(ann_min);
        });
        let ann_div = ann_max - ann_min;

        // Ok, time to normalize our scores and create a combined score for each of them
        let mut combined = ann
            .par_iter()
            .map(|j| {
                let id = *j.key();
                let ann_score = 1. - (j.2 - ann_min) / ann_div;
                // the hardcoded `0.` represents another set of scores to be discussed and computed later
                let combined = ALPHA * ann_score + (1. - ALPHA) * 0.;

                (id, j.0, j.1.clone(), combined)
            })
            .collect::<Vec<_>>();
        combined.par_sort_unstable_by(|a, b| b.3.total_cmp(&a.3));

        Ok(combined[0..top_k.min(combined.len())].to_vec())
    }
}
Note
The code above contains some odd “visit this later” placeholders! Sorry about that; they are part of the fusion retrieval search strategy that we’ll implement later in the series.
Let’s modify our test to create the indexed storage, load from the file and search through it.
#[cfg(test)]
mod tests {
    // imports omitted

    #[derive(Deserialize)]
    struct WikiNews {
        title: String,
        text: String,
    }

    #[test]
    fn storage_init_load_search() -> Result<()> {
        // renamed `storage_init` to `storage_init_load_search`
        let data = {
            let data = std::fs::read_to_string("../test-data/wiki-x-small.json")?;
            serde_json::from_str::<Vec<WikiNews>>(&data)?
        };

        let tdir = tempdir::TempDir::new("storetest")?;
        let mut store = Store::load_from_file(tdir.path(), None, None)?;
        let mut embed = Embed::new(Path::new("../models"))?;

        let mut chunks = data
            .chunks(Embed::STELLA_MAX_BATCH)
            .take(32)
            .filter_map(|c| {
                let batch = c
                    .par_iter()
                    .map(|t| format!("## {}\n{}", t.title, t.text))
                    .collect::<Vec<_>>();

                if let Ok(e) = embed.embeddings(&batch) {
                    let data = batch
                        .par_iter()
                        .enumerate()
                        .map(|(i, t)| {
                            (
                                t.to_string(),
                                e.i(i).unwrap(),
                                // point the metadata at the file we actually read
                                FileKind::Text(
                                    Path::new("../test-data/wiki-x-small.json").to_path_buf(),
                                ),
                            )
                        })
                        .collect::<Vec<_>>();
                    Some(data)
                } else {
                    None
                }
            })
            .flatten();

        let (mut text_file, _) = store.files()?;
        store.insert(&mut text_file, &mut chunks)?;

        // Ok, now let's test we have saved it right or not
        for (i, d) in store.data.iter().enumerate() {
            let mut df = store.data_file.as_ref().unwrap().lock().unwrap();
            let mut buf = vec![0_u8; d.length];
            df.seek(std::io::SeekFrom::Start(d.start as u64))?;
            df.read_exact(&mut buf)?;

            let str = std::str::from_utf8(&buf)?;
            assert_eq!(str, format!("## {}\n{}", data[i].title, data[i].text));
        }

        // Now, let's load from the file
        let store = Store::load_from_file(tdir.path(), Some(16), Some(16))?;
        let qry = embed
            .query(&[
                "What are the latest news about Iraq?".to_string(),
                "What is Firefox?".to_string(),
            ])?
            .to_device(&candle_core::Device::Cpu)?;

        let b = qry.dims2()?.0;
        let qry = (0..b)
            .filter_map(|i| qry.get(i).ok().and_then(|t| t.unsqueeze(0).ok()))
            .collect::<Vec<_>>();

        let res = store.search(&qry[..], &[], 8, Some(0.36), false)?;
        println!("Response length: {}", res.len());
        res.iter().for_each(|(idx, _, txt, score)| {
            println!("Match[{idx}] score[{score}] ----------\n{txt}\n------------------\n")
        });

        Ok(())
    }
}
Note
This will take some time to run. From what I can tell, saving the Candle tensors as safetensors seems to be the slowest block in this flow. Not a lot to be done about that for now!
The vectors used to create the hyperplanes are selected at random, so this test will yield different results on different runs for a small dataset!
cd src-tauri
cargo test storage_init_load_search --release -- --nocapture
cd ..
Let’s look at the output!
running 1 test
Response length: 4
Match score[0.41619474] ----------
## Bulk of Iraqi debt to Paris Club to be forgiven
The nineteen member nations of the Paris Club have agreed to forgive 80 percent of Iraq's debt, reports the Reuters news service. The plan will reduce Iraq's total debt to the member nations to $7.8 billion, from the original $38.9 billion, over a period of four years.
# ... omitted for readability ...
The plan is the culmination of a trans-Atlantic struggle over the amount of Iraqi debt to be forgiven, the United States pushing for a 90 to 95 percent reduction while France argued for much less.
------------------
Match score[0.4155748] ----------
## Suicide car bomb attack in Baghdad
On the one year anniversary of Saddam Hussein's capture, 13 Iraqis were killed according to AP in a suicide car bomb attack by an al-Qaida-linked terrorist. The attack took place about 9 AM Monday, when many Iraquis were arriving to work, outside the Green Zone which is home to the Iraqi interim government and several embassies. The region is under severe protection.
Authorities said the suicide bomber detonated the car while it was in line at a checkpoint. The explosion was very loud and it could be heard throughout the city.
------------------
Match score[0.40933716] ----------
## Lycos Europe ends its anti-spam campaign
Lycos Europe has ended its anti-spam operation: "Make Love Not Spam." A company spokesperson said the objective of the time-limited campaign was to raise people's awareness. The reasons why it ended the campaign was variously reported and speculated in media. The operation, while fairly popular, suffered unexpected troubles and drew criticism from security experts and others from the start.
# ... omitted for readability ...
Next, one of the targeted sites redirected all traffic to the Lycos' server, making Lycos itself a target. The company had maintained that its server was immune from the attack. Lycos stopped distributing the program on December 3, 2004 and asked clients to "stay tuned." The company later ended the program.
On December 6, F-Secure reported a virus email disguised as the anti-spam screensaver. When its attachment (a zip file) is opened, it self-extracts and installs a "Trojan horse" --harmful program disguised as legitimate software. The Trojan horse was set up to monitor keystrokes in order to steal passwords, bank account numbers and other important information.
Lycos' software had been downloaded more than 100,000 times by the end of the campaign.
------------------
Match score[0.4010714] ----------
## Saddam Hussein profited roughly 1B by taking funds from UN program
Investigators said that Saddam Hussein diverted money from the Oil-for-Food Program to pay millions of dollars to families of suicide bombers from the West Bank and Gaza Strip who carried out attacks on Israeli civilians.
Paul Volcker, a former American official investigating the diverted funds scandal, has taken some heat from advocates demanding that he haul United Nations personnel before the US Congress. His reason for not subjecting them to this degree of open scrutiny is that it would have the perverse effect of pushing them into refusing to cooperate with the investigation at all. He plans to release documentary evidence early next year, when his investigation is complete.
------------------
test store::tests::storage_init_load_search ... ok
test result: ok. 1 passed; 0 failed; 0 ignored; 0 measured; 3 filtered out; finished in 16.84s
Running unittests src/main.rs (target/release/deps/retaugen-8b0ea0594349d0df)
running 0 tests
test result: ok. 0 passed; 0 failed; 0 ignored; 0 measured; 0 filtered out; finished in 0.00s
That looks reasonable 🎉🥳🎉 … the first, second and fourth results look relevant. The third one is off! We’ll fix these, hopefully, when we incorporate various retrieval strategies in Part 5 of this series.
That was a heavy ~4500 words! We created our Mini Vector Store from scratch, indexed some data into it, and searched it for relevant results. We also figured out a way of persisting the data on disk. We still have a lot to cover before we have a working Document QA application.
Next, we need a way of extracting text from .pdf and .txt files, over to the next post!