Part 5: Desktop App for Document QA with RAG - Techniques
A DIY-style, step-by-step guide to building your own cutting-edge GenAI-powered document QA desktop app with RAG. In this fifth and final instalment of the series we evaluate and implement some RAG techniques for better search results.
September 10, 2024 · 34 min · 7230 words
In this blog series on building a Desktop Document QA app, we’ve tackled several key challenges: crafting document embeddings, building a vector store inspired by Spotify’s ANNOY, and developing a document layout analysis and extraction pipeline powered by a Detectron2 model. With indexing and basic QA flows now complete, we’re ready for the next phase of our journey - evaluating and implementing some cutting-edge techniques for better RAG.
Taking inspiration from Nir Diamant’s comprehensive RAG Techniques reference, we’ll enhance our pipeline by implementing select methods from his catalog.
Document Relevancy Filtering: By using a binary relevancy score generated by a language model, only the most relevant documents are passed on to the answer generation phase, reducing noise and improving the quality of the final answer.
Hallucination Check: Before finalizing the answer, the system checks for hallucinations by verifying that the generated content is fully supported by the retrieved documents.
Snippet Highlighting: This feature enhances transparency by showing the exact segments from the retrieved documents that contributed to the final answer.
Let’s implement Document Relevancy Filtering, Snippet Highlighting and a form of Hallucination Check to ensure the Generator finds evidence from the context before answering.
Tradeoffs
For production systems, consider the tradeoffs between accuracy and speed. Some of these techniques effectively mean making more LLM inference calls, which will slow down retrieval and increase runtime resource consumption.
For Document Relevancy Filtering we’ll add a method to our struct Generator where we prompt LLaMA to return the ids of the relevant text sections. This is akin to a binary filter: we ask the LLM to return only the passages that matter while ignoring the rest.
// code omitted ..
#[derive(Debug, Deserialize)]
pub struct Relevant {
    relevant: Vec<DocRelevance>,
}

#[derive(Debug, Deserialize)]
pub struct DocRelevance {
    id: usize,
    score: f32,
}

impl Relevant {
    fn to_list(&self) -> Vec<(usize, f32)> {
        self.relevant
            .iter()
            .map(|r| (r.id, r.score))
            .collect::<Vec<_>>()
    }
}

impl Generator {
    /// Given a set of queries and a set of documents, return a list of indices that are relevant to the queries
    pub fn find_relevant(
        &mut self,
        query: &[String],
        docs: &[(usize, String)],
    ) -> Result<Vec<(usize, f32)>> {
        let docfmt = docs
            .iter()
            .map(|(idx, txt)| format!("Id: {idx}\n{txt}\n-------------"))
            .collect::<Vec<_>>()
            .join("\n");

        let prompt = format!(
            "<|start_header_id|>system<|end_header_id|>
You are an intelligent and diligent AI who analyses text documents to figure out if a particular document contains relevant information for answering a set of queries. You must follow the given requirements while analysing and scoring the documents for your answer.<|eot_id|><|start_header_id|>user<|end_header_id|>
Documents:
```
{}```
Queries:
```
- {}```
Task:
Identify the ids of documents that are relevant for generating answers to the given queries and rate them in a scale of 1-10 where a score of 10 is most relevant.
Requirements:
- Only include ids of documents containing relevant information.
- If no documents are relevant the field \"relevant\" must be an empty array.
- Do not write any note, introduction, summary or justifications.
- Your answer must be a valid JSON of the following Schema.
Schema:
{{\"relevant\": Array<{{\"id\": numeric id, \"score\": numeric score}}>
}}<|eot_id|><|start_header_id|>assistant<|end_header_id|>
{{\"relevant\": [",
            docfmt,
            query.join("\n- ")
        );
        // println!("Relevance prompt:\n{prompt}");

        let tk = self.generate(&prompt)?;

        match serde_json::from_str::<Relevant>(format!("{{\n\t\"relevant\": [{tk}").as_str()) {
            Ok(d) => Ok(d.to_list()),
            Err(e) => {
                println!("Generator::find_relevant: error while deserializing: {e:?}\n{tk:?}\n");
                Err(anyhow!(e))
            }
        }
    }
}
Note that we are getting the model to generate scores in a range of 1-10 for each document chunk; this will come in handy when we implement Intelligent Reranking later.
In our previous post we had already kept room for evidence in our answer prompt; time to use that for our Snippet Highlighting implementation.
We added a new struct Evidence to hold the index of the source text chunk along with the supporting text picked up during the generation pass. Inspecting our answer prompt in the method answer( .. ), you’ll notice that we have already been asking the model to pick out the evidence, so no changes there.
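For reference, the Generator-side Evidence is roughly the shape sketched below - an index into the retrieved chunks plus the supporting text. The actual definition lives in the code from the previous post, so treat this as an approximation.

// An approximation of the Generator-side `Evidence` described above;
// the real definition is in the Generator module from the previous post.
#[derive(Debug, Deserialize)]
pub struct Evidence {
    index: usize,
    text: String,
}

impl Evidence {
    /// Index of the source text chunk this evidence was lifted from
    pub fn index(&self) -> usize {
        self.index
    }

    /// The supporting text quoted by the model
    pub fn text(&self) -> &str {
        &self.text
    }
}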
This approach already tackles hallucinations to some degree; for a more complete implementation you’ll probably need to pass the response through a separate generation flow that specifically checks the validity of the generated answer.
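To make that concrete, here is a minimal sketch of what such a verification pass could look like: a hypothetical Generator::check_grounding( .. ) method that re-prompts the model to verify the answer against the retrieved context. The method name, the prompt and the Grounding struct are illustrative assumptions; only the self.generate( .. ) call mirrors the existing code.

// A sketch, not part of the actual codebase: a hypothetical grounding-check pass
// that follows the same prompt-and-parse pattern as the other Generator methods.
#[derive(Debug, Deserialize)]
pub struct Grounding {
    supported: bool,
    unsupported_claims: Vec<String>,
}

impl Grounding {
    pub fn is_supported(&self) -> bool {
        self.supported
    }

    pub fn unsupported_claims(&self) -> &[String] {
        &self.unsupported_claims[..]
    }
}

impl Generator {
    /// Hypothetical: asks the model whether `answer` is fully supported by `context`
    pub fn check_grounding(&mut self, answer: &str, context: &str) -> Result<Grounding> {
        let prompt = format!(
            "<|start_header_id|>system<|end_header_id|>
You are a strict fact checker. Given a context and an answer, decide if every claim in the answer is supported by the context.<|eot_id|><|start_header_id|>user<|end_header_id|>
Context:
{context}
Answer:
{answer}
Respond with a valid JSON of the Schema:
{{\"supported\": boolean, \"unsupported_claims\": Array<string>}}<|eot_id|><|start_header_id|>assistant<|end_header_id|>
{{\"supported\": "
        );

        let tk = self.generate(&prompt)?;

        serde_json::from_str::<Grounding>(format!("{{\n\"supported\": {tk}").as_str())
            .map_err(|e| anyhow!(e))
    }
}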
Finding the optimal chunk size isn’t straightforward - it depends on your embedding model, vector store capabilities, and the nature of your documents. While experimentation is key, our choice is guided by Stella_en_1.5B_v5’s training context of 512 tokens.
Breaking the text down into concise, complete, meaningful sentences allows for better control and handling of specific queries (especially when extracting knowledge).
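As a rough illustration (the actual splitting lives in the indexing pipeline from the earlier parts of this series), a sentence-aware splitter with overlap could look like the sketch below. It approximates token counts with whitespace-separated words; the real pipeline should count tokens with the embedding model's tokenizer.

// A simplified, standalone sketch of sentence-aware chunking with overlap.
// "Tokens" are approximated by whitespace-separated words here, so chunks
// stay only approximately under `max_tokens`.
fn chunk_by_sentences(text: &str, max_tokens: usize, overlap: usize) -> Vec<String> {
    // Naive sentence split on terminal punctuation
    let sentences: Vec<&str> = text
        .split_inclusive(|c| c == '.' || c == '!' || c == '?')
        .map(str::trim)
        .filter(|s| !s.is_empty())
        .collect();

    let mut chunks = Vec::new();
    let mut current: Vec<&str> = Vec::new();
    let mut current_len = 0;

    for sentence in sentences {
        let len = sentence.split_whitespace().count();
        if current_len + len > max_tokens && !current.is_empty() {
            chunks.push(current.join(" "));

            // Carry the tail of the previous chunk forward as overlap
            let mut carried = 0;
            let mut tail = Vec::new();
            for s in current.iter().rev() {
                carried += s.split_whitespace().count();
                tail.push(*s);
                if carried >= overlap {
                    break;
                }
            }
            tail.reverse();
            current = tail;
            current_len = carried;
        }
        current.push(sentence);
        current_len += len;
    }

    if !current.is_empty() {
        chunks.push(current.join(" "));
    }

    chunks
}

With our 512-token budget and the 1/4 overlap factor used elsewhere, you’d call it roughly as chunk_by_sentences(text, 512, 128).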
The idea is to modify and/or expand a query to improve retrieval effectiveness: rewriting the original query, step-back prompting to generate broader queries for a more holistic retrieval, and sub-query decomposition to break a complex query into simpler sub-queries.
Note
We won’t always need to employ ALL of these techniques; the choice should be governed by the problem at hand. Consider the tradeoffs and experiment with multiple techniques to figure out what works best!
Let’s start with step-back prompting and extend to sub-query decomposition if and when we need to. To do this, we add a method to struct Generator to preprocess the user’s input query.
// .. code omitted
/// A struct to hold `sub queries` and a `topic`
#[derive(Debug, Deserialize)]
pub struct QueryMore {
    #[serde(skip)]
    src: String,
    #[serde(rename = "sub_queries")]
    more: Vec<String>,
    topic: String,
}

impl QueryMore {
    pub fn source(&self) -> &str {
        &self.src
    }

    pub fn sub_queries(&self) -> &[String] {
        &self.more[..]
    }

    pub fn topic(&self) -> &str {
        &self.topic
    }

    pub fn queries(&self) -> Vec<String> {
        [&[self.source().to_string()], self.sub_queries()].concat()
    }
}

impl Generator {
    /// Preprocesses a query to generate `topic` and supplemental queries for `Fusion Retrieval`
    pub fn query_preproc(&mut self, query: &str, num_sub_qry: usize) -> Result<QueryMore> {
        let prompt = format!(
            "<|start_header_id|>system<|end_header_id|>
You are a smart and intelligent AI assistant generating sub-queries and a topic for a Fusion Retrieval system based on a given source query. You always adhere to the given requirements.<|eot_id|><|start_header_id|>user<|end_header_id|>
Given a source query that may require additional context or specific information, generate relevant sub-queries to retrieve more accurate results. Identify a word or a very short phrase that represents the topic of the query.
Source Query:
{query}Generate {num_sub_qry} relevant sub-queries that:
- Are closely related to the source query
- Can be used to retrieve additional context or specific information
- Are concise and clear
Requirements:
- Sub-queries should not repeat the source query
- Sub-queries should be relevant to the source query's intent, purpose and context
- use natural language for sub queries
- your answer should be a valid json of the following schema.
Schema:
{{ sub_queries: Array<string>,
topic: string
}}Answer must be a valid json.<|eot_id|><|start_header_id|>assistant<|end_header_id|>
{{\"sub_queries\": [\""
        );

        let tk = self.generate(&prompt)?;

        let mut res =
            serde_json::from_str::<QueryMore>(format!("{{\n\"sub_queries\": [\"{tk}").as_str())?;
        res.src = query.to_string();

        Ok(res)
    }
}
Todo: write a test!
Test query_preproc( .. ) and edit the prompt if need be!
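If you want a starting point, the sketch below shows the kind of assertions such a test could make. Constructing the Generator (model weights, device, etc.) depends on your setup, so it's written as a helper that a real #[test] would call with an already-loaded generator.

// A rough test sketch: a helper a real #[test] would call with a loaded Generator.
#[cfg(test)]
#[allow(dead_code)]
fn assert_query_expansion(gen: &mut Generator) -> Result<()> {
    let q = "What are the impacts of climate change on the environment?";
    let res = gen.query_preproc(q, 4)?;

    // The source query must be preserved and expanded with sub-queries and a topic
    assert_eq!(res.source(), q);
    assert!(!res.sub_queries().is_empty());
    assert!(!res.topic().is_empty());

    println!("{res:#?}");
    Ok(())
}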
The following are the results from my test runs.
// calling method `fn sub_query(
// "What are the impacts of climate change on the environment?",
// 4
//)`
QueryMore {
    src: "What are the impacts of climate change on the environment?",
    more: [
        "What are the effects of climate change on biodiversity?",
        "How does climate change affect sea levels?",
        "What are the economic impacts of climate change?",
        "How does climate change impact human migration?",
    ],
    topic: "climate change",
}

// calling method `fn sub_query(
//     "What are the latest news about Iraq?",
//     4
// )`
QueryMore {
    src: "What are the latest news about Iraq?",
    more: [
        "Iraqi government updates",
        "Latest news on ISIS in Iraq",
        "Iraq news and current events",
        "Humanitarian situation in Iraq",
    ],
    topic: "Iraq",
}
Neat … our base query has been expanded into related queries, which is very likely to yield better results!
Here we retrieve the relevant chunk along with its neighbors to provide better context to the Generator. We already implemented a flavor of this when we used the overlap parameter during text splitting, but that was while generating the embeddings. Let’s extend the concept so that the final context contains k adjacent chunks.
We need a way of getting adjacent text blocks in the same file, but because of the overlap we’ll also need to remove the duplicated text between two chunks.
/// The end text of `prev` would be common with the beginning of `current`
/// E.g.
/// prev: Hello, how are you? Life is good!
/// current: Life is good! The act of creation keeps us busy!
// The strategy is simple: we pick the midpoint of `current` and keep checking backwards whether `prev` ends with that text
// This will be pretty efficient because with a max token size of 512 and a known overlap factor which is 1/4th of the token size
// this would yield reasonable results
pub fn dedup_text(prev: &str, current: &str) -> Result<String> {
    let cur = current.as_bytes();
    let prv = prev.as_bytes();

    let mut pointer = cur.len() / 2;
    while pointer > 0 {
        if prv.ends_with(&cur[0..pointer]) {
            break;
        }
        pointer -= 1;
    }

    Ok(std::str::from_utf8(&prv[0..prv.len() - pointer])?.to_string())
}
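A quick sanity check with the example from the doc comment confirms that the overlap gets trimmed from the end of prev:

#[cfg(test)]
mod dedup_tests {
    use super::*;

    #[test]
    fn removes_overlap_from_prev() -> Result<()> {
        let prev = "Hello, how are you? Life is good!";
        let current = "Life is good! The act of creation keeps us busy!";

        // The overlapping "Life is good!" should be trimmed from the end of `prev`
        assert_eq!(dedup_text(prev, current)?, "Hello, how are you? ");
        Ok(())
    }
}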
Now, a method to struct Store to return k adjacent chunks before and after the selected index.
impl Store {
    // code omitted ..

    /// Given an index `idx` returns `k` adjacent chunks before and after the index
    /// Returns k text blocks before with overlap removed, the current text with overlap removed and k text blocks after, again overlap removed
    pub fn with_k_adjacent(
        &self,
        idx: usize,
        k: usize,
    ) -> Result<(Vec<String>, String, Vec<String>)> {
        // Let's collect all indices that need to be fetched
        // We have to ensure the indices are in the SAME source file
        let start = idx.saturating_sub(k);
        let end = (idx + k + 1).min(self.data.len());

        let trg_data = if let Some(d) = self.data.get(idx) {
            d
        } else {
            eprintln!("Nothing found for index {idx}. Corrupt store!");
            return Err(anyhow!("corrupt store!"));
        };

        let trg_src = match &trg_data.file {
            FileKind::Text(p) => p.as_path(),
            FileKind::Pdf((p, _)) => p.as_path(),
            FileKind::Html(p) => p.as_path(),
        };

        let mut chunks: Vec<(String, usize)> = Vec::with_capacity(end - start);

        (start..end).for_each(|index| {
            let data = if index == idx {
                trg_data
            } else if let Some(d) = self.data.get(index) {
                d
            } else {
                eprintln!("Nothing found for data point {index}");
                return;
            };

            let src = match &data.file {
                FileKind::Text(p) => p.as_path(),
                FileKind::Pdf((p, _)) => p.as_path(),
                FileKind::Html(p) => p.as_path(),
            };

            // Not neighbors if indices are not from the same source file
            if src != trg_src {
                return;
            }

            let txt = if let Ok(txt) = self.chunk(data) {
                txt
            } else {
                return;
            };

            if !chunks.is_empty() {
                let i = chunks.len() - 1;
                chunks[i].0 = if let Ok(t) = dedup_text(&chunks[i].0, &txt) {
                    t
                } else {
                    return;
                }
            }

            chunks.push((txt, index));
        });

        // We have deduplicated text, let's prepare them in the before/ after kind of structure
        let mut result = (vec![], String::new(), vec![]);
        chunks.into_iter().for_each(|(s, i)| match i.cmp(&idx) {
            Ordering::Less => result.0.push(s),
            Ordering::Equal => result.1 = s,
            Ordering::Greater => result.2.push(s),
        });

        Ok(result)
    }

    /// Given a datapoint, returns the text chunk for that datapoint
    pub fn chunk(&self, data: &Data) -> Result<String> {
        let df = if let Some(df) = self.data_file.as_ref() {
            df
        } else {
            return Err(anyhow!("Store not initialized!"));
        };

        let mut f = df
            .lock()
            .map_err(|e| anyhow!("error acquiring data file lock: {e:?}"))?;
        f.seek(std::io::SeekFrom::Start(data.start as u64))?;
        let mut buf = vec![0; data.length];
        f.read_exact(&mut buf)?;

        String::from_utf8(buf).map_err(|e| anyhow!(e))
    }
}
We figure out the start and end indices based on the given index and k, then fetch the relevant chunks and deduplicate them before returning.
To keep our search context from exploding, we compress retrieved information while preserving query-relevant content. We decide which text chunks to compress through summarization based on some heuristics, then call the LLM to generate a summary such that the context and query-related information is not lost.
Let’s add a method to struct Generator for summarization:
// code omitted ..
#[derive(Debug, Deserialize)]
pub struct Summary {
    heading: String,
    summary: String,
}

impl Summary {
    pub fn summary(&self) -> &str {
        &self.summary
    }

    pub fn heading(&self) -> &str {
        &self.heading
    }
}

impl Generator {
    /// Generates summaries of given text
    pub fn summarize(&mut self, queries: &str, context: &str) -> Result<Summary> {
        let prompt = format!(
            "<|start_header_id|>system<|end_header_id|>
You are a smart and intelligent AI assistant generating a heading and summary of a given data so that it can be used for answering the user queries.<|eot_id|><|start_header_id|>user<|end_header_id|>
Queries:
```
{queries}```
Data:
```
{context}```
Generate a short summary and a heading for the given data that:
- Reflects the essence, tone and information of the data
- Retains all key facts
- Are concise and clear
- Can be used as evidence to answer given queries
Requirements:
- Heading should reflect the topic and essence of the data
- Summary and heading should be relevant to the source data's intent, purpose and context
- use natural language for summary
- All key facts should be retained
- Summary should not be more than 350 words
Schema:
{{ heading: string,
summary: string
}}Answer must be a valid json.<|eot_id|><|start_header_id|>assistant<|end_header_id|>
{{\"heading\": \""
        );

        let tk = self.generate(&prompt)?;

        serde_json::from_str::<Summary>(format!("{{\n\"heading\": \"{tk}").as_str())
            .map_err(|e| anyhow!(e))
    }
}
We’ll leave the decision to call the summarization to the final method App::search( .. ).
In Fusion Retrieval we’ll run a keyword search in parallel with our semantic search and factor in results from both when producing the final ranking.
For the keyword search we’ll use the BM25 ranking function.
BM25 is a bag-of-words retrieval function that ranks a set of documents based on the query terms appearing in each document, regardless of their proximity within the document. It is a family of scoring functions with slightly different components and parameters.
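For reference, the standard Okapi BM25 score of a document D for a query Q is (with k_1 and b as the usual tuning parameters, commonly k_1 between 1.2 and 2.0 and b around 0.75):

$$
\mathrm{score}(D, Q) = \sum_{q_i \in Q} \mathrm{IDF}(q_i) \cdot \frac{f(q_i, D)\,(k_1 + 1)}{f(q_i, D) + k_1\left(1 - b + b\,\frac{|D|}{\mathrm{avgdl}}\right)}
$$

Here f(q_i, D) is the frequency of term q_i in D, |D| is the document length and avgdl is the average document length across the corpus; the search engine we build below maintains these statistics for us.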
impl Store {
    // code omitted

    // We break apart the index builders to separate functions and build the ANN and BM25 index in parallel
    fn build_index(&mut self, num_trees: usize, max_size: usize) -> Result<()> {
        let (ann, bm25) = rayon::join(
            || {
                Self::build_ann(
                    &self.dir.join(EMBED_FILE),
                    num_trees,
                    max_size,
                    self.data.len(),
                )
            },
            || {
                let docs = self
                    .data
                    .iter()
                    .enumerate()
                    .filter_map(|(idx, d)| {
                        let chunk = match self.chunk(d) {
                            Ok(c) => c,
                            Err(e) => {
                                eprintln!("Error while reading chunk: {e:?}");
                                return None;
                            }
                        };

                        Some(Document {
                            id: idx,
                            contents: chunk,
                        })
                    })
                    .collect::<Vec<_>>();

                Self::build_bm25(docs)
            },
        );

        self.index = Some(ann?);
        self.bm25 = Some(bm25?);

        Ok(())
    }

    // Builds the BM25 index
    fn build_bm25(docs: Vec<Document<usize>>) -> Result<SearchEngine<usize>> {
        let engine = SearchEngineBuilder::<usize>::with_documents(Language::English, docs).build();

        Ok(engine)
    }
}
Finally, our method search( .. ) needs modifications to incorporate a parallel Nearest Neighbor and BM25 index lookup.
impl Store {
    // code omitted ..

    /// API for search into the index
    pub fn search(
        &self,
        qry: &[Tensor],
        qry_str: &[String],
        top_k: usize,
        ann_cutoff: Option<f32>,
        with_bm25: bool,
    ) -> Result<Vec<(usize, &Data, String, f32)>> {
        // Giving 75% weightage to the ANN search and 25% to BM25 search
        const ALPHA: f32 = 0.75;

        // Let's get the ANN scores and BM25 scores in parallel
        let (ann, bm25) = rayon::join(
            || {
                let ann = DashMap::new();
                if let Some(index) = &self.index {
                    qry.par_iter().for_each(|q| {
                        let res = match index.search_approximate(q, top_k * 4, ann_cutoff) {
                            Ok(d) => d,
                            Err(e) => {
                                eprintln!("Error in search_approximate: {e}");
                                return;
                            }
                        };

                        res.iter().for_each(|(idx, score)| {
                            let idx = *idx;
                            if let Some(d) = self.data.get(idx) {
                                let txt = if let Ok(c) = self.chunk(d) {
                                    c
                                } else {
                                    return;
                                };

                                let mut e = ann.entry(idx).or_insert((d, txt, *score));
                                if e.2 < *score {
                                    e.2 = *score;
                                }
                            }
                        });
                    });
                }
                ann
            },
            || {
                if !with_bm25 {
                    return None;
                }

                let bm25 = DashMap::new();
                if let Some(b) = self.bm25.as_ref() {
                    qry_str.par_iter().for_each(|qs| {
                        let res = b.search(qs, top_k * 4);
                        res.par_iter().for_each(|r| {
                            let mut e = bm25.entry(r.document.id).or_insert(r.score);
                            if *e < r.score {
                                *e = r.score;
                            }
                        });
                    });
                };

                Some(bm25)
            },
        );

        // Now, we have the highest ANN and BM25 scores for the set of queries
        // We'll need to create a `combined` score of the two
        // Based on https://github.com/NirDiamant/RAG_Techniques/blob/main/all_rag_techniques_runnable_scripts/fusion_retrieval.py
        // the steps are:
        // 1. Normalize the vector search score
        // 2. Normalize the bm25 score
        // 3. combined_scores = some alpha * vector_scores + (1 - alpha) * bm25_scores

        // To normalize the ANN Scores, let's go ahead and get the Max/ Min
        let mut ann_max = 0_f32;
        let mut ann_min = f32::MAX;
        ann.iter().for_each(|j| {
            ann_max = j.2.max(ann_max);
            ann_min = j.2.min(ann_min);
        });
        let ann_div = ann_max - ann_min;

        // And same for bm25 scores
        let mut bm25_max = 0_f32;
        let mut bm25_min = f32::MAX;
        let has_bm_25 = bm25.as_ref().map_or(false, |b| !b.is_empty());
        let bm25_div = if has_bm_25 {
            if let Some(b) = bm25.as_ref() {
                b.iter().for_each(|j| {
                    bm25_max = j.max(bm25_max);
                    bm25_min = j.min(bm25_min);
                });
                bm25_max - bm25_min
            } else {
                f32::MIN
            }
        } else {
            f32::MIN
        };

        // Ok, time to normalize our scores and create a combined score for each of them
        let mut combined = ann
            .par_iter()
            .map(|j| {
                let id = *j.key();
                let ann_score = 1. - (j.2 - ann_min) / ann_div;
                let bm25_score = if has_bm_25 {
                    if let Some(b) = bm25.as_ref().and_then(|b| b.get(&id)) {
                        (*b - bm25_min) / bm25_div
                    } else {
                        // Some very small number if not present
                        0.
                    }
                } else {
                    0.
                };

                let combined = ALPHA * ann_score + (1. - ALPHA) * bm25_score;

                (id, j.0, j.1.clone(), combined)
            })
            .collect::<Vec<_>>();

        combined.par_sort_unstable_by(|a, b| b.3.total_cmp(&a.3));

        Ok(combined[0..top_k.min(combined.len())].to_vec())
    }
}
Go ahead and modify our existing test-case to get this to work.
This is the final technique under consideration; it involves generating a relevance score for every retrieved document against the source query.
There are two major ways of achieving this:
LLM Based Scoring: where we send the retrieved documents to an LLM and ask it to generate a relevancy score for each document against the source query.
Cross Encoders: a model that outputs a similarity score between two input texts.
LLM Based Scoring is simple enough to implement - effectively we just prompt the model to generate a score on a scale of 1-10 for each document. Note that we are already doing this with our method Generator::find_relevant( .. ); we then sort the results by score to produce the re-ranked context, as sketched below.
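In isolation, that re-ranking step boils down to dropping everything below the cutoff and sorting the (id, score) pairs in descending order - a minimal sketch:

// A minimal sketch of re-ranking over the (id, score) pairs
// returned by Generator::find_relevant( .. ).
fn rerank(mut scored: Vec<(usize, f32)>, cutoff: f32) -> Vec<usize> {
    // Drop everything below the relevance cutoff
    scored.retain(|(_, score)| *score >= cutoff);
    // Highest score first
    scored.sort_unstable_by(|a, b| b.1.total_cmp(&a.1));
    scored.into_iter().map(|(id, _)| id).collect()
}

The full version, batched and wired into the app, follows in App::find_relevant( .. ) below.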
RAG is like an orchestra - we have all our instruments ready but now comes the art of conducting them in harmony. While each technique brings its own strengths, we need to fine-tune our search configurations to strike the perfect balance between lightning-fast responses and pinpoint accuracy, without letting our model become too rigid or too vague in its outputs.
impl App {
    // code omitted

    // A function to run the `relevance` pass
    async fn find_relevant(
        &self,
        qry: &[String],
        cutoff: f32,
        res: &[StoreDataRepr<'_>],
        window: &Window,
    ) -> Result<Vec<usize>> {
        // A relevance cutoff greater than `0` means we have activated the flow
        if cutoff == 0. {
            return Ok(Vec::new());
        }

        Self::send_event(
            window,
            OpResult::Status(StatusData {
                head: "Starting `Relevance` and Re-ranking pass".to_string(),
                body: format!("Relevance cutoff: {}", cutoff),
                ..Default::default()
            }),
        )
        .await?;

        let mut gen = self.gen.lock().await;
        let llm = if let Some(gen) = gen.as_mut() {
            gen
        } else {
            return Err(anyhow!("generator not found"));
        };

        let start = Instant::now();

        // Sometimes the LLM ends up returning duplicates, this is to clean them out
        let mut unq = HashSet::new();

        // If we send ALL our response, we'll probably run out of context length
        // So, let's chunk this
        let mut relevant = res
            .chunks(8)
            .filter_map(|c| {
                let batched = c.par_iter().map(|k| (k.0, k.2.clone())).collect::<Vec<_>>();
                llm.find_relevant(qry, &batched).ok()
            })
            .flatten()
            .filter(|(idx, score)| {
                if unq.contains(idx) || *score < cutoff {
                    false
                } else {
                    unq.insert(*idx);
                    true
                }
            })
            .collect::<Vec<_>>();

        relevant.par_sort_by(|a, b| b.1.total_cmp(&a.1));
        // println!("Relevant: {relevant:?}");

        Self::send_event(
            window,
            OpResult::Status(StatusData {
                head: format!("Filtered {} relevant results", relevant.len()),
                body: String::new(),
                time_s: Some((Instant::now() - start).as_secs_f32()),
                ..Default::default()
            }),
        )
        .await?;

        Ok(relevant.iter().map(|(idx, _)| *idx).collect::<Vec<_>>())
    }
}
Great, so that will return a set of filtered and re-ranked results if the relevance cutoff is > 0.0.
Now it’s time to put together the function to get k adjacent chunks for context enrichment.
impl App {
    // code omitted ..

    // Returns k_adjacent text if it's > 0
    async fn k_adjacent(
        &self,
        k_adjacent: usize,
        data: &[StoreDataRepr<'_>],
        window: &Window,
    ) -> Result<Vec<(usize, String)>> {
        if k_adjacent == 0 {
            return Ok(data
                .iter()
                .map(|(idx, _, txt, _)| (*idx, txt.to_owned()))
                .collect::<Vec<_>>());
        }

        Self::send_event(
            window,
            OpResult::Status(StatusData {
                head: "Context enhancement: Expanding search context".to_string(),
                body: format!("<i>K</i> Adjacent: {}", k_adjacent),
                ..Default::default()
            }),
        )
        .await?;

        let store = self.store.read().await;
        let start = Instant::now();

        let enhanced = data
            .iter()
            .filter_map(|(idx, _, _, _)| {
                let a = store.with_k_adjacent(*idx, k_adjacent).ok()?;
                let txt = [a.0.join("\n").as_str(), &a.1, a.2.join("\n").as_str()].join("\n\n");

                Some((*idx, txt))
            })
            .collect::<Vec<_>>();

        Self::send_event(
            window,
            OpResult::Status(StatusData {
                head: format!("Context enhanced with {k_adjacent} adjacent"),
                body: String::new(),
                time_s: Some((Instant::now() - start).as_secs_f32()),
                ..Default::default()
            }),
        )
        .await?;

        Ok(enhanced)
    }
}
Finally we’ll generate the final context for our search and summarize parts of the context that cross a certain threshold.
Summarization Threshold
So we have a total context length of 4096 tokens - I’ll leave 1/4th of that for the final answer which leaves us with 3072 tokens for our context (including the system prompt). Now, our default prompt without any additional data takes ~250 tokens, which means we are left with around 2800 tokens for our context.
So, we’ll define our threshold as follows:
while total context > 2800:
Summarize largest chunk
Let’s put together a helper function to calculate tokens for the current context.
impl App {
    // code omitted

    // Computes `token` count related information of given text
    // returns: (total_tokens, max_tokens, max_token_idx)
    async fn compute_tokens(&self, data: &[(usize, String)]) -> Result<(usize, usize, usize)> {
        let mut g = self.gen.lock().await;
        let gen = if let Some(g) = g.as_mut() {
            g
        } else {
            return Err(anyhow!("Generator not ready!"));
        };

        // Total size of encoded tokens
        let mut total_tokens = 0;
        // Chunk with max tokens
        let mut max_token_idx = 0;
        let mut max_tokens = 0;

        data.iter().enumerate().for_each(|(i, (_, txt))| {
            let tokenized = gen.tokenize(txt).unwrap().len();
            total_tokens += tokenized;
            if tokenized > max_tokens {
                max_token_idx = i;
                max_tokens = tokenized;
            }
        });

        Ok((total_tokens, max_tokens, max_token_idx))
    }
}
Then, the function to create the context for us - it uses the method App::compute_tokens( .. ) to calculate the total context size, decide whether a summarization( .. ) pass is required, and execute it.
impl App {
    // code omitted

    // So we have a total context length of *4096* tokens
    // leave 1/4th of that for the final *answer* which leaves us with *3072* tokens for our *context* (including the system prompt).
    // Now, our default prompt without any additional data takes *~250* tokens, which means we are left with around *2800* tokens for our context.
    // So, we'll define our `threshold` as follows:
    // ```
    // while total context > 2500:
    //     Summarize largest chunk
    // ```
    async fn create_context(
        &self,
        qry: &str,
        data: &mut [(usize, String)],
        window: &Window,
    ) -> Result<String> {
        Self::send_event(
            window,
            OpResult::Status(StatusData {
                head: "Context generation:".to_string(),
                body: "Generating final context".to_string(),
                ..Default::default()
            }),
        )
        .await?;

        let start = Instant::now();

        let (mut total_tokens, _, mut max_token_idx) = self.compute_tokens(data).await?;

        // Tracking the number of summaries generated
        let mut iter = 0;

        while total_tokens > Self::MAX_CTX_TOK {
            // Break if we have visited at-least data.len() of summaries
            // Nothing more can be done with this
            iter += 1;
            println!("Pre loop[{iter}]: {total_tokens} {max_token_idx}");
            if iter > data.len() {
                break;
            }

            // This scope is required because the `.lock()` will block and next iterations of tokens will not be computed
            {
                // We need to run a summarization pass for max tokens
                let mut g = self.gen.lock().await;
                let gen = if let Some(g) = g.as_mut() {
                    g
                } else {
                    return Err(anyhow!("Generator not ready!"));
                };

                Self::send_event(
                    window,
                    OpResult::Status(StatusData {
                        head: "Context generation: Summarizing a datapoint".to_string(),
                        body: "Generating summary of a text chunk to fit it in context!".to_string(),
                        ..Default::default()
                    }),
                )
                .await?;

                let summarystart = Instant::now();
                let summary = gen.summarize(qry, &data.get(max_token_idx).unwrap().1)?;
                data[max_token_idx] = (
                    data[max_token_idx].0,
                    format!("## {}\n{}", summary.heading(), summary.summary()),
                );

                Self::send_event(
                    window,
                    OpResult::Status(StatusData {
                        head: "Context generation: Datapoint summarized".to_string(),
                        time_s: Some((Instant::now() - summarystart).as_secs_f32()),
                        ..Default::default()
                    }),
                )
                .await?;
            }

            (total_tokens, _, max_token_idx) = self.compute_tokens(data).await?;
            println!("In loop[{iter}]: {total_tokens} {max_token_idx}");
        }

        println!("Beginning context generation!");
        let ctx = data
            .iter()
            .map(|(idx, txt)| format!("Source: {idx}\n{}\n-------------\n", txt.trim()))
            .collect::<Vec<_>>()
            .join("")
            .trim()
            .to_string();

        Self::send_event(
            window,
            OpResult::Status(StatusData {
                head: "Context generated".to_string(),
                time_s: Some((Instant::now() - start).as_secs_f32()),
                ..Default::default()
            }),
        )
        .await?;

        Ok(ctx)
    }
}
Well, guess we have everything we need to generate the answer! Time for the final method App::search( .. ) to glue all of these together and run our search flow.
impl App {
    // code omitted

    // Trigger the search flow - the search pipeline
    async fn search(&self, qry: &str, cfg: &SearchConfig, res_send: &Window) -> Result<()> {
        let mut final_result = SearchResult {
            qry: qry.to_string(),
            ..Default::default()
        };

        if let Err(e) = self.ensure_generator(res_send).await {
            println!("App::search: error while loading LLaMA: {e:?}");
            Self::send_event(
                res_send,
                OpResult::Error("Error Loading Generator".to_string()),
            )
            .await?;
            return Err(anyhow!("Error Loading Generator"));
        }

        let search_start = Instant::now();

        // Step 1: query preprocessing
        let (qry_more, q_txt, q_tensor) =
            match self.query_preproc(qry, cfg.n_sub_qry, res_send).await {
                Ok(r) => r,
                Err(e) => {
                    println!("App::search: error during sub query decomposition: {e:?}");
                    Self::send_event(
                        res_send,
                        OpResult::Error("Error during subquery decomposition".to_string()),
                    )
                    .await?;
                    return Err(anyhow!("Error during subquery decomposition"));
                }
            };

        // Step 2: Approximate nearest neighbor search
        Self::send_event(
            res_send,
            OpResult::Status(StatusData {
                head: "Firing Approx. Nearest Neighbor search".to_string(),
                body: format!(
                    "<b>BM25:</b> {} | <b>ANN Cutoff:</b> {}",
                    cfg.with_bm25,
                    cfg.ann_cutoff.map_or(0., |c| c)
                ),
                ..Default::default()
            }),
        )
        .await?;

        let store = self.store.read().await;
        let (res, elapsed) = {
            let start = Instant::now();
            let res = store.search(
                &q_tensor,
                &[qry_more.topic().to_string()],
                cfg.max_result,
                cfg.ann_cutoff,
                cfg.with_bm25,
            )?;

            (res, (Instant::now() - start).as_secs_f32())
        };

        Self::send_event(
            res_send,
            OpResult::Status(StatusData {
                head: format!("ANN Search yielded {} results", res.len()),
                body: String::new(),
                time_s: Some(elapsed),
                ..Default::default()
            }),
        )
        .await?;

        // Keep initial findings, if the search errors out
        let mut res_map = HashMap::new();
        res.iter().for_each(|r| {
            res_map.insert(r.0, r.to_owned());
        });

        // Step 3: Check for relevance and re-rank
        let relevant = match self
            .find_relevant(&q_txt, cfg.relevance_cutoff, &res[..], res_send)
            .await
        {
            Ok(r) => {
                if r.is_empty() {
                    res
                } else {
                    r.iter()
                        .filter_map(|idx| {
                            let dp = res_map.get(idx)?;
                            Some(dp.to_owned())
                        })
                        .collect::<Vec<_>>()
                }
            }
            Err(e) => {
                println!("App::search: error during relevance filtering: {e:?}");
                Self::send_event(
                    res_send,
                    OpResult::Error("Error during relevance filtering".to_string()),
                )
                .await?;
                return Err(anyhow!("Error during relevance filtering"));
            }
        };

        // Step 4: context augmentation - get adjacent data
        let mut enhanced = match self
            .k_adjacent(cfg.k_adjacent, &relevant[..], res_send)
            .await
        {
            Ok(e) => e,
            Err(e) => {
                println!(
                    "App::search: error during fetching of {} adjacent: {e:?}",
                    cfg.k_adjacent
                );
                Self::send_event(
                    res_send,
                    OpResult::Error("Error during context enhancement".to_string()),
                )
                .await?;
                return Err(anyhow!("Error during context enhancement"));
            }
        };

        // We have enhanced context now, let's summarize the context if needed
        let qry_str = q_txt.join("\n");
        let ctx = match self
            .create_context(&qry_str, &mut enhanced[..], res_send)
            .await
        {
            Ok(c) => c,
            Err(e) => {
                println!("App::search: generating context: {e:?}");
                Self::send_event(
                    res_send,
                    OpResult::Error("Error generating context".to_string()),
                )
                .await?;
                return Err(anyhow!("Error during context generation"));
            }
        };

        if ctx.is_empty() && !cfg.allow_without_evidence {
            return Self::send_event(res_send, OpResult::Error("Nothing found!".to_string())).await;
        }

        // Step 5: Finally the answer
        Self::send_event(
            res_send,
            OpResult::Status(StatusData {
                head: "Generating answer!".to_string(),
                body: String::new(),
                ..Default::default()
            }),
        )
        .await?;

        let (ans, elapsed) = {
            let mut gen = self.gen.lock().await;
            let llm = if let Some(gen) = gen.as_mut() {
                gen
            } else {
                return Err(anyhow!("generator not found"));
            };

            let start = Instant::now();
            let answer = llm.answer(qry_more.topic(), qry_more.source(), &ctx)?;

            (answer, (Instant::now() - start).as_secs_f32())
        };

        Self::send_event(
            res_send,
            OpResult::Status(StatusData {
                head: "Finally, generated answer!".to_string(),
                body: String::new(),
                time_s: Some(elapsed),
                ..Default::default()
            }),
        )
        .await?;

        final_result.answer = ans.answer().to_string();

        if ctx.is_empty() {
            final_result.files = Vec::new();
            final_result.evidence = Vec::new();
        } else {
            let mut file_list = HashSet::new();
            final_result.evidence = ans
                .evidence()
                .iter()
                .filter_map(|e| {
                    let evidence = res_map.get(&e.index())?.1.file();
                    let (file, page) = match evidence {
                        FileKind::Pdf((pth, pg)) => {
                            file_list.insert(pth.to_owned());
                            (pth.to_str()?.to_string(), Some(*pg))
                        }
                        FileKind::Text(pth) => {
                            file_list.insert(pth.to_owned());
                            (pth.to_str()?.to_string(), None)
                        }
                        FileKind::Html(pth) => {
                            file_list.insert(pth.to_owned());
                            (pth.to_str()?.to_string(), None)
                        }
                    };

                    Some(Evidence {
                        text: e.text().to_string(),
                        file,
                        page,
                    })
                })
                .collect::<Vec<_>>();

            final_result.files = file_list
                .iter()
                .filter_map(|f| f.to_str().map(|s| s.to_string()))
                .collect::<Vec<_>>();
        }

        final_result.elapsed = (Instant::now() - search_start).as_secs_f32();

        Self::send_event(res_send, OpResult::Result(final_result)).await?;

        Ok(())
    }
}
We are just calling the different methods we created for the various techniques; the last part is simply figuring out the source files and including them as part of the result!
Time to see what came out of our reasonably involved effort! Let’s run our app.
cargo tauri dev --release
Note
I’m leaving the client side implementation up to you!
This is in no way close to being foolproof; in fact, I have added a bunch of negative cases in the video where our QA misfires! Here are some quick observations from this experience:
Speed: putting all these techniques together often leads to unusably slow responses! We could work on a way of dynamically deciding which techniques to apply.
Accuracy: relying on LLMs for all of this is still prone to hallucinations. E.g. in our example I asked the model about Mozilla Firefox; I know the current dataset doesn’t contain anything about Firefox, and the model correctly pointed out that there are no references to Firefox in the context, but it still sent across some random text as evidence!! Playing around with prompts - popularly referred to as Prompt Engineering - might give us better and more deterministic results!
BM25 used for Fusion Retrieval works great when we are dealing with uncommon words! If you search for something common like news (even though we are removing some stop-words), the results go for a toss. We could decide whether to use BM25 based on the incoming query instead of a pre-set configuration, and possibly ask the LLM to generate the right keywords for us!
The JSON responses from the LLM in the current implementation are prone to breaking! Something like a BNF grammar is required to get this right - see the sketch after this list for a simpler stopgap.
Referencing the source of information needs some work; our current implementation is rather naive!
We could create a conversational version of our QA!
The Store is not mature enough to handle deletions and re-insertions - you could build on that!
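Until something like grammar-constrained decoding is wired in, one stopgap for the brittle JSON handling noted above could be to pull the first balanced JSON object out of the raw model output before handing it to serde_json. A sketch - extract_json_object is a hypothetical helper, not part of the current codebase:

// A stopgap sketch, not a substitute for a proper grammar: extract the first
// balanced {...} object from the model's raw output, skipping braces that
// appear inside string literals.
fn extract_json_object(raw: &str) -> Option<&str> {
    let start = raw.find('{')?;
    let mut depth = 0usize;
    let mut in_str = false;
    let mut escaped = false;

    for (i, c) in raw[start..].char_indices() {
        if in_str {
            match c {
                '\\' if !escaped => escaped = true,
                '"' if !escaped => in_str = false,
                _ => escaped = false,
            }
            continue;
        }
        match c {
            '"' => in_str = true,
            '{' => depth += 1,
            '}' => {
                depth -= 1;
                if depth == 0 {
                    // '}' is ASCII, so this byte range ends exactly at the closing brace
                    return Some(&raw[start..=start + i]);
                }
            }
            _ => {}
        }
    }

    None
}

You’d then have the model emit the full object (rather than pre-seeding the opening of the JSON in the prompt) and deserialize whatever this returns.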
I’ll leave some of these implementations up to you!
Before we close
A word of caution before we close: this software is NOT READY FOR PRODUCTION; it has scope for tons of improvements and corner cases still to be figured out!
Whether you’re excited to build, found a bug to squash, or just want to geek out about the possibilities - I’d love to hear from you! Drop me a line @beingAnubhab. And if this guide sparked your creativity, why not fork it and add your own magic? Your contributions could help others unlock even more potential. If you found this helpful, a quick share could inspire more developers and me. Let’s keep building amazing things together! 🚀