Part 5: Desktop App for Document QA with RAG - Techniques
A DIY-style, step-by-step guide to building your own cutting-edge GenAI-powered document QA desktop app with RAG. In this fifth and final instalment of the series we evaluate and implement some RAG techniques for better search results.
September 10, 2024 · 34 min · 7230 words
In this blog series on building a Desktop Document QA app, we’ve tackled several key challenges: crafting document embeddings, building a vector store inspired by Spotify’s ANNOY, and developing a document layout analysis and extraction pipeline powered by a Detectron2 model. With indexing and basic QA flows now complete, we’re ready for the next phase of our journey - evaluating and implementing some cutting-edge techniques for better RAG.
Taking inspiration from Nir Diamant’s comprehensive RAG Techniques reference, we’ll enhance our pipeline by implementing select methods from his catalog.
Document Relevancy Filtering: By using a binary relevancy score generated by a language model, only the most relevant documents are passed on to the answer generation phase, reducing noise and improving the quality of the final answer.
Hallucination Check: Before finalizing the answer, the system checks for hallucinations by verifying that the generated content is fully supported by the retrieved documents.
Snippet Highlighting: This feature enhances transparency by showing the exact segments from the retrieved documents that contributed to the final answer.
Let’s implement Document Relevancy Filtering, Snippet Highlighting and a form of Hallucination Check to ensure the Generator finds evidence from the context before answering.
Tradeoffs
For production systems, consider the tradeoffs between accuracy and speed. Some of these techniques effectively mean making more LLM inference calls, which will slow down retrieval and increase runtime resource consumption.
For Document Relevancy Filtering we’ll add a method to our struct Generator where we prompt LLaMA to return the ids of the relevant text sections. This is akin to a binary filter: we ask the LLM to return only the passages that matter while ignoring the rest.
// code omitted ..
#[derive(Debug, Deserialize)]
pub struct Relevant {
    relevant: Vec<DocRelevance>,
}

#[derive(Debug, Deserialize)]
pub struct DocRelevance {
    id: usize,
    score: f32,
}

impl Relevant {
    fn to_list(&self) -> Vec<(usize, f32)> {
        self.relevant
            .iter()
            .map(|r| (r.id, r.score))
            .collect::<Vec<_>>()
    }
}

impl Generator {
    /// Given a set of queries and a set of documents, return a list of indices that are relevant to the queries
    pub fn find_relevant(
        &mut self,
        query: &[String],
        docs: &[(usize, String)],
    ) -> Result<Vec<(usize, f32)>> {
        let docfmt = docs
            .iter()
            .map(|(idx, txt)| format!("Id: {idx}\n{txt}\n-------------"))
            .collect::<Vec<_>>()
            .join("\n");

        let prompt = format!(
            "<|start_header_id|>system<|end_header_id|>
You are an intelligent and diligent AI who analyses text documents to figure out if a particular document contains relevant information for answering a set of queries. You must follow the given requirements while analysing and scoring the documents for your answer.<|eot_id|><|start_header_id|>user<|end_header_id|>
Documents:
```
{}```
Queries:
```
- {}```
Task:
Identify the ids of documents that are relevant for generating answers to the given queries and rate them in a scale of 1-10 where a score of 10 is most relevant.
Requirements:
- Only include ids of documents containing relevant information.
- If no documents are relevant the field \"relevant\" must be an empty array.
- Do not write any note, introduction, summary or justifications.
- Your answer must be a valid JSON of the following Schema.
Schema:
{{\"relevant\": Array<{{\"id\": numeric id, \"score\": numeric score}}>
}}<|eot_id|><|start_header_id|>assistant<|end_header_id|>
{{\"relevant\": [",
            docfmt,
            query.join("\n- ")
        );
        // println!("Relevance prompt:\n{prompt}");

        let tk = self.generate(&prompt)?;

        match serde_json::from_str::<Relevant>(format!("{{\n\t\"relevant\": [{tk}").as_str()) {
            Ok(d) => Ok(d.to_list()),
            Err(e) => {
                println!("Generator::find_relevant: error while deserializing: {e:?}\n{tk:?}\n");
                Err(anyhow!(e))
            }
        }
    }
}
Note that we are getting the model to generate scores in a range of 1-10 for each document chunk; this will come in handy when we implement Intelligent Reranking later.
In our previous post we had already kept room for evidence in our answer prompt; time to use that for our Snippet Highlighting implementation.
We added a new struct Evidence to hold the index of the source text chunk along with the supporting text picked up during the generation pass. Inspecting our answer prompt in the method answer( .. ), you’ll notice that we have already been asking the model to pick out the evidence, so no changes there.
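For reference, the Generator-side Evidence is roughly the shape sketched below - an index into the retrieved chunks plus the supporting text. The actual definition lives in the code from the previous post, so treat this as an approximation.

// An approximation of the Generator-side `Evidence` described above;
// the real definition is in the Generator module from the previous post.
#[derive(Debug, Deserialize)]
pub struct Evidence {
    index: usize,
    text: String,
}

impl Evidence {
    /// Index of the source text chunk this evidence was lifted from
    pub fn index(&self) -> usize {
        self.index
    }

    /// The supporting text quoted by the model
    pub fn text(&self) -> &str {
        &self.text
    }
}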
This approach already tackles hallucinations to some degree; for a more complete implementation you’ll probably need to pass the response through a separate generation flow that specifically checks the validity of the generated answer.
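To make that concrete, here is a minimal sketch of what such a verification pass could look like: a hypothetical Generator::check_grounding( .. ) method that re-prompts the model to verify the answer against the retrieved context. The method name, the prompt and the Grounding struct are illustrative assumptions; only the self.generate( .. ) call mirrors the existing code.

// A sketch, not part of the actual codebase: a hypothetical grounding-check pass
// that follows the same prompt-and-parse pattern as the other Generator methods.
#[derive(Debug, Deserialize)]
pub struct Grounding {
    supported: bool,
    unsupported_claims: Vec<String>,
}

impl Grounding {
    pub fn is_supported(&self) -> bool {
        self.supported
    }

    pub fn unsupported_claims(&self) -> &[String] {
        &self.unsupported_claims[..]
    }
}

impl Generator {
    /// Hypothetical: asks the model whether `answer` is fully supported by `context`
    pub fn check_grounding(&mut self, answer: &str, context: &str) -> Result<Grounding> {
        let prompt = format!(
            "<|start_header_id|>system<|end_header_id|>
You are a strict fact checker. Given a context and an answer, decide if every claim in the answer is supported by the context.<|eot_id|><|start_header_id|>user<|end_header_id|>
Context:
{context}
Answer:
{answer}
Respond with a valid JSON of the Schema:
{{\"supported\": boolean, \"unsupported_claims\": Array<string>}}<|eot_id|><|start_header_id|>assistant<|end_header_id|>
{{\"supported\": "
        );

        let tk = self.generate(&prompt)?;

        serde_json::from_str::<Grounding>(format!("{{\n\"supported\": {tk}").as_str())
            .map_err(|e| anyhow!(e))
    }
}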
Finding the optimal chunk size isn’t straightforward - it depends on your embedding model, vector store capabilities, and the nature of your documents. While experimentation is key, our choice is guided by Stella_en_1.5B_v5’s training context of 512 tokens.
Breaking the text down into concise, complete, meaningful sentences allows for better control and handling of specific queries (especially when extracting knowledge).
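As a rough illustration (the actual splitting lives in the indexing pipeline from the earlier parts of this series), a sentence-aware splitter with overlap could look like the sketch below. It approximates token counts with whitespace-separated words; the real pipeline should count tokens with the embedding model's tokenizer.

// A simplified, standalone sketch of sentence-aware chunking with overlap.
// "Tokens" are approximated by whitespace-separated words here, so chunks
// stay only approximately under `max_tokens`.
fn chunk_by_sentences(text: &str, max_tokens: usize, overlap: usize) -> Vec<String> {
    // Naive sentence split on terminal punctuation
    let sentences: Vec<&str> = text
        .split_inclusive(|c| c == '.' || c == '!' || c == '?')
        .map(str::trim)
        .filter(|s| !s.is_empty())
        .collect();

    let mut chunks = Vec::new();
    let mut current: Vec<&str> = Vec::new();
    let mut current_len = 0;

    for sentence in sentences {
        let len = sentence.split_whitespace().count();
        if current_len + len > max_tokens && !current.is_empty() {
            chunks.push(current.join(" "));

            // Carry the tail of the previous chunk forward as overlap
            let mut carried = 0;
            let mut tail = Vec::new();
            for s in current.iter().rev() {
                carried += s.split_whitespace().count();
                tail.push(*s);
                if carried >= overlap {
                    break;
                }
            }
            tail.reverse();
            current = tail;
            current_len = carried;
        }
        current.push(sentence);
        current_len += len;
    }

    if !current.is_empty() {
        chunks.push(current.join(" "));
    }

    chunks
}

With our 512-token budget and the 1/4 overlap factor used elsewhere, you’d call it roughly as chunk_by_sentences(text, 512, 128).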
The idea is to modify and/or expand a query to improve retrieval effectiveness: rewriting the original query, step-back prompting to generate broader queries for a more holistic retrieval, and sub-query decomposition to break a complex query into simpler sub-queries.
Note
We won’t always need to employ ALL of these techniques; the choice should be governed by the problem at hand. Consider the tradeoffs and experiment with multiple techniques to figure out what works best!
Let’s start with step-back prompting and extend to sub-query decomposition if and when we need to. To do this, we add a method to struct Generator to preprocess the user’s input query.
// .. code omitted
/// A struct to hold `sub queries` and a `topic`
#[derive(Debug, Deserialize)]
pub struct QueryMore {
    #[serde(skip)]
    src: String,
    #[serde(rename = "sub_queries")]
    more: Vec<String>,
    topic: String,
}

impl QueryMore {
    pub fn source(&self) -> &str {
        &self.src
    }

    pub fn sub_queries(&self) -> &[String] {
        &self.more[..]
    }

    pub fn topic(&self) -> &str {
        &self.topic
    }

    pub fn queries(&self) -> Vec<String> {
        [&[self.source().to_string()], self.sub_queries()].concat()
    }
}

impl Generator {
    /// Preprocesses a query to generate `topic` and supplemental queries for `Fusion Retrieval`
    pub fn query_preproc(&mut self, query: &str, num_sub_qry: usize) -> Result<QueryMore> {
        let prompt = format!(
            "<|start_header_id|>system<|end_header_id|>
You are a smart and intelligent AI assistant generating sub-queries and a topic for a Fusion Retrieval system based on a given source query. You always adhere to the given requirements.<|eot_id|><|start_header_id|>user<|end_header_id|>
Given a source query that may require additional context or specific information, generate relevant sub-queries to retrieve more accurate results. Identify a word or a very short phrase that represents the topic of the query.
Source Query:
{query}Generate {num_sub_qry} relevant sub-queries that:
- Are closely related to the source query
- Can be used to retrieve additional context or specific information
- Are concise and clear
Requirements:
- Sub-queries should not repeat the source query
- Sub-queries should be relevant to the source query's intent, purpose and context
- use natural language for sub queries
- your answer should be a valid json of the following schema.
Schema:
{{ sub_queries: Array<string>,
topic: string
}}Answer must be a valid json.<|eot_id|><|start_header_id|>assistant<|end_header_id|>
{{\"sub_queries\": [\""
        );

        let tk = self.generate(&prompt)?;

        let mut res =
            serde_json::from_str::<QueryMore>(format!("{{\n\"sub_queries\": [\"{tk}").as_str())?;
        res.src = query.to_string();

        Ok(res)
    }
}
Todo: write a test!
Test query_preproc( .. ) and edit the prompt if need be!
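If you want a starting point, the sketch below shows the kind of assertions such a test could make. Constructing the Generator (model weights, device, etc.) depends on your setup, so it's written as a helper that a real #[test] would call with an already-loaded generator.

// A rough test sketch: a helper a real #[test] would call with a loaded Generator.
#[cfg(test)]
#[allow(dead_code)]
fn assert_query_expansion(gen: &mut Generator) -> Result<()> {
    let q = "What are the impacts of climate change on the environment?";
    let res = gen.query_preproc(q, 4)?;

    // The source query must be preserved and expanded with sub-queries and a topic
    assert_eq!(res.source(), q);
    assert!(!res.sub_queries().is_empty());
    assert!(!res.topic().is_empty());

    println!("{res:#?}");
    Ok(())
}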
The following are the results from my test runs.
// calling method `fn sub_query(
// "What are the impacts of climate change on the environment?",
// 4
//)`
QueryMore {
    src: "What are the impacts of climate change on the environment?",
    more: [
        "What are the effects of climate change on biodiversity?",
        "How does climate change affect sea levels?",
        "What are the economic impacts of climate change?",
        "How does climate change impact human migration?",
    ],
    topic: "climate change",
}

// calling method `fn sub_query(
//     "What are the latest news about Iraq?",
//     4
// )`
QueryMore {
    src: "What are the latest news about Iraq?",
    more: [
        "Iraqi government updates",
        "Latest news on ISIS in Iraq",
        "Iraq news and current events",
        "Humanitarian situation in Iraq",
    ],
    topic: "Iraq",
}
Neat … our base query has been expanded into related queries, which is very likely to yield better results!
Here we retrieve the relevant chunk along with its neighbors to provide better context to the Generator. We already implemented a flavor of this when we used the overlap parameter during text splitting, but that was while generating the embeddings. Let’s extend the concept so that the final context contains k adjacent chunks.
We need a way of getting adjacent text blocks in the same file, but because of the overlap we’ll also need to remove the duplicated text between two chunks.
/// The end text of `prev` would be common with the beginning of `current`
/// E.g.
/// prev: Hello, how are you? Life is good!
/// current: Life is good! The act of creation keeps us busy!
// The strategy is simple: we pick the midpoint of `current` and keep checking backwards whether `prev` ends with that text
// This will be pretty efficient because with a max token size of 512 and a known overlap factor which is 1/4th of the token size
// this would yield reasonable results
pub fn dedup_text(prev: &str, current: &str) -> Result<String> {
    let cur = current.as_bytes();
    let prv = prev.as_bytes();

    let mut pointer = cur.len() / 2;
    while pointer > 0 {
        if prv.ends_with(&cur[0..pointer]) {
            break;
        }
        pointer -= 1;
    }

    Ok(std::str::from_utf8(&prv[0..prv.len() - pointer])?.to_string())
}
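A quick sanity check with the example from the doc comment confirms that the overlap gets trimmed from the end of prev:

#[cfg(test)]
mod dedup_tests {
    use super::*;

    #[test]
    fn removes_overlap_from_prev() -> Result<()> {
        let prev = "Hello, how are you? Life is good!";
        let current = "Life is good! The act of creation keeps us busy!";

        // The overlapping "Life is good!" should be trimmed from the end of `prev`
        assert_eq!(dedup_text(prev, current)?, "Hello, how are you? ");
        Ok(())
    }
}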
Now, a method to struct Store to return k adjacent chunks before and after the selected index.
impl Store {
    // code omitted ..

    /// Given an index `idx` returns `k` adjacent chunks before and after the index
    /// Returns k text blocks before with overlap removed, the current text with overlap removed and k text blocks after, again overlap removed
    pub fn with_k_adjacent(
        &self,
        idx: usize,
        k: usize,
    ) -> Result<(Vec<String>, String, Vec<String>)> {
        // Let's collect all indices that need to be fetched
        // We have to ensure the indices are in the SAME source file
        let start = idx.saturating_sub(k);
        let end = (idx + k + 1).min(self.data.len());

        let trg_data = if let Some(d) = self.data.get(idx) {
            d
        } else {
            eprintln!("Nothing found for index {idx}. Corrupt store!");
            return Err(anyhow!("corrupt store!"));
        };

        let trg_src = match &trg_data.file {
            FileKind::Text(p) => p.as_path(),
            FileKind::Pdf((p, _)) => p.as_path(),
            FileKind::Html(p) => p.as_path(),
        };

        let mut chunks: Vec<(String, usize)> = Vec::with_capacity(end - start);

        (start..end).for_each(|index| {
            let data = if index == idx {
                trg_data
            } else if let Some(d) = self.data.get(index) {
                d
            } else {
                eprintln!("Nothing found for data point {index}");
                return;
            };

            let src = match &data.file {
                FileKind::Text(p) => p.as_path(),
                FileKind::Pdf((p, _)) => p.as_path(),
                FileKind::Html(p) => p.as_path(),
            };

            // Not neighbors if indices are not from the same source file
            if src != trg_src {
                return;
            }

            let txt = if let Ok(txt) = self.chunk(data) {
                txt
            } else {
                return;
            };

            if !chunks.is_empty() {
                let i = chunks.len() - 1;
                chunks[i].0 = if let Ok(t) = dedup_text(&chunks[i].0, &txt) {
                    t
                } else {
                    return;
                }
            }

            chunks.push((txt, index));
        });

        // We have deduplicated text, let's prepare them in the before/ after kind of structure
        let mut result = (vec![], String::new(), vec![]);
        chunks.into_iter().for_each(|(s, i)| match i.cmp(&idx) {
            Ordering::Less => result.0.push(s),
            Ordering::Equal => result.1 = s,
            Ordering::Greater => result.2.push(s),
        });

        Ok(result)
    }

    /// Given a datapoint, returns the text chunk for that datapoint
    pub fn chunk(&self, data: &Data) -> Result<String> {
        let df = if let Some(df) = self.data_file.as_ref() {
            df
        } else {
            return Err(anyhow!("Store not initialized!"));
        };

        let mut f = df
            .lock()
            .map_err(|e| anyhow!("error acquiring data file lock: {e:?}"))?;
        f.seek(std::io::SeekFrom::Start(data.start as u64))?;
        let mut buf = vec![0; data.length];
        f.read_exact(&mut buf)?;

        String::from_utf8(buf).map_err(|e| anyhow!(e))
    }
}
We figure out the start and end indices based on the given index and k, then fetch the relevant chunks and deduplicate them before returning.
To keep our search context from exploding, we compress retrieved information while preserving query-relevant content. We decide which text chunks to compress through summarization based on some heuristics, then call the LLM to generate a summary such that the context and query-related information is not lost.
Let’s add a method to struct Generator for summarization:
// code omitted ..
#[derive(Debug, Deserialize)]
pub struct Summary {
    heading: String,
    summary: String,
}

impl Summary {
    pub fn summary(&self) -> &str {
        &self.summary
    }

    pub fn heading(&self) -> &str {
        &self.heading
    }
}

impl Generator {
    /// Generates summaries of given text
    pub fn summarize(&mut self, queries: &str, context: &str) -> Result<Summary> {
        let prompt = format!(
            "<|start_header_id|>system<|end_header_id|>
You are a smart and intelligent AI assistant generating a heading and summary of a given data so that it can be used for answering the user queries.<|eot_id|><|start_header_id|>user<|end_header_id|>
Queries:
```
{queries}```
Data:
```
{context}```
Generate a short summary and a heading for the given data that:
- Reflects the essence, tone and information of the data
- Retains all key facts
- Are concise and clear
- Can be used as evidence to answer given queries
Requirements:
- Heading should reflect the topic and essence of the data
- Summary and heading should be relevant to the source data's intent, purpose and context
- use natural language for summary
- All key facts should be retained
- Summary should not be more than 350 words
Schema:
{{ heading: string,
summary: string
}}Answer must be a valid json.<|eot_id|><|start_header_id|>assistant<|end_header_id|>
{{\"heading\": \""
        );

        let tk = self.generate(&prompt)?;

        serde_json::from_str::<Summary>(format!("{{\n\"heading\": \"{tk}").as_str())
            .map_err(|e| anyhow!(e))
    }
}
We’ll leave the decision to call the summarization to the final method App::search( .. ).
In Fusion Retrieval we’ll run a keyword search in parallel with our semantic search and factor in results from both when producing the final ranking.
For the keyword search we’ll use the BM25 ranking function.
BM25 is a bag-of-words retrieval function that ranks a set of documents based on the query terms appearing in each document, regardless of their proximity within the document. It is a family of scoring functions with slightly different components and parameters.
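For reference, the standard Okapi BM25 score of a document D for a query Q is (with k_1 and b as the usual tuning parameters, commonly k_1 between 1.2 and 2.0 and b around 0.75):

$$
\mathrm{score}(D, Q) = \sum_{q_i \in Q} \mathrm{IDF}(q_i) \cdot \frac{f(q_i, D)\,(k_1 + 1)}{f(q_i, D) + k_1\left(1 - b + b\,\frac{|D|}{\mathrm{avgdl}}\right)}
$$

Here f(q_i, D) is the frequency of term q_i in D, |D| is the document length and avgdl is the average document length across the corpus; the search engine we build below maintains these statistics for us.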
impl Store {
    // code omitted

    // We break apart the index builders to separate functions and build the ANN and BM25 index in parallel
    fn build_index(&mut self, num_trees: usize, max_size: usize) -> Result<()> {
        let (ann, bm25) = rayon::join(
            || {
                Self::build_ann(
                    &self.dir.join(EMBED_FILE),
                    num_trees,
                    max_size,
                    self.data.len(),
                )
            },
            || {
                let docs = self
                    .data
                    .iter()
                    .enumerate()
                    .filter_map(|(idx, d)| {
                        let chunk = match self.chunk(d) {
                            Ok(c) => c,
                            Err(e) => {
                                eprintln!("Error while reading chunk: {e:?}");
                                return None;
                            }
                        };

                        Some(Document {
                            id: idx,
                            contents: chunk,
                        })
                    })
                    .collect::<Vec<_>>();

                Self::build_bm25(docs)
            },
        );

        self.index = Some(ann?);
        self.bm25 = Some(bm25?);

        Ok(())
    }

    // Builds the BM25 index
    fn build_bm25(docs: Vec<Document<usize>>) -> Result<SearchEngine<usize>> {
        let engine = SearchEngineBuilder::<usize>::with_documents(Language::English, docs).build();

        Ok(engine)
    }
}
Finally, our method search( .. ) needs modifications to incorporate a parallel Nearest Neighbor and BM25 index lookup.
impl Store {
    // code omitted ..

    /// API for search into the index
    pub fn search(
        &self,
        qry: &[Tensor],
        qry_str: &[String],
        top_k: usize,
        ann_cutoff: Option<f32>,
        with_bm25: bool,
    ) -> Result<Vec<(usize, &Data, String, f32)>> {
        // Giving 75% weightage to the ANN search and 25% to BM25 search
        const ALPHA: f32 = 0.75;

        // Let's get the ANN scores and BM25 scores in parallel
        let (ann, bm25) = rayon::join(
            || {
                let ann = DashMap::new();
                if let Some(index) = &self.index {
                    qry.par_iter().for_each(|q| {
                        let res = match index.search_approximate(q, top_k * 4, ann_cutoff) {
                            Ok(d) => d,
                            Err(e) => {
                                eprintln!("Error in search_approximate: {e}");
                                return;
                            }
                        };

                        res.iter().for_each(|(idx, score)| {
                            let idx = *idx;
                            if let Some(d) = self.data.get(idx) {
                                let txt = if let Ok(c) = self.chunk(d) {
                                    c
                                } else {
                                    return;
                                };

                                let mut e = ann.entry(idx).or_insert((d, txt, *score));
                                if e.2 < *score {
                                    e.2 = *score;
                                }
                            }
                        });
                    });
                }
                ann
            },
            || {
                if !with_bm25 {
                    return None;
                }

                let bm25 = DashMap::new();
                if let Some(b) = self.bm25.as_ref() {
                    qry_str.par_iter().for_each(|qs| {
                        let res = b.search(qs, top_k * 4);
                        res.par_iter().for_each(|r| {
                            let mut e = bm25.entry(r.document.id).or_insert(r.score);
                            if *e < r.score {
                                *e = r.score;
                            }
                        });
                    });
                };

                Some(bm25)
            },
        );

        // Now, we have the highest ANN and BM25 scores for the set of queries
        // We'll need to create a `combined` score of the two
        // Based on https://github.com/NirDiamant/RAG_Techniques/blob/main/all_rag_techniques_runnable_scripts/fusion_retrieval.py
        // the steps are:
        // 1. Normalize the vector search score
        // 2. Normalize the bm25 score
        // 3. combined_scores = some alpha * vector_scores + (1 - alpha) * bm25_scores

        // To normalize the ANN Scores, let's go ahead and get the Max/ Min
        let mut ann_max = 0_f32;
        let mut ann_min = f32::MAX;
        ann.iter().for_each(|j| {
            ann_max = j.2.max(ann_max);
            ann_min = j.2.min(ann_min);
        });
        let ann_div = ann_max - ann_min;

        // And same for bm25 scores
        let mut bm25_max = 0_f32;
        let mut bm25_min = f32::MAX;
        let has_bm_25 = bm25.as_ref().map_or(false, |b| !b.is_empty());
        let bm25_div = if has_bm_25 {
            if let Some(b) = bm25.as_ref() {
                b.iter().for_each(|j| {
                    bm25_max = j.max(bm25_max);
                    bm25_min = j.min(bm25_min);
                });
                bm25_max - bm25_min
            } else {
                f32::MIN
            }
        } else {
            f32::MIN
        };

        // Ok, time to normalize our scores and create a combined score for each of them
        let mut combined = ann
            .par_iter()
            .map(|j| {
                let id = *j.key();
                let ann_score = 1. - (j.2 - ann_min) / ann_div;
                let bm25_score = if has_bm_25 {
                    if let Some(b) = bm25.as_ref().and_then(|b| b.get(&id)) {
                        (*b - bm25_min) / bm25_div
                    } else {
                        // Some very small number if not present
                        0.
                    }
                } else {
                    0.
                };

                let combined = ALPHA * ann_score + (1. - ALPHA) * bm25_score;

                (id, j.0, j.1.clone(), combined)
            })
            .collect::<Vec<_>>();

        combined.par_sort_unstable_by(|a, b| b.3.total_cmp(&a.3));

        Ok(combined[0..top_k.min(combined.len())].to_vec())
    }
}
Go ahead and modify our existing test-case to get this to work.
This is the final technique under consideration; it involves generating a relevance score for every retrieved document against the source query.
There are two major ways of achieving this:
LLM Based Scoring: where we send the retrieved documents to an LLM and ask it to generate a relevancy score for each document against the source query.
Cross Encoders: a model that outputs a similarity score between two input texts.
LLM Based Scoring is simple enough to implement - effectively we just prompt the model to generate a score on a scale of 1-10 for each document. Note that we are already doing this with our method Generator::find_relevant( .. ); we then sort the results by score to produce the re-ranked context, as sketched below.
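In isolation, that re-ranking step boils down to dropping everything below the cutoff and sorting the (id, score) pairs in descending order - a minimal sketch:

// A minimal sketch of re-ranking over the (id, score) pairs
// returned by Generator::find_relevant( .. ).
fn rerank(mut scored: Vec<(usize, f32)>, cutoff: f32) -> Vec<usize> {
    // Drop everything below the relevance cutoff
    scored.retain(|(_, score)| *score >= cutoff);
    // Highest score first
    scored.sort_unstable_by(|a, b| b.1.total_cmp(&a.1));
    scored.into_iter().map(|(id, _)| id).collect()
}

The full version, batched and wired into the app, follows in App::find_relevant( .. ) below.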
RAG is like an orchestra - we have all our instruments ready but now comes the art of conducting them in harmony. While each technique brings its own strengths, we need to fine-tune our search configurations to strike the perfect balance between lightning-fast responses and pinpoint accuracy, without letting our model become too rigid or too vague in its outputs.
impl App {
    // code omitted

    // A function to run the `relevance` pass
    async fn find_relevant(
        &self,
        qry: &[String],
        cutoff: f32,
        res: &[StoreDataRepr<'_>],
        window: &Window,
    ) -> Result<Vec<usize>> {
        // A relevance cutoff greater than `0` means we have activated the flow
        if cutoff == 0. {
            return Ok(Vec::new());
        }

        Self::send_event(
            window,
            OpResult::Status(StatusData {
                head: "Starting `Relevance` and Re-ranking pass".to_string(),
                body: format!("Relevance cutoff: {}", cutoff),
                ..Default::default()
            }),
        )
        .await?;

        let mut gen = self.gen.lock().await;
        let llm = if let Some(gen) = gen.as_mut() {
            gen
        } else {
            return Err(anyhow!("generator not found"));
        };

        let start = Instant::now();

        // Sometimes the LLM ends up returning duplicates, this is to clean them out
        let mut unq = HashSet::new();

        // If we send ALL our response, we'll probably run out of context length
        // So, let's chunk this
        let mut relevant = res
            .chunks(8)
            .filter_map(|c| {
                let batched = c.par_iter().map(|k| (k.0, k.2.clone())).collect::<Vec<_>>();
                llm.find_relevant(qry, &batched).ok()
            })
            .flatten()
            .filter(|(idx, score)| {
                if unq.contains(idx) || *score < cutoff {
                    false
                } else {
                    unq.insert(*idx);
                    true
                }
            })
            .collect::<Vec<_>>();

        relevant.par_sort_by(|a, b| b.1.total_cmp(&a.1));
        // println!("Relevant: {relevant:?}");

        Self::send_event(
            window,
            OpResult::Status(StatusData {
                head: format!("Filtered {} relevant results", relevant.len()),
                body: String::new(),
                time_s: Some((Instant::now() - start).as_secs_f32()),
                ..Default::default()
            }),
        )
        .await?;

        Ok(relevant.iter().map(|(idx, _)| *idx).collect::<Vec<_>>())
    }
}
Great, so that will return a set of filtered and re-ranked results if the relevance cutoff is > 0.0.
Now it’s time to put together the function to get k adjacent chunks for context enrichment.
impl App {
    // code omitted ..

    // Returns k_adjacent text if it's > 0
    async fn k_adjacent(
        &self,
        k_adjacent: usize,
        data: &[StoreDataRepr<'_>],
        window: &Window,
    ) -> Result<Vec<(usize, String)>> {
        if k_adjacent == 0 {
            return Ok(data
                .iter()
                .map(|(idx, _, txt, _)| (*idx, txt.to_owned()))
                .collect::<Vec<_>>());
        }

        Self::send_event(
            window,
            OpResult::Status(StatusData {
                head: "Context enhancement: Expanding search context".to_string(),
                body: format!("<i>K</i> Adjacent: {}", k_adjacent),
                ..Default::default()
            }),
        )
        .await?;

        let store = self.store.read().await;
        let start = Instant::now();

        let enhanced = data
            .iter()
            .filter_map(|(idx, _, _, _)| {
                let a = store.with_k_adjacent(*idx, k_adjacent).ok()?;
                let txt = [a.0.join("\n").as_str(), &a.1, a.2.join("\n").as_str()].join("\n\n");

                Some((*idx, txt))
            })
            .collect::<Vec<_>>();

        Self::send_event(
            window,
            OpResult::Status(StatusData {
                head: format!("Context enhanced with {k_adjacent} adjacent"),
                body: String::new(),
                time_s: Some((Instant::now() - start).as_secs_f32()),
                ..Default::default()
            }),
        )
        .await?;

        Ok(enhanced)
    }
}
Finally we’ll generate the final context for our search and summarize parts of the context that cross a certain threshold.
Summarization Threshold
So we have a total context length of 4096 tokens - I’ll leave 1/4th of that for the final answer which leaves us with 3072 tokens for our context (including the system prompt). Now, our default prompt without any additional data takes ~250 tokens, which means we are left with around 2800 tokens for our context.
So, we’ll define our threshold as follows:
while total context > 2800:
Summarize largest chunk
Let’s put together a helper function to calculate tokens for the current context.
impl App {
    // code omitted

    // Computes `token` count related information of given text
    // returns: (total_tokens, max_tokens, max_token_idx)
    async fn compute_tokens(&self, data: &[(usize, String)]) -> Result<(usize, usize, usize)> {
        let mut g = self.gen.lock().await;
        let gen = if let Some(g) = g.as_mut() {
            g
        } else {
            return Err(anyhow!("Generator not ready!"));
        };

        // Total size of encoded tokens
        let mut total_tokens = 0;
        // Chunk with max tokens
        let mut max_token_idx = 0;
        let mut max_tokens = 0;

        data.iter().enumerate().for_each(|(i, (_, txt))| {
            let tokenized = gen.tokenize(txt).unwrap().len();
            total_tokens += tokenized;
            if tokenized > max_tokens {
                max_token_idx = i;
                max_tokens = tokenized;
            }
        });

        Ok((total_tokens, max_tokens, max_token_idx))
    }
}
Then, the function to create the context for us - it uses the method App::compute_tokens( .. ) to calculate the total context size, decide whether a summarization( .. ) pass is required, and execute it.
impl App {
    // code omitted

    // So we have a total context length of *4096* tokens
    // leave 1/4th of that for the final *answer* which leaves us with *3072* tokens for our *context* (including the system prompt).
    // Now, our default prompt without any additional data takes *~250* tokens, which means we are left with around *2800* tokens for our context.
    // So, we'll define our `threshold` as follows:
    // ```
    // while total context > 2500:
    //     Summarize largest chunk
    // ```
    async fn create_context(
        &self,
        qry: &str,
        data: &mut [(usize, String)],
        window: &Window,
    ) -> Result<String> {
        Self::send_event(
            window,
            OpResult::Status(StatusData {
                head: "Context generation:".to_string(),
                body: "Generating final context".to_string(),
                ..Default::default()
            }),
        )
        .await?;

        let start = Instant::now();

        let (mut total_tokens, _, mut max_token_idx) = self.compute_tokens(data).await?;

        // Tracking the number of summaries generated
        let mut iter = 0;

        while total_tokens > Self::MAX_CTX_TOK {
            // Break if we have visited at-least data.len() of summaries
            // Nothing more can be done with this
            iter += 1;
            println!("Pre loop[{iter}]: {total_tokens} {max_token_idx}");
            if iter > data.len() {
                break;
            }

            // This scope is required because the `.lock()` will block and next iterations of tokens will not be computed
            {
                // We need to run a summarization pass for max tokens
                let mut g = self.gen.lock().await;
                let gen = if let Some(g) = g.as_mut() {
                    g
                } else {
                    return Err(anyhow!("Generator not ready!"));
                };

                Self::send_event(
                    window,
                    OpResult::Status(StatusData {
                        head: "Context generation: Summarizing a datapoint".to_string(),
                        body: "Generating summary of a text chunk to fit it in context!".to_string(),
                        ..Default::default()
                    }),
                )
                .await?;

                let summarystart = Instant::now();
                let summary = gen.summarize(qry, &data.get(max_token_idx).unwrap().1)?;
                data[max_token_idx] = (
                    data[max_token_idx].0,
                    format!("## {}\n{}", summary.heading(), summary.summary()),
                );

                Self::send_event(
                    window,
                    OpResult::Status(StatusData {
                        head: "Context generation: Datapoint summarized".to_string(),
                        time_s: Some((Instant::now() - summarystart).as_secs_f32()),
                        ..Default::default()
                    }),
                )
                .await?;
            }

            (total_tokens, _, max_token_idx) = self.compute_tokens(data).await?;
            println!("In loop[{iter}]: {total_tokens} {max_token_idx}");
        }

        println!("Beginning context generation!");
        let ctx = data
            .iter()
            .map(|(idx, txt)| format!("Source: {idx}\n{}\n-------------\n", txt.trim()))
            .collect::<Vec<_>>()
            .join("")
            .trim()
            .to_string();

        Self::send_event(
            window,
            OpResult::Status(StatusData {
                head: "Context generated".to_string(),
                time_s: Some((Instant::now() - start).as_secs_f32()),
                ..Default::default()
            }),
        )
        .await?;

        Ok(ctx)
    }
}
Well, guess we have everything we need to generate the answer! Time for the final method App::search( .. ) to glue all of these together and run our search flow.
impl App {
    // code omitted

    // Trigger the search flow - the search pipeline
    async fn search(&self, qry: &str, cfg: &SearchConfig, res_send: &Window) -> Result<()> {
        let mut final_result = SearchResult {
            qry: qry.to_string(),
            ..Default::default()
        };

        if let Err(e) = self.ensure_generator(res_send).await {
            println!("App::search: error while loading LLaMA: {e:?}");
            Self::send_event(
                res_send,
                OpResult::Error("Error Loading Generator".to_string()),
            )
            .await?;
            return Err(anyhow!("Error Loading Generator"));
        }

        let search_start = Instant::now();

        // Step 1: query preprocessing
        let (qry_more, q_txt, q_tensor) =
            match self.query_preproc(qry, cfg.n_sub_qry, res_send).await {
                Ok(r) => r,
                Err(e) => {
                    println!("App::search: error during sub query decomposition: {e:?}");
                    Self::send_event(
                        res_send,
                        OpResult::Error("Error during subquery decomposition".to_string()),
                    )
                    .await?;
                    return Err(anyhow!("Error during subquery decomposition"));
                }
            };

        // Step 2: Approximate nearest neighbor search
        Self::send_event(
            res_send,
            OpResult::Status(StatusData {
                head: "Firing Approx. Nearest Neighbor search".to_string(),
                body: format!(
                    "<b>BM25:</b> {} | <b>ANN Cutoff:</b> {}",
                    cfg.with_bm25,
                    cfg.ann_cutoff.map_or(0., |c| c)
                ),
                ..Default::default()
            }),
        )
        .await?;

        let store = self.store.read().await;
        let (res, elapsed) = {
            let start = Instant::now();
            let res = store.search(
                &q_tensor,
                &[qry_more.topic().to_string()],
                cfg.max_result,
                cfg.ann_cutoff,
                cfg.with_bm25,
            )?;

            (res, (Instant::now() - start).as_secs_f32())
        };

        Self::send_event(
            res_send,
            OpResult::Status(StatusData {
                head: format!("ANN Search yielded {} results", res.len()),
                body: String::new(),
                time_s: Some(elapsed),
                ..Default::default()
            }),
        )
        .await?;

        // Keep initial findings, if the search errors out
        let mut res_map = HashMap::new();
        res.iter().for_each(|r| {
            res_map.insert(r.0, r.to_owned());
        });

        // Step 3: Check for relevance and re-rank
        let relevant = match self
            .find_relevant(&q_txt, cfg.relevance_cutoff, &res[..], res_send)
            .await
        {
            Ok(r) => {
                if r.is_empty() {
                    res
                } else {
                    r.iter()
                        .filter_map(|idx| {
                            let dp = res_map.get(idx)?;
                            Some(dp.to_owned())
                        })
                        .collect::<Vec<_>>()
                }
            }
            Err(e) => {
                println!("App::search: error during relevance filtering: {e:?}");
                Self::send_event(
                    res_send,
                    OpResult::Error("Error during relevance filtering".to_string()),
                )
                .await?;
                return Err(anyhow!("Error during relevance filtering"));
            }
        };

        // Step 4: context augmentation - get adjacent data
        let mut enhanced = match self
            .k_adjacent(cfg.k_adjacent, &relevant[..], res_send)
            .await
        {
            Ok(e) => e,
            Err(e) => {
                println!(
                    "App::search: error during fetching of {} adjacent: {e:?}",
                    cfg.k_adjacent
                );
                Self::send_event(
                    res_send,
                    OpResult::Error("Error during context enhancement".to_string()),
                )
                .await?;
                return Err(anyhow!("Error during context enhancement"));
            }
        };

        // We have enhanced context now, let's summarize the context if needed
        let qry_str = q_txt.join("\n");
        let ctx = match self
            .create_context(&qry_str, &mut enhanced[..], res_send)
            .await
        {
            Ok(c) => c,
            Err(e) => {
                println!("App::search: generating context: {e:?}");
                Self::send_event(
                    res_send,
                    OpResult::Error("Error generating context".to_string()),
                )
                .await?;
                return Err(anyhow!("Error during context generation"));
            }
        };

        if ctx.is_empty() && !cfg.allow_without_evidence {
            return Self::send_event(res_send, OpResult::Error("Nothing found!".to_string())).await;
        }

        // Step 5: Finally the answer
        Self::send_event(
            res_send,
            OpResult::Status(StatusData {
                head: "Generating answer!".to_string(),
                body: String::new(),
                ..Default::default()
            }),
        )
        .await?;

        let (ans, elapsed) = {
            let mut gen = self.gen.lock().await;
            let llm = if let Some(gen) = gen.as_mut() {
                gen
            } else {
                return Err(anyhow!("generator not found"));
            };

            let start = Instant::now();
            let answer = llm.answer(qry_more.topic(), qry_more.source(), &ctx)?;

            (answer, (Instant::now() - start).as_secs_f32())
        };

        Self::send_event(
            res_send,
            OpResult::Status(StatusData {
                head: "Finally, generated answer!".to_string(),
                body: String::new(),
                time_s: Some(elapsed),
                ..Default::default()
            }),
        )
        .await?;

        final_result.answer = ans.answer().to_string();

        if ctx.is_empty() {
            final_result.files = Vec::new();
            final_result.evidence = Vec::new();
        } else {
            let mut file_list = HashSet::new();
            final_result.evidence = ans
                .evidence()
                .iter()
                .filter_map(|e| {
                    let evidence = res_map.get(&e.index())?.1.file();
                    let (file, page) = match evidence {
                        FileKind::Pdf((pth, pg)) => {
                            file_list.insert(pth.to_owned());
                            (pth.to_str()?.to_string(), Some(*pg))
                        }
                        FileKind::Text(pth) => {
                            file_list.insert(pth.to_owned());
                            (pth.to_str()?.to_string(), None)
                        }
                        FileKind::Html(pth) => {
                            file_list.insert(pth.to_owned());
                            (pth.to_str()?.to_string(), None)
                        }
                    };

                    Some(Evidence {
                        text: e.text().to_string(),
                        file,
                        page,
                    })
                })
                .collect::<Vec<_>>();

            final_result.files = file_list
                .iter()
                .filter_map(|f| f.to_str().map(|s| s.to_string()))
                .collect::<Vec<_>>();
        }

        final_result.elapsed = (Instant::now() - search_start).as_secs_f32();

        Self::send_event(res_send, OpResult::Result(final_result)).await?;

        Ok(())
    }
}
We are just calling the different methods we created for the various techniques; the last part is simply figuring out the source files and including them as part of the result!
Time to see what came out of our reasonably involved effort! Let’s run our app.
cargo tauri dev --release
Note
I’m leaving the client side implementation up to you!
This is in no way close to being foolproof; in fact, I have added a bunch of negative cases in the video where our QA misfires! Here are some quick observations from this experience:
Speed: putting all these techniques together often leads to unusably slow responses! We could work on a way of dynamically deciding which techniques to apply.
Accuracy: relying on LLMs for all of this is still prone to hallucinations. E.g. in our example I asked the model about Mozilla Firefox; I know the current dataset doesn’t contain anything about Firefox, and the model correctly pointed out that there are no references to Firefox in the context, but it still sent across some random text as evidence!! Playing around with prompts - popularly referred to as Prompt Engineering - might give us better and more deterministic results!
BM25 used for Fusion Retrieval works great when we are dealing with uncommon words! If you search for something common like news (even though we are removing some stop-words), the results go for a toss. We could decide whether to use BM25 based on the incoming query instead of a pre-set configuration, and possibly ask the LLM to generate the right keywords for us!
The JSON responses from the LLM in the current implementation are prone to breaking! Something like a BNF grammar is required to get this right - see the sketch after this list for a simpler stopgap.
Referencing the source of information needs some work; our current implementation is rather naive!
We could create a conversational version of our QA!
The Store is not mature enough to handle deletions and re-insertions - you could build on that!
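Until something like grammar-constrained decoding is wired in, one stopgap for the brittle JSON handling noted above could be to pull the first balanced JSON object out of the raw model output before handing it to serde_json. A sketch - extract_json_object is a hypothetical helper, not part of the current codebase:

// A stopgap sketch, not a substitute for a proper grammar: extract the first
// balanced {...} object from the model's raw output, skipping braces that
// appear inside string literals.
fn extract_json_object(raw: &str) -> Option<&str> {
    let start = raw.find('{')?;
    let mut depth = 0usize;
    let mut in_str = false;
    let mut escaped = false;

    for (i, c) in raw[start..].char_indices() {
        if in_str {
            match c {
                '\\' if !escaped => escaped = true,
                '"' if !escaped => in_str = false,
                _ => escaped = false,
            }
            continue;
        }
        match c {
            '"' => in_str = true,
            '{' => depth += 1,
            '}' => {
                depth -= 1;
                if depth == 0 {
                    // '}' is ASCII, so this byte range ends exactly at the closing brace
                    return Some(&raw[start..=start + i]);
                }
            }
            _ => {}
        }
    }

    None
}

You’d then have the model emit the full object (rather than pre-seeding the opening of the JSON in the prompt) and deserialize whatever this returns.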
I’ll leave some of these implementations up to you!
Before we close
A word of caution before we close: this software is NOT READY FOR PRODUCTION; it has scope for tons of improvements and corner cases still to be figured out!
Whether you’re excited to build, found a bug to squash, or just want to geek out about the possibilities - I’d love to hear from you! Drop me a line @beingAnubhab. And if this guide sparked your creativity, why not fork it and add your own magic? Your contributions could help others unlock even more potential. If you found this helpful, a quick share could inspire more developers and me. Let’s keep building amazing things together! 🚀