Part 3: Desktop App for Document QA with RAG - Document Layout Analysis, Text Extraction and Generation
A DIY style step-by-step guide to building your own cutting-edge GenAI-powered document QA desktop app with RAG. In Part 3, we focus on a Document Layout Analysis and extraction pipeline.
August 24, 2024 · 18 min · 3777 words
In our previous posts of this series, we set up embedding generation and storage using the Stella_en_1.5B_v5 model and created a mini Vector Store inspired by Spotify’s ANNOY. Now, we’ll look at extracting text from PDFs and using LLaMA3 for text generation.
Extracting text from files poses another set of unique challenges, especially if .pdf is involved! The horrors of .pdf processing have left deep scars on all of us who have attempted to work with PDF files. Here, we’ll make a shallow attempt at layout detection and text extraction from .pdf files using pre-trained models by the folks at Unstructured-IO.
Text extraction from .txt files is straightforward: just read the file. For .pdf, the problem can largely be broken down into the following steps:
Layout Analysis: Understanding the structure of each page of a .pdf. This involves converting the pages of a pdf file to images and running a layout detection model on them to deduce the regions of interest.
Data extraction: Extracting text or other objects like tables, images etc. from the regions of interest. For now, we’ll skip tables and images. That’s another beast!
OCR (if required): Image-based PDF files would require us to use an OCR tool to extract text. I’ve used tesseract or ghostscript + tesseract before. To keep stuff simple-ish, let’s work only with text-based PDF files.
Attempting to run the Detectron2-based ONNX model with candle-onnx hit a snag: not all ONNX ops are supported by candle-onnx yet!
While I’ve used the tract ONNX runtime before, for this project I’ll lean on the ort ONNX runtime.
Let’s add the ort and image crates to our dependencies. While we’re at it, we’ll also add the pdfium-render crate, a thin wrapper around google/pdfium, a library for rendering and working with .pdf files.
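If you use cargo add, something like the following should do (run inside src-tauri). Note that the Session API used below comes from the ort 2.0 release candidates, so the exact version to pin here is an assumption - pick whichever rc matches your setup:

cd src-tauri
cargo add ort@2.0.0-rc.2
cargo add image pdfium-render
cd ..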
Note on `pdfium-render`
Setting up pdfium-render is a bit more involved than just adding it to our Cargo.toml. We’ll tackle this soon!
Let’s code up layout detection in a new file src-tauri/src/layout.rs. We’ll create a struct RegionOfInterest to hold areas of the document that we’ll work on and an enum DetectedElem to map the model’s predicted classes.
// Imports omitted
// Copied from: https://github.com/styrowolf/layoutparser-ort/blob/master/src/utils.rs
/// Utility function to convert a bbox Vec to an array
fn vec_to_bbox<T: Copy>(v: Vec<T>) -> [T; 4] {
    [v[0], v[1], v[2], v[3]]
}

// An enum to represent the classes of regions of interest
// detected by the `layout detection` model
#[derive(Debug, PartialEq, Eq, Copy, Clone)]
pub enum DetectedElem {
    Text,
    Title,
    List,
    Table,
    Figure,
}

impl Display for DetectedElem {
    fn fmt(&self, f: &mut std::fmt::Formatter<'_>) -> std::fmt::Result {
        write!(
            f,
            "{}",
            match self {
                Self::Text => "Text",
                Self::Title => "Title",
                Self::List => "List",
                Self::Table => "Table",
                Self::Figure => "Figure",
            }
        )
    }
}

/// This struct represents a region of interest
#[derive(Debug)]
pub struct RegionOfInterest {
    kind: DetectedElem,
    // the bounding box - x1, y1, x2, y2 - left, top, right, bottom
    bbox: [f32; 4],
    // confidence: f32,
}

impl RegionOfInterest {
    pub fn kind(&self) -> DetectedElem {
        self.kind
    }

    pub fn bbox(&self) -> [f32; 4] {
        self.bbox
    }
}
Now, we’ll need a struct for the Detectron2-based Layout Analysis model.
// Code omitted ..

/// A [`Detectron2`](https://github.com/facebookresearch/detectron2)-based model.
pub struct Detectron2Model {
    model: Session,
    label_map: [DetectedElem; 5],
}

impl Detectron2Model {
    /// Required input image width.
    pub const REQUIRED_WIDTH: usize = 800;
    /// Required input image height.
    pub const REQUIRED_HEIGHT: usize = 1035;
    /// Default confidence threshold for detections.
    pub const DEFAULT_CONFIDENCE_THRESHOLD: f32 = 0.85;

    pub fn new() -> Result<Self> {
        // Loading and initializing the model from the `onnx` file
        let model = Session::builder()?
            .with_optimization_level(GraphOptimizationLevel::Level3)?
            // We could make this a little more generic with the `num_cpus` crate
            .with_intra_threads(8)?
            .commit_from_file("../models/layout.onnx")?;
        // You could print the model outputs to figure out which prediction datapoints are useful
        // println!("{:?}", model.outputs);

        Ok(Self {
            model,
            label_map: [
                DetectedElem::Text,
                DetectedElem::Title,
                DetectedElem::List,
                DetectedElem::Table,
                DetectedElem::Figure,
            ],
        })
    }

    pub fn predict(&self, page: &image::DynamicImage) -> Result<Vec<RegionOfInterest>> {
        let (img_width, img_height, input) = self.preprocess(page)?;
        let res = self.model.run(ort::inputs!["x.1" => input]?)?;
        self.postprocess(res, img_width, img_height)
    }

    // 1. Resizes an image to the required dimensions
    // 2. Creates a tensor from the image
    // 3. Reshapes the tensor to channel-first format
    // 4. Creates an input for `ort` to consume
    fn preprocess(&self, img: &image::DynamicImage) -> Result<(u32, u32, ort::Value)> {
        // TODO: re-visit this and resize smarter
        let (img_width, img_height) = (img.width(), img.height());
        let img = img.resize_exact(
            Self::REQUIRED_WIDTH as u32,
            Self::REQUIRED_HEIGHT as u32,
            imageops::FilterType::Triangle,
        );
        let img = img.to_rgb8().into_raw();

        // Read the image as a tensor
        let t = Tensor::from_vec(
            img,
            (Self::REQUIRED_HEIGHT, Self::REQUIRED_WIDTH, 3),
            &Device::Cpu,
        )?
        .to_dtype(DType::F32)?
        .permute((2, 0, 1))? // shape: [3, height, width]
        .to_vec3::<f32>()?
        .concat()
        .concat();

        // Create an input for the `ort` runtime to consume
        let input =
            ort::Value::from_array(([3, Self::REQUIRED_HEIGHT, Self::REQUIRED_WIDTH], &t[..]))?;

        Ok((img_width, img_height, input.into()))
    }

    // Reads the predictions and converts them to regions of interest
    fn postprocess(
        &self,
        outputs: SessionOutputs<'_, '_>,
        width: u32,
        height: u32,
    ) -> Result<Vec<RegionOfInterest>> {
        // Extract predictions for bounding boxes, labels and confidence scores
        // Shape: [num pred, 4]
        let bboxes = &outputs[0].try_extract_tensor::<f32>()?;
        // Shape: [num pred]
        let labels = &outputs[1].try_extract_tensor::<i64>()?;
        // 3 for MASK_RCNN_X_101_32X8D_FPN_3x | 2 for FASTER_RCNN_R_50_FPN_3X
        // Shape: [num pred]
        let confidence = &outputs[3].try_extract_tensor::<f32>()?;

        // We had originally `resized` the image to fit the required input dimensions;
        // we are just going to adjust the predictions to factor in the resize
        let width_factor = width as f32 / Self::REQUIRED_WIDTH as f32;
        let height_factor = height as f32 / Self::REQUIRED_HEIGHT as f32;

        // Iterate over (region bounding boxes, predicted classes/labels, confidence scores)
        let mut elements = bboxes
            .rows()
            .into_iter()
            .zip(labels.iter().zip(confidence.iter()))
            .filter_map(|(bbox, (&label, &confidence))| {
                // Skip everything below the confidence score we want to work with
                if confidence < Self::DEFAULT_CONFIDENCE_THRESHOLD {
                    return None;
                }
                // Getting the predicted label from the predicted index
                let label = self.label_map.get(label as usize)?;
                // We don't have any way of interpreting Figure and Table as text,
                // so we'll skip those
                if label == &DetectedElem::Figure || label == &DetectedElem::Table {
                    return None;
                }
                let [x1, y1, x2, y2] = vec_to_bbox(bbox.iter().copied().collect::<Vec<_>>());
                // Adjusting the predicted bounding box to our original image size
                Some(RegionOfInterest {
                    kind: *label,
                    bbox: [
                        x1 * width_factor,
                        y1 * height_factor,
                        x2 * width_factor,
                        y2 * height_factor,
                    ],
                })
            })
            .collect::<Vec<_>>();

        // Now we sort the predictions into a (kind of) visual hierarchy from the top left
        elements.par_sort_unstable_by(|a, b| {
            (a.bbox()[1].max(a.bbox()[3])).total_cmp(&(b.bbox()[1].max(b.bbox()[3])))
        });

        Ok(elements)
    }
}
The comments in the code should suffice as an overview of what’s happening here!
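Before wiring this into the PDF flow, a quick sanity check can be handy. Here’s a minimal, hypothetical helper (the image path is a placeholder) that could live alongside the model in layout.rs:

// Hypothetical sanity check: load a rendered page image from disk and
// print the regions the model detects
fn debug_layout() -> anyhow::Result<()> {
    let model = Detectron2Model::new()?;
    let page = image::open("../test-data/page-1.png")?; // placeholder path
    for roi in model.predict(&page)? {
        let [x1, y1, x2, y2] = roi.bbox();
        println!("{}: ({x1:.1}, {y1:.1}) -> ({x2:.1}, {y2:.1})", roi.kind());
    }
    Ok(())
}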
Pro Tip: Figuring out Model Outputs
Most ONNX models are well documented and we’ll know which outputs to work with. In case yours is NOT, print the model.outputs field.
The 0th output is of shape [-1, 4] - meaning [variable size, 4] - we can assume these are the bounding boxes.
The 1st output is of type int and shape [-1] (variable) - we can be reasonably certain that this is the index of the predicted class.
The 3rd output is of type Float32 and variable size - these would be your predicted confidence scores.
Finally, the 4th output says it’s [2], which would generally indicate that we are looking at a binary classification of some kind. I’m pretty sure that refers to landscape vs portrait classification of the given document image; either way, we have not used the 4th output.
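If the model card tells you nothing at all, a small throwaway function like this sketch can dump the graph’s input and output signatures (import paths may differ slightly between ort versions):

use anyhow::Result;
use ort::{GraphOptimizationLevel, Session};

// Sketch: print the ONNX graph's inputs/outputs so we can map the
// output indices to their meaning
fn inspect_layout_model() -> Result<()> {
    let model = Session::builder()?
        .with_optimization_level(GraphOptimizationLevel::Level3)?
        .commit_from_file("../models/layout.onnx")?;
    println!("inputs : {:#?}", model.inputs);
    println!("outputs: {:#?}", model.outputs);
    Ok(())
}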
For text extraction based on the detected layout, we’ll need to:
Convert each page of a .pdf to an image - the pdfium-render and image crates we added earlier have a role to play here
Predict the layout of the page
From the predicted regions of interest, we’ll retrieve the text
Generate embeddings for the text
`pdfium` Quickstart
The crate pdfium-render provides high-level wrappers on top of the original pdfium C++ library, but it doesn’t ship with the required library files. The author(s) of pdfium-render provide multiple ways of binding to the C++ library; we are going to take the dynamic approach.
Steps to get this up and running:
Create a .cargo directory inside src-tauri and a src-tauri/.cargo/config.toml file with the following content:
Download the .tgz archive for your OS and platform from pdfium-binaries repo for version v6666 and put it inside the directory binaries/pdfium in our project root.
Untar it inside the binaries/pdfium directory:
tar -xvzf <downloaded_file>.tgz
In my case, I had to move binaries/pdfium/lib/libpdfium.dylib up to binaries/pdfium/libpdfium.dylib - macOS wouldn’t allow execution because it can’t verify the developer. Moving it changes the file’s metadata, which is probably why it works!🧐🤯🫨😵💫
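For reference, binding to the dynamic library at runtime could look something like this sketch (the ./binaries/pdfium/ path is an assumption based on the layout above):

use pdfium_render::prelude::*;

// Sketch: bind to the pdfium dynamic library we unpacked, falling back
// to a system-wide install if one exists
fn init_pdfium() -> Result<Pdfium, PdfiumError> {
    let bindings = Pdfium::bind_to_library(
        // Resolves to libpdfium.dylib / libpdfium.so / pdfium.dll for the current platform
        Pdfium::pdfium_platform_library_name_at_path("./binaries/pdfium/"),
    )
    .or_else(|_| Pdfium::bind_to_system_library())?;
    Ok(Pdfium::new(bindings))
}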
With pdfium-render ready, let’s wrap up the .pdf -> layout analysis -> text flow. We create a struct PdfProc that will hold everything we need to analyze and extract data from .pdf files.
src-tauri/src/doc.rs
// Imports and code omitted
pub struct PdfProc {
    pdfs: Vec<PathBuf>,
    layout: Detectron2Model,
    pdfium: Pdfium,
    pdfium_cfg: PdfRenderConfig,
}
Now we expose a bunch of methods for this flow to work; for brevity, I’ll focus on the key methods.
// code omitted ..

impl PdfProc {
    // initializer ..
    pub fn new(model_path: &Path, pdfs: Vec<PathBuf>) -> Result<Self> {
        // Straightforward initialization of the different fields of the struct
        Ok(Self { ... })
    }

    /// Returns the total number of pages to be analyzed
    pub fn estimate(&self) -> usize {
        // this can be used to show some progress in the real world
    }

    /// Extract text from `pdfs`
    pub fn extract(&self, send: Sender<ExtractorEvt>) -> Result<Vec<Vec<(String, FileKind)>>> {
        // for each `.pdf` file we are going to convert the pages to images
        let file_encoded = self
            .pdfs
            .iter()
            .filter_map(|file| {
                let pdf = self.pdfium.load_pdf_from_file(&file, None).ok()?;
                self.process_pages(file, pdf, send.clone())
            })
            .collect::<Vec<_>>();

        Ok(file_encoded)
    }

    /// Processes each page:
    /// - renders the page with the rendering config
    /// - runs layout detection
    /// - reads text from the bounding boxes detected by the model
    pub fn process_pages(
        &self,
        file: &PathBuf,
        doc: PdfDocument<'_>,
        send: Sender<ExtractorEvt>,
    ) -> Option<Vec<(String, FileKind)>> {
        Some(
            doc.pages()
                .iter()
                .enumerate()
                .filter_map(|(idx, page)| {
                    // Render this page to an image::DynamicImage
                    let img = page.render_with_config(&self.pdfium_cfg).ok()?.as_image();

                    // Keep track of the factors by which the page and its rendered image differ in size;
                    // this is required to map the predicted regions of interest back to the page
                    let w_f = page.width().value / img.width() as f32;
                    let h_f = page.height().value / img.height() as f32;
                    let pg_num = idx + 1;

                    // Send the image for prediction
                    // and for each predicted `region of interest`
                    // fetch the text inside the bounding box
                    let text = self
                        .layout
                        .predict(&img)
                        .ok()?
                        .iter()
                        .filter_map(|e| {
                            // The bounding box for the region of interest
                            let bbox = e.bbox(); // x1, y1, x2, y2
                            // The bounding boxes for the predicted regions follow a `top-left`
                            // coordinate system, but `pdfium` uses a bottom-left coordinate system;
                            // let's convert it. We'll also factor in the original page size here
                            let top = page.height().value - bbox[1] * h_f + PADDING;
                            let bottom = page.height().value - bbox[3] * h_f - PADDING;
                            let left = bbox[0] * w_f - PADDING;
                            let right = bbox[2] * w_f + PADDING;

                            // Now that we have `pdfium`-compatible bounding boxes, let's fetch the text
                            let text = page
                                .text()
                                .ok()?
                                .inside_rect(PdfRect::new_from_values(bottom, left, top, right))
                                .replace("\t", " ")
                                .replace("\r\n", "\n");

                            Some(match e.kind() {
                                // We are using `MarkdownSplitter` for our text splitting task;
                                // here we add `##` to mark the generated text as a title
                                DetectedElem::Title => {
                                    format!("## {}\n", text.replace("\n", "; "))
                                }
                                // Rest of the text remains as is
                                DetectedElem::Text | DetectedElem::List => text,
                                _ => unimplemented!(),
                            })
                        })
                        .collect::<Vec<_>>()
                        .join("\n");

                    if let Err(e) = send.send(ExtractorEvt::Page) {
                        eprintln!("Warn: error sending page event: {e:?}");
                    }

                    Some((text, FileKind::Pdf((file.to_owned(), pg_num))))
                })
                .collect::<Vec<_>>(),
        )
    }
}
Notice that some methods accept a std::sync::mpsc::Sender. We’ll elaborate on this later, but the idea here is to emit execution status so we can show progress on the client side.
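To make the flow concrete, here’s a rough sketch of how PdfProc could be driven end to end (the paths are placeholders, and this assumes it sits next to PdfProc in doc.rs):

use std::{
    path::{Path, PathBuf},
    sync::mpsc,
    thread,
};

// Sketch: run extraction on one file while a listener thread reports progress
fn run_extraction() -> anyhow::Result<()> {
    let extractor = PdfProc::new(
        Path::new("../models"),
        vec![PathBuf::from("../test-data/sample.pdf")], // placeholder
    )?;

    let total = extractor.estimate();
    let (send, recv) = mpsc::channel();

    // Listen for page events while extraction runs
    let progress = thread::spawn(move || {
        let mut done = 0;
        while recv.recv().is_ok() {
            done += 1;
            println!("processed {done}/{total} page event(s)");
        }
    });

    // `send` is moved in here and dropped once extraction finishes,
    // which ends the listener loop above
    let extracted = extractor.extract(send)?;
    progress.join().ok();

    println!("extracted text from {} file(s)", extracted.len());
    Ok(())
}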
For reference, this is the first page of the .pdf we are using for this test:
cd src-tauri
cargo test extract_from_pdf --release -- --nocapture
cd ..
The results are very accurate!
Pro Tip
In real-world data, the Layout Detection predictions will not always be clean!
You’ll need to play around with the model hyperparameters, different models, padding for text extraction, etc. to narrow down on a reasonable output that works Most of the Time.
So far, we have tackled embedding generation with the Stella_en_1.5B_v5 model and Document Layout Analysis with a Detectron2-based Mask R-CNN model; we have extracted text from .pdf files and learnt how to split text into semantic chunks.
To conclude Part 3 of this series, let’s lay the foundation for text generation using LLaMA3. I’m using a LLaMA 3.1 series model, but you can choose anything you prefer based on your hardware.
Quickstart with LLaMA3.1
The LLaMA 3.x series requires you to accept the Meta LLaMA License - follow this blog post to know more about LLaMA 3.1 and how to acquire the LLaMA models.
Once you have access to the models on HuggingFace, download the weight (.safetensors) files, tokenizer.json and config.json to our models directory.
Pro tip!
The proper way of fetching model weights and related files would be to accept the token string provided by Meta and use it to dynamically download the files from HuggingFace Hub on application init. I’ll leave that implementation up to you!
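For reference, a rough sketch of that approach with the hf-hub crate might look like the following; the repo id, file names and token handling are assumptions, so adapt them to the repo you actually have access to:

use std::path::PathBuf;

use anyhow::Result;
use hf_hub::api::sync::ApiBuilder;

// Sketch: fetch the tokenizer and config at runtime with an HF access token;
// the weight shards listed on the model card can be fetched the same way
fn fetch_llama_files(hf_token: String) -> Result<Vec<PathBuf>> {
    let api = ApiBuilder::new().with_token(Some(hf_token)).build()?;
    let repo = api.model("meta-llama/Meta-Llama-3.1-8B-Instruct".to_string());

    let mut files = Vec::new();
    for name in ["tokenizer.json", "config.json"] {
        files.push(repo.get(name)?);
    }
    Ok(files)
}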
// Imports omitted ..

// Sampling constants
const TEMPERATURE: f64 = 0.8;
const TOP_P: f64 = 0.95;
const TOP_K: usize = 32;

/// A struct to maintain an initialized LLaMA model (loaded from `safetensors`)
/// and its associated methods
pub struct Generator {
    cfg: Config,
    device: Device,
    model: Llama,
    tokenizer: Tokenizer,
    sampler: LogitsProcessor,
    stop_tokens: [u32; 2],
}

impl Generator {
    // Download the model `safetensors` files into your project dir's `models` folder
    // I'm using LLaMA3.1 8B Instruct, you can use whatever you want
    const MODEL_FILES: [&'static str; 2] = [
        "model-00001-of-00002.safetensors",
        "model-00002-of-00002.safetensors",
    ];
    const TOKENIZER_FILE: &'static str = "llama_tokenizer.json";
    const MODEL_CONFIG_FILE: &'static str = "llama_config.json";

    /// Initializer for a new LLaMA manager
    pub fn new(dir: &Path, device: &Device) -> Result<Self> {
        let mut device = device.to_owned();
        if let candle_core::Device::Metal(mut m) = device {
            m.set_use_mlx_mm(false);
            device = Device::Metal(m);
        }

        let (model, mut cfg, tokenizer) = Self::load_model(dir, &device)?;
        let stop_tokens = [
            tokenizer.token_to_id("<|eot_id|>").unwrap(),
            tokenizer.token_to_id("<|end_of_text|>").unwrap(),
        ];
        cfg.max_position_embeddings = 4096;

        // Initializing the sampler
        let sampler = LogitsProcessor::from_sampling(
            42,
            Sampling::TopKThenTopP {
                k: TOP_K,
                p: TOP_P,
                temperature: TEMPERATURE,
            },
        );

        println!("Llama ready!");

        Ok(Self {
            cfg,
            device: device.clone(),
            model,
            tokenizer,
            sampler,
            stop_tokens,
        })
    }

    // A utility function to load the model and tokenizer
    fn load_model(model_dir: &Path, device: &Device) -> Result<(Llama, Config, Tokenizer)> {
        let tok_file = model_dir.join(Self::TOKENIZER_FILE);
        let cfg_file = model_dir.join(Self::MODEL_CONFIG_FILE);
        let model_files = Self::MODEL_FILES
            .iter()
            .map(|mf| model_dir.join(mf))
            .collect::<Vec<_>>();

        println!("Loading LLaMA ..");
        let start = Instant::now();

        let cfg = serde_json::from_slice::<LlamaConfig>(&std::fs::read(&cfg_file)?)?
            .into_config(false);
        let vb = unsafe { VarBuilder::from_mmaped_safetensors(&model_files, DType::BF16, device)? };
        let llama = Llama::load(vb, &cfg)?;
        println!("LLaMA loaded in {}s", (Instant::now() - start).as_secs());

        let tokenizer = Tokenizer::from_file(tok_file).unwrap();

        Ok((llama, cfg, tokenizer))
    }

    // Utility function to run the generation loop
    fn generate(&mut self, prompt: &str) -> Result<String> {
        // Tokenize the input
        let input = self
            .tokenizer
            .encode(prompt, true)
            .map_err(|e| anyhow!(e))?;
        if input.len() >= self.cfg.max_position_embeddings {
            return Err(anyhow!("large input tokens!"));
        }

        let mut cache = Cache::new(true, DType::BF16, &self.cfg, &self.device)?;

        // Creating a tensor of input tokens
        let mut ip = Tensor::new(input.get_ids(), &self.device)?.unsqueeze(0)?;
        let mut start = std::time::Instant::now();

        // The forward pass for the first token
        let mut logits = self.model.forward(&ip, 0, &mut cache)?;
        // Sampling the first token
        let mut next = self.sampler.sample(&logits.squeeze(0)?)?;
        println!(
            "{} prompt tokens processed @ {}t/s",
            input.len(),
            input.len() as f32 / (std::time::Instant::now() - start).as_secs() as f32
        );

        // A container for all generated tokens
        let mut all_tokens = vec![next];
        start = std::time::Instant::now();

        // Forward pass - decoder loop
        for i in input.len()..self.cfg.max_position_embeddings {
            ip = Tensor::new(&[next], &self.device)?.unsqueeze(0)?;
            logits = self.model.forward(&ip, i, &mut cache)?;
            next = self.sampler.sample(&logits.squeeze(0)?).unwrap();
            if self.stop_tokens.contains(&next) {
                break;
            }
            all_tokens.push(next);
        }
        println!(
            "{} tokens generated @ {}t/s",
            all_tokens.len() - 1,
            (all_tokens.len() - 1) as f32 / (std::time::Instant::now() - start).as_secs_f32()
        );

        // Decode the tokens and return the result
        Ok(match self.tokenizer.decode(&all_tokens[..], false) {
            Ok(t) => t,
            Err(e) => {
                eprintln!("Error generating tokens: {e:?}");
                anyhow::bail!("Error generating tokens")
            }
        })
    }
}
With that, the foundation for the G of our RAG is ready.
Write a test!
Write a test for the Generator. The new( .. ) function accepts the path to your models directory and a candle Device while the method generate( .. ) accepts a prompt string.
If you are new to text generation, read up on prompt templates and try to create the prompt string based on the LLaMA3 prompt template, as sketched below.
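A minimal sketch of such a test, assuming it lives in the same module as Generator (generate( .. ) isn’t public) and that the model files sit in ../models:

#[cfg(test)]
mod tests {
    use super::*;
    use std::path::Path;

    #[test]
    fn generate_from_prompt() -> anyhow::Result<()> {
        // Placeholder path/device - adjust for your setup
        let mut llm = Generator::new(Path::new("../models"), &Device::Cpu)?;

        // LLaMA3-style prompt template: a system and a user turn,
        // then hand over to the assistant
        let prompt = concat!(
            "<|begin_of_text|><|start_header_id|>system<|end_header_id|>\n\n",
            "You are a helpful assistant.<|eot_id|>",
            "<|start_header_id|>user<|end_header_id|>\n\n",
            "What is Retrieval Augmented Generation?<|eot_id|>",
            "<|start_header_id|>assistant<|end_header_id|>\n\n"
        );

        let answer = llm.generate(prompt)?;
        println!("{answer}");
        assert!(!answer.is_empty());
        Ok(())
    }
}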
Now we have all of the different components ready for our RAG to work: Retrieval with our embeddings and Vector Store, and Generation with LLaMA. In the next post, Part 4, we’ll tie these independent blocks together.