Good pre-trained models for Document Answering tasks?

Hi everyone! I hope you guys are doing well. I’m a complete beginner in the AI field and I have only completed a few projects based on the YouTube videos that I’ve watched. I’m looking to use pre-trained models for a Document Answering task. But I’m unsure which models to use, as there are many good models out there.

Can you guys suggest models that I can look into, and maybe past projects? I have a PDF that’s 80 pages long, with tables, numbers, currency and more. I’m looking to use this PDF file as the data and let the user ask any questions related to the file. I understand RAG is a thing now and I wonder if I should look into RAG as well.

Thank you.

Hello @steve01-1

You have to create embeddings of your documents. You can embed the text in chunks as long as the max sequence length of your model, or sentence by sentence.
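A minimal sketch of both chunking options, in plain Python. The function names, the regex-based sentence splitter, and the `max_words` / `overlap` parameters are assumptions for illustration. Note that the model's limit is counted in tokens, so a word count is only a rough proxy for it.

```python
import re


def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' when followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]


def chunk_by_words(text, max_words=200, overlap=20):
    """Cut text into windows of at most max_words words.

    Consecutive windows share `overlap` words, so a fact that straddles
    a boundary still appears whole in at least one chunk.
    """
    if not 0 <= overlap < max_words:
        raise ValueError("overlap must be >= 0 and smaller than max_words")
    words = text.split()
    step = max_words - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + max_words]))
        if start + max_words >= len(words):
            break  # the last window already reached the end of the text
    return chunks
```

For a PDF with many tables, it may be worth keeping each table (or table row) together in one chunk rather than cutting it at a fixed word count.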

To answer your question, it depends on your task and the PDF content. A good starting point is a BERT model with mean pooling. Embed your query and your PDF chunks, then compute the cosine similarity between them and retrieve the top-k chunks as the context for your query. “all-mpnet-base-v1” is also an option.
You can also find good models to start with on the “Pretrained Models” page of the Sentence-Transformers documentation.
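As a rough sketch of the retrieval step: the cosine similarity and top-k helpers below are plain Python, and `embed` is a hypothetical wrapper that assumes the `sentence-transformers` package is installed and loads “all-mpnet-base-v1” through `SentenceTransformer(...).encode(...)`. All function names here are illustrative, not part of any library.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # define similarity with a zero vector as 0
    return dot / (norm_a * norm_b)


def top_k(query_vec, chunk_vecs, k=3):
    """Return (chunk_index, score) pairs for the k most similar chunks."""
    scored = [(i, cosine_similarity(query_vec, v)) for i, v in enumerate(chunk_vecs)]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


def embed(texts, model_name="all-mpnet-base-v1"):
    """Hypothetical helper: embed a list of strings with sentence-transformers."""
    # Imported lazily so the helpers above work without the package installed.
    from sentence_transformers import SentenceTransformer

    model = SentenceTransformer(model_name)
    return model.encode(texts).tolist()


# Intended usage (downloads the model on first run):
#   chunk_vecs = embed(chunks)
#   query_vec = embed(["What was the total revenue?"])[0]
#   best = top_k(query_vec, chunk_vecs, k=3)
```

The text of the top-scoring chunks is then what you pass along as context together with the user's question.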

You can go a step further and improve your retrieval process with re-ranking or Augmented SBERT.
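A minimal sketch of the re-ranking idea: take the top-k chunks from the embedding search, score each (query, chunk) pair with a second model, and keep the best ones. The `rerank` function and the toy `word_overlap_scorer` are assumptions for illustration. `cross_encoder_scorer` assumes the `sentence-transformers` package and its `CrossEncoder.predict` method, and the model name “cross-encoder/ms-marco-MiniLM-L-6-v2” is my own pick, not something recommended in this thread.

```python
def rerank(query, candidates, score_pairs, top_n=3):
    """Re-order candidate passages by a pairwise relevance score.

    score_pairs takes a list of (query, passage) tuples and returns
    one numeric score per pair, higher meaning more relevant.
    """
    pairs = [(query, passage) for passage in candidates]
    scores = score_pairs(pairs)
    ranked = sorted(zip(candidates, scores), key=lambda item: item[1], reverse=True)
    return [passage for passage, _ in ranked[:top_n]]


def word_overlap_scorer(pairs):
    """Toy stand-in scorer: number of distinct words shared by query and passage."""
    scores = []
    for query, passage in pairs:
        shared = set(query.lower().split()) & set(passage.lower().split())
        scores.append(len(shared))
    return scores


def cross_encoder_scorer(model_name="cross-encoder/ms-marco-MiniLM-L-6-v2"):
    """Build a scorer backed by a sentence-transformers CrossEncoder."""
    # Imported lazily so rerank() can be tried without the package installed.
    from sentence_transformers import CrossEncoder

    model = CrossEncoder(model_name)
    return lambda pairs: [float(s) for s in model.predict(pairs)]


# Intended usage with a real model (downloads it on first run):
#   best = rerank(question, retrieved_chunks, cross_encoder_scorer(), top_n=3)
```

A cross-encoder reads the query and the chunk together, so it is slower than comparing embeddings. That is why it is usually applied only to the handful of chunks the first retrieval step returns.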

I hope it was helpful.

Best regards
Christian

Hi @Christian2901

Thanks so much for the information!

You’re welcome.