Good pre-trained models for Document Answering tasks?

anon89847582 · February 20, 2024, 3:49am

Hi everyone! I hope you guys are doing well. I’m a complete beginner in the AI field and I have only completed a few projects based on YouTube videos that I’ve watched. I’m looking to use pre-trained models for a Document Answering task, but I’m unsure which models to use as there are many good models out there.

Can you guys suggest models that I can look into, and maybe past projects? I have an 80-page PDF with tables, numbers, currency and more. I’m looking to use this PDF file as the data and let the user ask any questions related to the file. I understand RAG is a thing now and I wonder if I should look into RAG as well.

Thank you.

Christian2901 · February 20, 2024, 5:34am

Hello @anon89847582

You have to create embeddings of your documents. You can create them per chunk, with each chunk up to the max sequence length of your model, or per sentence.
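As a rough illustration of the chunking step, here is a minimal sketch that splits already-extracted PDF text into overlapping chunks. The function name `chunk_words` and the `max_words` / `overlap` parameters are made up for this example. It counts whitespace-separated words as a stand-in for tokens, which is only an approximation: the model's tokenizer usually produces more tokens than words, so `max_words` should stay comfortably below the model's max sequence length.

```python
def chunk_words(text, max_words=200, overlap=20):
    """Split text into overlapping chunks of at most max_words words.

    Words are only a proxy for model tokens (assumption for this sketch).
    """
    if max_words <= 0 or not 0 <= overlap < max_words:
        raise ValueError("need max_words > 0 and 0 <= overlap < max_words")
    words = text.split()
    step = max_words - overlap  # how far the window moves each time
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + max_words]))
        if start + max_words >= len(words):
            break  # this window already reached the end of the text
    return chunks
```

The overlap is there so that a sentence cut at a chunk boundary still appears whole in one of the neighbouring chunks. For tables it may work better to keep each table (or each row with its header) as its own chunk, but that depends on how the PDF text was extracted.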

To answer your question, it depends on your task and the PDF content. A good starting point is a BERT model with mean pooling. Embed your query and your PDF files, then do a cosine similarity check and retrieve the top-k results as the context for your query. The “all-mpnet-base-v1” model is also an option.
You can also find good models to start with here: Pretrained Models — Sentence-Transformers documentation
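The retrieval step described above could look roughly like the sketch below. The cosine similarity and top-k parts are plain Python so the logic is easy to follow; `cosine_similarity`, `top_k` and `embed` are names chosen for this example. The `embed` helper assumes the `sentence-transformers` package is installed and that the model can be downloaded, and it is imported lazily so the rest runs without it.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # treat a zero vector as similar to nothing
    return dot / (norm_a * norm_b)

def top_k(query_vec, chunk_vecs, k=3):
    """Return (chunk_index, score) for the k most similar chunk vectors."""
    scored = [(cosine_similarity(query_vec, v), i) for i, v in enumerate(chunk_vecs)]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [(i, score) for score, i in scored[:k]]

def embed(texts, model_name="all-mpnet-base-v1"):
    """Embed a list of strings (assumes sentence-transformers is installed)."""
    from sentence_transformers import SentenceTransformer  # lazy import
    model = SentenceTransformer(model_name)
    return model.encode(texts).tolist()
```

Usage would be along the lines of `vecs = embed(chunks)`, `q = embed([question])[0]`, then `top_k(q, vecs, k=3)` to get the indices of the chunks to paste into the prompt as context. For an 80-page PDF a plain loop like this is fast enough; a vector index only becomes interesting with much larger collections.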

You can go a step further and improve your retrieval process with re-ranking or Augmented SBERT.
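To make the re-ranking idea a bit more concrete, here is a minimal sketch: first retrieve a larger candidate set with embeddings, then score each (query, chunk) pair with a second, slower model and keep the best few. The names `rerank` and `cross_encoder_scores` are invented for this example, and the cross-encoder model name is just one publicly available choice, not a recommendation from the docs. The scorer is passed in as a function so the logic can be tried without downloading anything.

```python
def rerank(query, candidates, score_fn, top_n=3):
    """Re-order candidate chunks by a pairwise relevance score.

    score_fn takes a list of (query, chunk) pairs and returns one score per pair.
    """
    scores = score_fn([(query, chunk) for chunk in candidates])
    ranked = sorted(zip(candidates, scores), key=lambda pair: pair[1], reverse=True)
    return ranked[:top_n]

def cross_encoder_scores(pairs, model_name="cross-encoder/ms-marco-MiniLM-L-6-v2"):
    """Score pairs with a cross-encoder (assumes sentence-transformers is installed)."""
    from sentence_transformers import CrossEncoder  # lazy import
    model = CrossEncoder(model_name)
    return [float(s) for s in model.predict(pairs)]
```

A call would look like `rerank(question, candidate_chunks, cross_encoder_scores, top_n=3)`. Augmented SBERT is a different kind of improvement (a way of building extra training data for fine-tuning), so it is not covered by this sketch.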

I hope it was helpful.

Best regards
Christian

anon89847582 · February 20, 2024, 5:43am

Hi @Christian2901

Thanks so much for the information!

Christian2901 · February 20, 2024, 5:44am

you’re welcome
