Hierarchical semantic search over PDFs

AndreaDatanet · September 8, 2023, 7:58am

Hello,

I’d like to implement semantic search over PDFs and other kinds of documents.

I came across Faiss and got it to work after a few tries (using the LangChain library). At first I had problems because many of the docs are in Italian, but I fixed that by switching the sentence transformer from all-MiniLM-L6-v2 to paraphrase-multilingual-MiniLM-L12-v2.

The results are far from perfect: the search doesn’t always find what I’m looking for, and I think the main weak point is the conversion from PDFs to vectors.

Currently it doesn’t take into account the hierarchy of the document (titles, subtitles, paragraphs). Is there a way to do so? Also, is it possible to define a “word weight” so that some words get priority over others?
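
To make the hierarchy question concrete, here is a minimal sketch of what I have in mind, not something I have working. It assumes the PDF has already been parsed into (level, text) blocks, where level 0 is a paragraph, 1 a title, 2 a subtitle, and so on; that parsing step, and the names Chunk and chunk_by_headings, are hypothetical. The idea is to prefix each paragraph with its heading trail before it is embedded, so the vector reflects where the paragraph sits in the document.

```python
from dataclasses import dataclass


@dataclass
class Chunk:
    path: list   # heading trail, e.g. ["Title", "Subtitle"] (hypothetical structure)
    body: str    # paragraph text found under that trail

    def text_for_embedding(self) -> str:
        # Prefix the heading trail so the embedded text carries its position
        # in the document; fall back to the bare paragraph if there is no heading.
        if not self.path:
            return self.body
        return " > ".join(self.path) + "\n" + self.body


def chunk_by_headings(blocks):
    """Group paragraphs under their headings.

    blocks: iterable of (level, text) pairs, assumed to come from a PDF parser;
    level 0 = paragraph, 1 = title, 2 = subtitle, and so on.
    """
    path = []
    chunks = []
    for level, text in blocks:
        if level > 0:
            # A heading at depth `level` replaces any heading at that depth or deeper.
            path = path[: level - 1] + [text]
        else:
            chunks.append(Chunk(path=list(path), body=text))
    return chunks


blocks = [
    (1, "Manuale"),
    (2, "Installazione"),
    (0, "Scaricare il pacchetto."),
    (2, "Configurazione"),
    (0, "Modificare il file."),
]
chunks = chunk_by_headings(blocks)
texts = [c.text_for_embedding() for c in chunks]
```

The strings in texts would then be passed to the sentence transformer in place of the raw paragraphs. Whether this actually improves retrieval is something I would still have to measure.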

Any best practices or guides would be appreciated.

Also, I’m not sure FAISS can do all the work, maybe there are good alternatives such as ElasticSearch?

Thanks in advance
