Hierarchical semantic search over PDFs

Hello,

I’d like to implement semantic search over PDFs and other kinds of documents.

I came across FAISS and got it working after a few tries (using the LangChain library). At first I had problems because many of the documents are in Italian, but I fixed that by switching the sentence transformer from all-MiniLM-L6-v2 to paraphrase-multilingual-MiniLM-L12-v2.

The results are far from perfect: the search doesn’t always find what I’m looking for, and I think the main problem is the conversion from PDFs to vectors.

Currently it doesn’t take into account the hierarchy of the document (titles, subtitles, paragraphs). Is there a way to do so? Also, is it possible to define a “word weight” so that some words get priority over others?
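
To make the hierarchy question concrete, one possible approach is to split the text into paragraph chunks and prepend each chunk's heading path before embedding it, so the vector also carries the title and subtitle context. The sketch below is only an illustration under assumptions: it assumes the PDF text has already been extracted with headings marked markdown-style ("# Title", "## Subtitle"), which is a separate step, and the names Chunk and chunk_by_headings are hypothetical, not part of FAISS or LangChain.

```python
from dataclasses import dataclass


@dataclass
class Chunk:
    heading_path: list  # e.g. ["Manuale", "Installazione"]
    text: str           # paragraph body

    def to_embedding_text(self) -> str:
        # Prepend the heading path so the embedded text includes its context.
        return " > ".join(self.heading_path) + "\n" + self.text


def chunk_by_headings(lines):
    """Split lines into paragraph chunks while tracking the heading stack.

    Assumption: headings are marked markdown-style ("# ", "## ", ...).
    Blank lines separate paragraphs.
    """
    chunks = []
    stack = []   # (level, title) pairs for the currently open headings
    buffer = []  # lines of the paragraph being collected

    def flush():
        if buffer:
            chunks.append(Chunk([t for _, t in stack], " ".join(buffer)))
            buffer.clear()

    for raw in lines:
        line = raw.strip()
        if line.startswith("#"):
            flush()
            level = len(line) - len(line.lstrip("#"))
            title = line.lstrip("#").strip()
            # Close headings at the same or a deeper level.
            while stack and stack[-1][0] >= level:
                stack.pop()
            stack.append((level, title))
        elif line:
            buffer.append(line)
        else:
            flush()
    flush()
    return chunks


doc = [
    "# Manuale",
    "## Installazione",
    "Scaricare il pacchetto.",
    "",
    "## Uso",
    "Avviare il programma.",
]
for chunk in chunk_by_headings(doc):
    print(repr(chunk.to_embedding_text()))
```

The string returned by to_embedding_text() is what would be passed to the sentence transformer in place of the bare paragraph. The heading path could also be stored as metadata next to each vector for filtering.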

Any best practices or guides would be appreciated.

Also, I’m not sure FAISS can do all of this on its own. Are there good alternatives, such as Elasticsearch?
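
For context on the FAISS versus Elasticsearch question: FAISS only does vector similarity, while a keyword engine scores documents by term matches, which is where something like a "word weight" fits naturally. A common pattern is to blend the two scores. The sketch below is a toy illustration under assumptions, not how either library works internally: the vectors are made-up two-dimensional examples, the keyword score is a simple weighted overlap (a stand-in, not BM25), and the names keyword_score, hybrid_rank, word_weights and alpha are all hypothetical.

```python
import math
import re


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def keyword_score(query, text, word_weights):
    """Weighted fraction of the query words that occur in the text.

    Words missing from word_weights get a default weight of 1.0.
    """
    q_words = set(re.findall(r"\w+", query.lower()))
    t_words = set(re.findall(r"\w+", text.lower()))
    total = sum(word_weights.get(w, 1.0) for w in q_words)
    hit = sum(word_weights.get(w, 1.0) for w in q_words & t_words)
    return hit / total if total else 0.0


def hybrid_rank(query, query_vec, docs, word_weights, alpha=0.5):
    """Rank (text, vector) pairs by a blend of vector and keyword scores.

    alpha = 1.0 uses only the vector score, alpha = 0.0 only keywords.
    """
    scored = []
    for text, vec in docs:
        score = (alpha * cosine(query_vec, vec)
                 + (1 - alpha) * keyword_score(query, text, word_weights))
        scored.append((score, text))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [text for _, text in scored]


docs = [
    ("fattura elettronica", [1.0, 0.0]),
    ("manuale di installazione", [0.9, 0.1]),
]
ranking = hybrid_rank("installazione manuale", [1.0, 0.0], docs,
                      word_weights={"installazione": 3.0})
print(ranking)
```

In a real setup the vector score would come from the FAISS index and the keyword score from the search engine, with alpha tuned on a handful of known queries.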

Thanks in advance