I trained a new tokenizer with vocabulary from my domain. I started with
I don’t understand which model I should use with the new tokenizer for semantic search.
Do I need to train the model on my new vocabulary as well before I can use it to create embeddings?
I read through the FAISS tutorial here: Semantic search with FAISS - Hugging Face NLP Course. I get reasonably good matches on my dataset using the base tokenizer, but if I swap the base tokenizer for the trained one, I don’t get any matches.
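To illustrate what I think is happening (this is just a toy sketch with a made-up vocabulary, not real model code): the model’s input embedding matrix is indexed by token ID, so a retrained tokenizer assigns new IDs that point at rows the model learned for completely different tokens.

```python
import numpy as np

# Toy illustration: the model's embedding matrix is looked up by token ID.
base_vocab = {"semantic": 0, "search": 1, "query": 2}   # original tokenizer's mapping
new_vocab = {"query": 0, "semantic": 1, "search": 2}    # retrained tokenizer's mapping

rng = np.random.default_rng(0)
embedding_matrix = rng.normal(size=(3, 4))  # rows were learned against base_vocab IDs

def embed(word, vocab):
    # The model only sees the ID, not the word itself.
    return embedding_matrix[vocab[word]]

# Same word, different tokenizer -> different row of the matrix.
same = np.allclose(embed("semantic", base_vocab), embed("semantic", new_vocab))
print(same)
```

If this picture is right, it would explain why the embeddings (and therefore the FAISS matches) fall apart as soon as only the tokenizer is swapped.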
So does the model also need to be trained on the new dataset? And if so, would I need to train from scratch, or is there a way to just fine-tune from this checkpoint?