I trained a new tokenizer with vocabulary from my domain. I started with
I don’t understand which model I should use with the new tokenizer for semantic search.
Do I need to train the model on my new vocabulary as well before I can use it to create embeddings?
I read through the FAISS tutorial here: Semantic search with FAISS - Hugging Face NLP Course. I get reasonably good matches on my dataset using the base tokenizer, but if I swap the base tokenizer for the trained one, I don’t get any matches.
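To illustrate what I think is happening (this is just a toy sketch with a made-up vocabulary, not real model code): the model’s input embedding matrix is indexed by token ID, so a retrained tokenizer assigns new IDs that point at rows the model learned for completely different tokens.

```python
import numpy as np

# Toy illustration: the model's embedding matrix is looked up by token ID.
base_vocab = {"semantic": 0, "search": 1, "query": 2}   # original tokenizer's mapping
new_vocab = {"query": 0, "semantic": 1, "search": 2}    # retrained tokenizer's mapping

rng = np.random.default_rng(0)
embedding_matrix = rng.normal(size=(3, 4))  # rows were learned against base_vocab IDs

def embed(word, vocab):
    # The model only sees the ID, not the word itself.
    return embedding_matrix[vocab[word]]

# Same word, different tokenizer -> different row of the matrix.
same = np.allclose(embed("semantic", base_vocab), embed("semantic", new_vocab))
print(same)
```

If this picture is right, it would explain why the embeddings (and therefore the FAISS matches) fall apart as soon as only the tokenizer is swapped.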
So does the model also need to be trained on the new dataset? And if so, would I need to train from scratch, or is there a way to just fine-tune from this checkpoint?