Do I need to use a pre-trained tokenizer?

If I have legal text data that I want to summarize, then for a new word that is not present in the pre-trained tokenizer's vocabulary (embedding), the cosine similarity will obviously be zero.

Example:
cosine("judiciare", "judgement")
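
To make the concern concrete, here is a minimal sketch with a made-up vocabulary and embedding vectors (all values hypothetical), assuming the out-of-vocabulary word falls back to a zero vector, which is what drives the cosine similarity to zero:

```python
import numpy as np

# Toy "pre-trained" vocabulary and embeddings (hypothetical values,
# only to illustrate the out-of-vocabulary problem described above).
vocab = {"judgement": 0, "court": 1, "law": 2}
embeddings = np.array([
    [0.8, 0.1, 0.3],   # "judgement"
    [0.7, 0.2, 0.4],   # "court"
    [0.6, 0.3, 0.5],   # "law"
])
unk_vector = np.zeros(3)  # assumed fallback for unknown words

def embed(word):
    idx = vocab.get(word)
    return embeddings[idx] if idx is not None else unk_vector

def cosine(a, b):
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b / denom) if denom else 0.0

# "judiciare" is not in the pre-trained vocabulary, so with the
# zero-vector fallback its similarity to "judgement" is 0.0.
print(cosine(embed("judiciare"), embed("judgement")))  # 0.0
print(cosine(embed("court"), embed("judgement")))      # ≈ 0.98
```

Note that modern subword tokenizers (e.g. BPE or WordPiece) would split an unseen word like "judiciare" into known subword pieces rather than mapping it to a single unknown token, so the behavior may differ from this zero-vector sketch.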