Do I need to use a pre-trained tokenizer?

If I have legal text data that I want to summarize, then for a new word that is not present in the pre-trained tokenizer's vocabulary (embedding), the cosine similarity will obviously be zero.

Example:
cosine("judiciare", "judgement")
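
To make the concern concrete, here is a minimal sketch with a made-up vocabulary and embedding vectors (all values hypothetical), assuming the out-of-vocabulary word falls back to a zero vector, which is what drives the cosine similarity to zero:

```python
import numpy as np

# Toy "pre-trained" vocabulary and embeddings (hypothetical values,
# only to illustrate the out-of-vocabulary problem described above).
vocab = {"judgement": 0, "court": 1, "law": 2}
embeddings = np.array([
    [0.8, 0.1, 0.3],   # "judgement"
    [0.7, 0.2, 0.4],   # "court"
    [0.6, 0.3, 0.5],   # "law"
])
unk_vector = np.zeros(3)  # assumed fallback for unknown words

def embed(word):
    idx = vocab.get(word)
    return embeddings[idx] if idx is not None else unk_vector

def cosine(a, b):
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b / denom) if denom else 0.0

# "judiciare" is not in the pre-trained vocabulary, so with the
# zero-vector fallback its similarity to "judgement" is 0.0.
print(cosine(embed("judiciare"), embed("judgement")))  # 0.0
print(cosine(embed("court"), embed("judgement")))      # ≈ 0.98
```

Note that modern subword tokenizers (e.g. BPE or WordPiece) would split an unseen word like "judiciare" into known subword pieces rather than mapping it to a single unknown token, so the behavior may differ from this zero-vector sketch.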