Hi,
I am finetuning a T5 model for QA on my dataset, but the vocabulary is very different from the tokenizer's, which results in excessively long token_id/token sequences. Can I train a new tokenizer from the existing one and use it for finetuning? Something like the sketch below is what I have in mind. If yes, any tips/resources would help.
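For concreteness, this is roughly what I mean, assuming the T5 fast tokenizer supports `train_new_from_iterator` (the model name, vocab size, and corpus iterator here are just placeholders, not my actual setup):

```python
from transformers import AutoTokenizer

old_tokenizer = AutoTokenizer.from_pretrained("t5-base")  # placeholder checkpoint

texts = ["..."]  # raw text from my QA dataset would go here

def corpus_iterator(batch_size=1000):
    # yield batches of raw text for tokenizer training
    for i in range(0, len(texts), batch_size):
        yield texts[i:i + batch_size]

# retrain the tokenizer on my corpus, keeping the same algorithm/special tokens
new_tokenizer = old_tokenizer.train_new_from_iterator(corpus_iterator(), vocab_size=32000)
new_tokenizer.save_pretrained("my-t5-tokenizer")
```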
Thanks
What I did was build a set of words I wanted kept as single tokens and used tokenizer.add_tokens(new_tokens).
Remember to resize the model's embedding matrix as well: model.resize_token_embeddings(len(tokenizer)).
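A minimal sketch of what that looks like end to end (the checkpoint name and the example tokens are placeholders, swap in your own):

```python
from transformers import AutoTokenizer, T5ForConditionalGeneration

tokenizer = AutoTokenizer.from_pretrained("t5-base")
model = T5ForConditionalGeneration.from_pretrained("t5-base")

# domain-specific words that currently get split into many subword pieces (placeholder list)
new_tokens = ["electrocardiogram", "myocarditis"]

# only tokens not already in the vocab are added; returns the number actually added
num_added = tokenizer.add_tokens(new_tokens)

# grow the embedding matrix so the new token ids have rows; the new rows are
# randomly initialized and get learned during finetuning
model.resize_token_embeddings(len(tokenizer))
```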