Tuning a tokenizer on my own dataset

Miriam · January 25, 2021, 12:09am

I have an English dataset whose vocabulary contains some words that may be missing from the standard vocabulary used by RobertaTokenizer. Hence, I’d like to add tokens to the tokenizer. I’d like to avoid training the tokenizer from scratch, since in that case I wouldn’t be able to fine-tune the pretrained RoBERTa model on top of it.

Since I do not know ahead of time the full list of tokens I’d like to add, I thought I could do the following:
Train a tokenizer from scratch on my new dataset, then look at the resulting vocab file and add all the new tokens (those that do not exist in the standard RobertaTokenizer vocab) via

from transformers import RobertaTokenizer

tokenizer = RobertaTokenizer.from_pretrained("roberta-base")
tokenizer.add_tokens(list_of_new_tokens, special_tokens=True)

(as described here: Huggingface BERT Tokenizer add new token - Stack Overflow).
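The plan above could be sketched roughly as follows. This is only an illustration under stated assumptions: `select_new_tokens` and `extend_roberta` are hypothetical helper names, the `min_length` filter is an arbitrary choice, and the "Ġ" handling assumes the scratch-trained vocab comes from a byte-level BPE tokenizer (which marks a leading space with "Ġ", as RoBERTa's vocab does). The sketch also resizes the model's embedding matrix after adding tokens, which is needed before fine-tuning so the new token ids have embedding rows.

```python
from typing import Iterable, List


def select_new_tokens(
    candidate_vocab: Iterable[str],
    base_vocab: Iterable[str],
    min_length: int = 3,  # hypothetical filter to skip very short fragments
) -> List[str]:
    """Return candidate words that are absent from the base vocabulary."""
    base = set(base_vocab)
    seen = set()
    new_tokens = []
    for token in candidate_vocab:
        # Byte-level BPE vocabularies mark a leading space with "Ġ";
        # strip it so "Ġword" and "word" are treated as the same word.
        word = token.lstrip("Ġ")
        if len(word) < min_length or word in seen:
            continue
        if word in base or "Ġ" + word in base:
            continue  # already covered by the pretrained vocabulary
        seen.add(word)
        new_tokens.append(word)
    return new_tokens


def extend_roberta(new_tokens: List[str], model_name: str = "roberta-base"):
    """Add tokens to the pretrained tokenizer and resize the model to match."""
    # Imported lazily so the vocabulary diff above works without transformers.
    from transformers import RobertaForMaskedLM, RobertaTokenizer

    tokenizer = RobertaTokenizer.from_pretrained(model_name)
    model = RobertaForMaskedLM.from_pretrained(model_name)
    num_added = tokenizer.add_tokens(new_tokens)
    # New token ids need embedding rows; these start untrained and are
    # learned during fine-tuning.
    model.resize_token_embeddings(len(tokenizer))
    return tokenizer, model, num_added
```

Here `candidate_vocab` would be the keys of the vocab file produced by the scratch-trained tokenizer, and `base_vocab` could be `RobertaTokenizer.from_pretrained("roberta-base").get_vocab().keys()`. This sketch calls `add_tokens` without `special_tokens=True`, on the assumption that the additions are ordinary domain words and not control tokens.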

Does this approach make sense? Or am I missing something? Is there a better way to approach this issue?
