Tokenizer - Add new Tokens

mortom · August 11, 2022, 4:23pm

Hi,
I’m trying to use the Protein T5 model (Rostlab/prot_t5_xl_uniref50 on Hugging Face) with some additional letters beyond the standard amino acids.

I can add these tokens to the Tokenizer through its method “add_tokens”.

But do I need to take any care when doing so? Does the order in which I add these tokens matter, either among themselves or relative to the tokens already present? Do I need to reorder them somehow?

Thanks in advance.

ianuragbhatt · May 24, 2023, 1:01pm

You don’t need to do anything special; just call tokenizer.add_tokens(<list_of_words>) directly. The order of these tokens doesn’t matter.
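For illustration, here is a minimal sketch of what that call looks like in practice. The helper name `add_new_tokens` and the example tokens are hypothetical; the calls it relies on (`add_tokens`, `len(tokenizer)`, `convert_tokens_to_ids`, `resize_token_embeddings`) are the standard Transformers ones. New tokens are appended after the existing vocabulary, so existing IDs stay put and the order only decides which new ID each new token gets. The one extra step usually needed is resizing the model's embedding matrix, and the new rows start untrained, so some fine-tuning is typically required before they are useful.

```python
from typing import Dict, List


def add_new_tokens(tokenizer, model, new_tokens: List[str]) -> Dict[str, int]:
    """Add tokens to a tokenizer and resize the model embeddings to match.

    Hypothetical helper: `tokenizer` is assumed to be a Hugging Face
    tokenizer and `model` a Hugging Face PreTrainedModel.
    """
    # add_tokens ignores tokens that are already in the vocabulary and
    # returns the number of tokens that were actually added.
    num_added = tokenizer.add_tokens(new_tokens)

    # New tokens receive IDs past the end of the old vocabulary, so the
    # embedding matrix has to grow to cover them.
    if num_added > 0:
        model.resize_token_embeddings(len(tokenizer))

    # Report the ID each requested token maps to.
    return {tok: tokenizer.convert_tokens_to_ids(tok) for tok in new_tokens}


# Possible usage (not executed here; assumes the checkpoint from the
# question and made-up extra tokens):
#
#   from transformers import T5Tokenizer, T5EncoderModel
#   name = "Rostlab/prot_t5_xl_uniref50"
#   tokenizer = T5Tokenizer.from_pretrained(name, do_lower_case=False)
#   model = T5EncoderModel.from_pretrained(name)
#   print(add_new_tokens(tokenizer, model, ["J", "1"]))
```

Passing the tokens in a different order would only swap the new IDs among the new tokens; the IDs of the original amino-acid tokens are unaffected either way.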
