Adding too many tokens breaks tokenizer

If I try to add too many tokens with add_tokens(), everything grinds to a halt when loading the tokenizer. I have been banging my head against the wall for a couple of days trying to find a workaround, and I think it's just time to reach out to the community.
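For context, here is roughly what I'm doing, as a minimal sketch (the base model, the token count, and the save path are placeholders rather than my exact setup):

```python
from transformers import AutoTokenizer, AutoModelForCausalLM

# Placeholder base model; my actual model differs
tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

# A large batch of new tokens (placeholder strings standing in for my real vocabulary)
new_tokens = [f"custom_token_{i}" for i in range(50_000)]
num_added = tokenizer.add_tokens(new_tokens)
print(f"added {num_added} tokens")

# Resize the embedding matrix so the model matches the new vocab size
model.resize_token_embeddings(len(tokenizer))

# Save and reload -- this reload step is where things grind to a halt for me
tokenizer.save_pretrained("./expanded-tokenizer")
reloaded = AutoTokenizer.from_pretrained("./expanded-tokenizer")
```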

What is the recommended course of action when I want to transfer-learn from an existing model via fine-tuning but also want to radically expand the tokenizer's vocabulary? Is that ever recommended? Or maybe it's just generally inadvisable to have more than 50k-ish tokens. Or maybe I could add a new layer directly between the embedding layer and the first hidden layer to capture the sorts of patterns an expanded vocabulary would have captured. Or maybe I don't need to expand the vocabulary at all! I don't know; I'm pretty new to all this. Any community wisdom here would be welcome and appreciated.

I could go into a lot of detail about the things I've tried over the last couple of days, but I think it's better to keep this short. Happy to talk more about that, or my project, if anyone is interested. But briefly, here's why I think expanding the vocabulary makes sense: I'm trying to teach an LLM stable-diffusion-prompt-ese, which involves a lot of comma-delimited expressions that recur and should be thought of as single tokens.
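To make that concrete, here's the kind of thing I mean (the phrase is a made-up example, not from my actual list):

```python
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")  # placeholder base model

phrase = "cinematic lighting"  # example of a recurring prompt fragment
print(tok.tokenize(phrase))    # splits into several subword tokens by default

tok.add_tokens([phrase])       # after adding, the whole phrase is kept as one token
print(tok.tokenize(phrase))    # -> ['cinematic lighting']
```

Multiply that by thousands of recurring fragments and you get the vocabulary expansion I was describing above.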

Thanks!