Adding too many tokens breaks tokenizer

If I try to add too many tokens with add_tokens(), everything grinds to a halt when loading the tokenizer. I have been banging my head against the wall for a couple of days trying to find a workaround, and I think it's just time to reach out to the community.
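For context, here is roughly what I'm doing, as a minimal sketch (the base model, the token count, and the save path are placeholders rather than my exact setup):

```python
from transformers import AutoTokenizer, AutoModelForCausalLM

# Placeholder base model; my actual model differs
tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

# A large batch of new tokens (placeholder strings standing in for my real vocabulary)
new_tokens = [f"custom_token_{i}" for i in range(50_000)]
num_added = tokenizer.add_tokens(new_tokens)
print(f"added {num_added} tokens")

# Resize the embedding matrix so the model matches the new vocab size
model.resize_token_embeddings(len(tokenizer))

# Save and reload -- this reload step is where things grind to a halt for me
tokenizer.save_pretrained("./expanded-tokenizer")
reloaded = AutoTokenizer.from_pretrained("./expanded-tokenizer")
```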

What is the recommended course of action when I want to transfer-learn from an existing model via fine-tuning but also want to radically expand the tokenizer's vocabulary? Is that ever recommended? Or maybe it's just generally inadvisable to have more than 50k-ish tokens. Or maybe I could add a new layer directly between the embedding layer and the first hidden layer to capture the sorts of patterns an expanded vocabulary would have captured. Or maybe I don't need to expand the vocabulary at all! I don't know; I'm pretty new to all this. Any community wisdom here would be welcome and appreciated.

I could go into a lot of detail about the things I've tried over the last couple of days, but I think it's better to keep this short. Happy to talk more about that, or my project, if anyone is interested. But briefly, here's why I think expanding the vocabulary makes sense: I'm trying to teach an LLM stable-diffusion-prompt-ese, which involves a lot of comma-delimited expressions that recur and should be thought of as single tokens.
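To make that concrete, here's the kind of thing I mean (the phrase is a made-up example, not from my actual list):

```python
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")  # placeholder base model

phrase = "cinematic lighting"  # example of a recurring prompt fragment
print(tok.tokenize(phrase))    # splits into several subword tokens by default

tok.add_tokens([phrase])       # after adding, the whole phrase is kept as one token
print(tok.tokenize(phrase))    # -> ['cinematic lighting']
```

Multiply that by thousands of recurring fragments and you get the vocabulary expansion I was describing above.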

Thanks!