Extending the tokenizer affects model generation

Hi,

I followed the example and extended the tokenizer for a new language. Something like this:

from transformers import AutoTokenizer

llama_tokenizer = AutoTokenizer.from_pretrained(base_model)
new_tokenizer = llama_tokenizer.train_new_from_iterator(training_corpus, 138000)

The new tokenizer itself seems to work fine: I am able to decode text in both the original and the new language. But if I use the model to generate new text, it no longer works:

from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained(base_model)
tokenizer = AutoTokenizer.from_pretrained(<new_tokenizer>)
# Resize the embedding matrix to match the new vocabulary size.
model.resize_token_embeddings(len(tokenizer))

If I use this model to generate and batch_decode in the original language of the base model, it produces answers that mix text from the original and the new language. Why is this happening? I have also trained the model on more text from the new language, and that does not seem to help either.
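
For reference, the generation step looks roughly like this (the prompt here is just a placeholder):

inputs = tokenizer("Hello, how are you?", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=50)
print(tokenizer.batch_decode(outputs, skip_special_tokens=True))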

Thanks
Mohan

tokenizer = AutoTokenizer.from_pretrained(<new_tokenizer>)

This is just a hypothesis, but I think AutoTokenizer may not recognize your new tokenizer.
If the structure itself has not changed, you can load it with the LlamaTokenizer class; if the structure has changed, you can define a new tokenizer class for it.
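
If it is the former, something like this might work (just a sketch; the path is a placeholder for your saved tokenizer directory):

from transformers import LlamaTokenizer

# Load the retrained tokenizer with the concrete Llama class instead of AutoTokenizer.
tokenizer = LlamaTokenizer.from_pretrained("<new_tokenizer>")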

Thanks John. I tried some of your approaches. It did not help much (maybe I did not implement it correctly), but it gave me some clues. Finally I tried this:

from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained(base_model)
llama_tokenizer = AutoTokenizer.from_pretrained(<base_llama_model>)
tokenizer = AutoTokenizer.from_pretrained(<new_tokenizer>)

# Add only the tokens missing from the base vocabulary,
# so the original token-to-ID mapping stays intact.
diff_tokens = set(tokenizer.vocab).difference(llama_tokenizer.vocab)
llama_tokenizer.add_tokens(list(diff_tokens))
model.resize_token_embeddings(len(llama_tokenizer))
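
As a quick sanity check, I round-trip a couple of strings through the extended tokenizer (the sample texts are placeholders):

for text in ["a sentence in the original language", "a sentence in the new language"]:
    ids = llama_tokenizer(text, add_special_tokens=False)["input_ids"]
    print(llama_tokenizer.decode(ids))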

This seems to work. It is still not clear to me why the new tokenizer I showed before does not work. Isn’t that expected to work? I followed the train_new_from_iterator example from the documentation.

Thanks
Mohan


Well, whatever the case, I’m glad it seems to be working. :grinning:

It is still not clear to me why the new tokenizer I showed before does not work. Isn’t that expected to work?

It seems that the problem of the tokenizer not updating properly has been around for a while…
Apparently the function itself works as expected and there is no bug, but the behavior is not easy for users to understand: train_new_from_iterator trains a brand-new tokenizer, so even tokens from the original language can end up with different IDs, and the pretrained embedding rows no longer line up with them. add_tokens, by contrast, only appends new tokens and leaves the original mapping intact, which is why your workaround behaves better.
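
A quick way to see this (the model paths are placeholders):

from transformers import AutoTokenizer

old_tok = AutoTokenizer.from_pretrained("<base_llama_model>")
new_tok = AutoTokenizer.from_pretrained("<new_tokenizer>")

word = "hello"  # any word from the original language
# These ID lists usually differ, so the pretrained embedding rows
# no longer match what the retrained tokenizer produces.
print(old_tok(word, add_special_tokens=False)["input_ids"])
print(new_tok(word, add_special_tokens=False)["input_ids"])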
