Extending the tokenizer affects model generation

Hi,

I followed the example and extended the tokenizer for a new language. Something like this:

from transformers import AutoTokenizer

llama_tokenizer = AutoTokenizer.from_pretrained(base_model)
new_tokenizer = llama_tokenizer.train_new_from_iterator(training_corpus, 138000)

The new tokenizer itself seems to work fine: I am able to decode text in both the original and the new language. But if I use the model to generate new text, it no longer works:

from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained(base_model)
tokenizer = AutoTokenizer.from_pretrained(<new_tokenizer>)
# Resize the embedding matrix to match the new vocabulary size.
model.resize_token_embeddings(len(tokenizer))

If I use this model to generate and batch_decode in the original language of the base model, it produces answers that mix text from the original and the new language. Why is this happening? I have also trained the model on more text from the new language, and that does not seem to help either.
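
For reference, the generation step looks roughly like this (the prompt here is just a placeholder):

inputs = tokenizer("Hello, how are you?", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=50)
print(tokenizer.batch_decode(outputs, skip_special_tokens=True))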

Thanks
Mohan

tokenizer = AutoTokenizer.from_pretrained(<new_tokenizer>)

This is just a hypothesis, but I think AutoTokenizer may not recognize your new tokenizer.
If the structure itself has not changed, you can load it with the LlamaTokenizer class; if the structure has changed, you can define a new tokenizer class for it.
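
If it is the former, something like this might work (just a sketch; the path is a placeholder for your saved tokenizer directory):

from transformers import LlamaTokenizer

# Load the retrained tokenizer with the concrete Llama class instead of AutoTokenizer.
tokenizer = LlamaTokenizer.from_pretrained("<new_tokenizer>")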

Thanks John. I tried some of your approaches. It did not help much (maybe I did not implement it correctly), but it gave me some clues. Finally I tried this:

from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained(base_model)
llama_tokenizer = AutoTokenizer.from_pretrained(<base_llama_model>)
tokenizer = AutoTokenizer.from_pretrained(<new_tokenizer>)

# Add only the tokens missing from the base vocabulary,
# so the original token-to-ID mapping stays intact.
diff_tokens = set(tokenizer.vocab).difference(llama_tokenizer.vocab)
llama_tokenizer.add_tokens(list(diff_tokens))
model.resize_token_embeddings(len(llama_tokenizer))
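
As a quick sanity check, I round-trip a couple of strings through the extended tokenizer (the sample texts are placeholders):

for text in ["a sentence in the original language", "a sentence in the new language"]:
    ids = llama_tokenizer(text, add_special_tokens=False)["input_ids"]
    print(llama_tokenizer.decode(ids))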

This seems to work. It is still not clear to me why the new tokenizer I showed before does not work. Isn’t that expected to work? I followed the train_new_from_iterator example from the documentation.

Thanks
Mohan


Well, whatever the case, I’m glad it seems to be working. :grinning:

It is still not clear to me why the new tokenizer I showed before does not work. Isn’t that expected to work?

It seems that the problem of the tokenizer not updating properly has been around for a while…
Apparently the function itself works as expected and there is no bug, but the behavior is not easy for users to understand: train_new_from_iterator trains a brand-new tokenizer, so even tokens from the original language can end up with different IDs, and the pretrained embedding rows no longer line up with them. add_tokens, by contrast, only appends new tokens and leaves the original mapping intact, which is why your workaround behaves better.
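
A quick way to see this (the model paths are placeholders):

from transformers import AutoTokenizer

old_tok = AutoTokenizer.from_pretrained("<base_llama_model>")
new_tok = AutoTokenizer.from_pretrained("<new_tokenizer>")

word = "hello"  # any word from the original language
# These ID lists usually differ, so the pretrained embedding rows
# no longer match what the retrained tokenizer produces.
print(old_tok(word, add_special_tokens=False)["input_ids"])
print(new_tok(word, add_special_tokens=False)["input_ids"])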
