Hello there!
I ran into a duplicate-token issue when trying to use a converted model (Marian model, Helsinki-NLP / Tatoeba-Challenge) in the HF environment.
I converted the model into an HF-usable format with the convert_marian_tatoeba_to_pytorch.py script.
%%bash
cd /content/gdrive/MyDrive/convert_exp/transformers
python src/transformers/models/marian/convert_marian_to_pytorch.py --src PATH_TO_MARIAN_MODEL --dest PATH_TO_PT_MODEL
>> added 1 tokens to vocab
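To sanity-check what the conversion wrote out, the special-token ids in the saved config can be inspected first (a quick sketch; the keys below are the standard MarianConfig fields):
import json
with open("PATH_TO_PT_MODEL/config.json") as f:
    cfg = json.load(f)
# pad / eos / decoder-start ids and vocab size as written by the conversion
print(cfg["vocab_size"], cfg["pad_token_id"], cfg["eos_token_id"], cfg["decoder_start_token_id"])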
When working with the conversion result, the tokenizer works fine, but model.generate() produces the same token repeated over and over.
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

tokenizer = AutoTokenizer.from_pretrained("PATH_TO_PT_MODEL")
model = AutoModelForSeq2SeqLM.from_pretrained("PATH_TO_PT_MODEL")
input_line, device = 'سلام ، آیا می توانید این را ترجمه کنید؟', 'cuda'  # "Hello, can you translate this?"
model.to(device)
inputs = tokenizer.encode(input_line, return_tensors="pt").to(device)
out = model.generate(inputs, max_length=100)  # limiting max_length
out
>> tensor([[63282, 12927, 12927, 12927, 12927, 12927, 12927, 12927, 12927, 12927,
12927, 12927, 12927, 12927, 12927, 12927, 12927, 12927, 12927, 12927,
12927, 12927, 12927, 12927, 12927, 12927, 12927, 12927, 12927, 12927,
12927, 12927, 12927, 12927, 12927, 12927, 12927, 12927, 12927, 12927,
12927, 12927, 12927, 12927, 12927, 12927, 12927, 12927, 12927, 12927,
12927, 12927, 12927, 12927, 12927, 12927, 12927, 12927, 12927, 12927,
12927, 12927, 12927, 12927, 12927, 12927, 12927, 12927, 12927, 12927,
12927, 12927, 12927, 12927, 12927, 12927, 12927, 12927, 12927, 12927,
12927, 12927, 12927, 12927, 12927, 12927, 12927, 12927, 12927, 12927,
12927, 12927, 12927, 12927, 12927, 12927, 12927, 12927, 12927, 0]],
device='cuda:0')
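Decoding the output just shows the same subword repeated until max_length is hit:
print(tokenizer.decode(out[0], skip_special_tokens=True))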
I assumed that the special tokens ended up in the wrong positions in the vocabulary. So I applied shift_tokens_right, by analogy with what is described in this topic.
from transformers.models.marian import modeling_marian
inputs = modeling_marian.shift_tokens_right(input_ids=inputs, pad_token_id=tokenizer.pad_token_id, decoder_start_token_id=tokenizer.pad_token_id)
out = model.generate(inputs, max_length=100)  # limiting max_length
out
>> tensor([[63282, 40393, 15876, 15876, 15876, 15876, 15876, 15876, 15876, 15876,
15876, 15876, 15876, 15876, 15876, 15876, 15876, 15876, 15876, 15876,
15876, 15876, 15876, 15876, 15876, 15876, 15876, 15876, 15876, 15876,
15876, 15876, 15876, 15876, 15876, 15876, 15876, 15876, 15876, 15876,
15876, 15876, 15876, 15876, 15876, 15876, 15876, 15876, 15876, 15876,
15876, 15876, 15876, 15876, 15876, 15876, 15876, 15876, 15876, 15876,
15876, 15876, 15876, 15876, 15876, 15876, 15876, 15876, 15876, 15876,
15876, 15876, 15876, 15876, 15876, 15876, 15876, 15876, 15876, 15876,
15876, 15876, 15876, 15876, 15876, 15876, 15876, 15876, 15876, 15876,
15876, 15876, 15876, 15876, 15876, 15876, 15876, 15876, 15876, 0]])
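For context, shift_tokens_right only prepends the decoder start token and shifts the sequence right by one, so applied to the encoder inputs it merely replaces the first id and drops the last one. A toy illustration (made-up ids, pad/start id 0):
import torch
from transformers.models.marian.modeling_marian import shift_tokens_right

toy = torch.tensor([[5, 6, 7, 0]])
print(shift_tokens_right(toy, pad_token_id=0, decoder_start_token_id=0))
# -> tensor([[0, 5, 6, 7]])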
Replacing the pad_token_id positions in the inputs with -100 didn't give a proper result either.
inputs[inputs == tokenizer.pad_token_id] = -100
out = model.generate(inputs, max_length=100)  # limiting max_length
out
>> ---------------------------------------------------------------------------
>> IndexError Traceback (most recent call last)
>> <ipython-input-40-d891cf4b1584> in <module>()
>> ----> 1 out = model.generate(inputs, max_length=100) #limiting max_length
>> 2 out
>> 7 frames
>> /usr/local/lib/python3.7/dist-packages/torch/nn/functional.py in embedding(input, weight, padding_idx, max_norm, norm_type, scale_grad_by_freq, sparse)
>> 1911 # remove once script supports set_grad_enabled
>> 1912 _no_grad_embedding_renorm_(weight, input, max_norm, norm_type)
>> -> 1913 return torch.embedding(weight, input, padding_idx, scale_grad_by_freq, sparse)
>> 1914
>> 1915
>> IndexError: index out of range in self
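As far as I understand, -100 is only meant as the ignore_index for the loss on labels, not as an actual input id, so the embedding lookup goes out of range here (the input embedding only has vocab_size rows):
# the embedding weight is (vocab_size, d_model); index -100 is outside that range
print(model.get_input_embeddings().weight.shape)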
Did I make a mistake during conversion, or did I miss a step needed to make the model work in the HF environment?
A draft of the work can be viewed in this Google Colab.