Hi!
If I want to use an already trained Machine Translation model for inference, I do something along these lines:
from transformers import MarianMTModel, MarianTokenizer
tokenizer = MarianTokenizer.from_pretrained("Helsinki-NLP/opus-mt-en-de")
model = MarianMTModel.from_pretrained("Helsinki-NLP/opus-mt-en-de")
sentence_en = "I am stuck with text_pair argument of Tokenizer."
input_ids = tokenizer(sentence_en, return_tensors="pt")["input_ids"]
generated_sequence = model.generate(input_ids=input_ids)[0].numpy().tolist()
translated_sentence = tokenizer.decode(generated_sequence, skip_special_tokens=True)
print(translated_sentence)
and it will return a German translation of the English sentence I fed to the model without a problem. In the example above I only needed to feed a source (English) sentence. However, if I now want to train a Machine Translation model from scratch, I will need to feed it pairs of English and German sentences, which in turn means that I need to provide the tokenizer with an English-German pair (here for simplicity I assume I will be using batches of size 1). Do I do it this way? (see below)
from transformers import MarianTokenizer
tokenizer = MarianTokenizer.from_pretrained("Helsinki-NLP/opus-mt-en-de")
sentence_en = "I am stuck with text_pair argument of Tokenizer."
sentence_de = "Ich stecke mit text_pair Argument von Tokenizer fest."
encoded_input = tokenizer(text=sentence_en, text_pair=sentence_de)
If yes, I can't make sense of my encoded_input, which looks like this:
{'input_ids': [38, 121, 21923, 33, 2183, 585, 25482, 14113, 7, 429, 2524, 7359, 3, 38, 492, 11656, 7662, 30, 2183, 585, 25482, 48548, 728, 21, 429, 2524, 7359, 17, 4299, 3, 0], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}
There are no token_type_ids in the encoded_input. How can I supply it to the model for training if there is no way for it to know where the source English text ended and the target German one started? If I convert the above ids to tokens:
print(tokenizer.convert_ids_to_tokens(encoded_input['input_ids']))
I get the following:
['▁I', '▁am', '▁stuck', '▁with', '▁text', '_', 'pair', '▁argument', '▁of', '▁To', 'ken', 'izer', '.', '▁I', 'ch', '▁ste', 'cke', '▁mit', '▁text', '_', 'pair', '▁Argu', 'ment', '▁von', '▁To', 'ken', 'izer', '▁', 'fest', '.', '</s>']
So, the tokenizer simply concatenated the two sentences and tokenized the concatenated text. There are no separators or anything else that would distinguish the source from the target.
What am I getting wrong? What is the right way to tokenize a source-target pair, keeping in mind that it will later be fed to an MT model for training?
I'd appreciate any help, as I've been stuck on this simple issue for quite a while now.