How do I tokenize input if I plan to train a Machine Translation model? I'm having difficulties with the text_pair argument of Tokenizer()

Hi!

If I want to use an already trained Machine Translation model for inference, I do something along these lines:

from transformers import MarianMTModel, MarianTokenizer

tokenizer = MarianTokenizer.from_pretrained("Helsinki-NLP/opus-mt-en-de")
model = MarianMTModel.from_pretrained("Helsinki-NLP/opus-mt-en-de")

sentence_en = "I am stuck with text_pair argument of Tokenizer."

input_ids = tokenizer(sentence_en, return_tensors="pt")["input_ids"]
generated_sequence = model.generate(input_ids=input_ids)[0].numpy().tolist()

translated_sentence = tokenizer.decode(generated_sequence, skip_special_tokens=True)

print(translated_sentence)

and it returns a German translation of the English sentence I fed to the model without a problem. In the example above I only needed to feed a source (English) sentence. However, if I now want to train a Machine Translation model from scratch, I will need to feed it pairs of English and German sentences, which in turn means that I need to provide the tokenizer with an English-German pair (for simplicity, I assume here that I will be using batches of size 1). Do I do it this way? (see below)

from transformers import MarianTokenizer

tokenizer = MarianTokenizer.from_pretrained("Helsinki-NLP/opus-mt-en-de")

sentence_en = "I am stuck with text_pair argument of Tokenizer."
sentence_de = "Ich stecke mit text_pair Argument von Tokenizer fest."

encoded_input = tokenizer(text=sentence_en, text_pair=sentence_de)

If yes, I can’t make sense of my encoded_input, which looks like this:

{'input_ids': [38, 121, 21923, 33, 2183, 585, 25482, 14113, 7, 429, 2524, 7359, 3, 38, 492, 11656, 7662, 30, 2183, 585, 25482, 48548, 728, 21, 429, 2524, 7359, 17, 4299, 3, 0], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}

There are no token_type_ids in the encoded_input. How can I supply it to the model for training if there is no way for it to know where the source English text ended and the target German one started? If I convert the above ids to tokens:

print(tokenizer.convert_ids_to_tokens(encoded_input['input_ids']))

I get the following:

['▁I', '▁am', '▁stuck', '▁with', '▁text', '_', 'pair', '▁argument', '▁of', '▁To', 'ken', 'izer', '.', '▁I', 'ch', '▁ste', 'cke', '▁mit', '▁text', '_', 'pair', '▁Argu', 'ment', '▁von', '▁To', 'ken', 'izer', '▁', 'fest', '.', '</s>']

So, the tokenizer simply concatenated the two sentences and tokenized the concatenated text. There are no separators or anything else that would distinguish the source from the target.

What am I misunderstanding? What is the right way to tokenize a source-target pair, keeping in mind that it will later be fed to an MT model for training?

I would appreciate any help, as I’ve been stuck on this simple issue for quite a while now.

Hi,

To fine-tune a MarianMT (or any other seq2seq model in the library), you don’t need to feed the source and target sentences at once to the tokenizer. Instead, they should be tokenized separately:

from transformers import MarianTokenizer

tokenizer = MarianTokenizer.from_pretrained("Helsinki-NLP/opus-mt-en-de")

input_ids = tokenizer("I am stuck with text_pair argument of Tokenizer.", return_tensors="pt").input_ids
labels = tokenizer("Ich stecke mit text_pair Argument von Tokenizer fest.", return_tensors="pt").input_ids

You can then train as follows:

outputs = model(input_ids=input_ids, labels=labels)
loss = outputs.loss

This is because we feed the encoder of the seq2seq model only the encoded input sentence (as input_ids). The decoder’s output will then be compared against the labels to compute the loss.
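To make the role of labels a bit more concrete: when only labels are passed, seq2seq models in the library typically derive the decoder inputs by shifting the labels one position to the right and prepending a start token. The snippet below is a rough pure-Python sketch of that idea, not the library's actual code. The function name shift_tokens_right, the short id list, and the start id 58100 are placeholders chosen for illustration.

```python
from typing import List


def shift_tokens_right(labels: List[int], decoder_start_token_id: int) -> List[int]:
    """Prepend the start token and drop the last label (illustrative helper)."""
    return [decoder_start_token_id] + labels[:-1]


# Hypothetical target ids; the trailing 0 stands in for an end-of-sequence id.
labels = [38, 492, 11656, 3, 0]

# 58100 is a placeholder start id, assumed here only for the example.
decoder_input_ids = shift_tokens_right(labels, decoder_start_token_id=58100)

# At position t the decoder reads decoder_input_ids[t] (plus everything before it)
# and its prediction is compared against labels[t] when the loss is computed.
training_pairs = list(zip(decoder_input_ids, labels))
print(training_pairs)
```

So the source sentence never needs to be glued to the target: the encoder gets input_ids, and the decoder side is built entirely from labels.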

The text_pair use case applies only when we want to provide sentence A [SEP] sentence B to a model, for example when using BERT to classify the relationship between two sentences, or for question answering, where we feed question [SEP] context to the model.

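For contrast, here is a toy sketch of what a BERT-style pair encoding looks like, which is where token_type_ids come from. It works on already-split word lists rather than a real tokenizer, and encode_pair is a hypothetical helper written for this post, not a library function.

```python
from typing import Dict, List


def encode_pair(tokens_a: List[str], tokens_b: List[str]) -> Dict[str, list]:
    """Lay out [CLS] A [SEP] B [SEP] and mark which segment each token belongs to."""
    tokens = ["[CLS]"] + tokens_a + ["[SEP]"] + tokens_b + ["[SEP]"]
    # Segment 0 covers [CLS], sentence A and the first [SEP];
    # segment 1 covers sentence B and the final [SEP].
    token_type_ids = [0] * (len(tokens_a) + 2) + [1] * (len(tokens_b) + 1)
    return {"tokens": tokens, "token_type_ids": token_type_ids}


encoded = encode_pair(["is", "it", "raining", "?"], ["it", "is", "sunny", "."])
print(encoded["tokens"])
print(encoded["token_type_ids"])
```

Here both sentences go into a single encoder, so the model needs the separator and segment ids to tell them apart. A translation model has no such need because source and target go to different parts of the network.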
Note that you need to tokenize your labels in the target context manager, otherwise they will be tokenized as English and not German:

with tokenizer.as_target_tokenizer():
    labels = tokenizer("Ich stecke mit text_pair Argument von Tokenizer fest.", return_tensors="pt").input_ids
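If it helps to see why the context manager matters, below is a deliberately simplified toy version of the pattern. ToyPairTokenizer and its word-level vocabularies are made up for illustration. The real tokenizer works on subword pieces and differs in detail, but the idea of temporarily switching to the target-language side is the same.

```python
from contextlib import contextmanager


class ToyPairTokenizer:
    """Toy tokenizer with separate source and target vocabularies (illustrative only)."""

    def __init__(self, source_vocab, target_vocab):
        self.source_vocab = source_vocab
        self.target_vocab = target_vocab
        self.current_vocab = source_vocab  # source side is the default

    @contextmanager
    def as_target_tokenizer(self):
        # Switch to the target vocabulary, then always switch back on exit.
        self.current_vocab = self.target_vocab
        try:
            yield
        finally:
            self.current_vocab = self.source_vocab

    def __call__(self, text):
        unk = self.current_vocab["<unk>"]
        return [self.current_vocab.get(word, unk) for word in text.split()]


toy = ToyPairTokenizer(
    source_vocab={"<unk>": 0, "I": 1, "am": 2, "stuck": 3},
    target_vocab={"<unk>": 0, "Ich": 1, "stecke": 2, "fest": 3},
)

input_ids = toy("I am stuck")            # encoded with the source vocabulary
wrong_labels = toy("Ich stecke fest")    # German encoded as if it were English
with toy.as_target_tokenizer():
    labels = toy("Ich stecke fest")      # German encoded with the target vocabulary

print(input_ids, wrong_labels, labels)
```

As far as I know, more recent releases of transformers also accept a text_target argument on the tokenizer call and deprecate the context manager, so it is worth checking the docs for the version you have installed.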

Ok wow, TIL about the as_target_tokenizer. I see it exists for several multilingual seq2seq models.

Thank you, guys! You are true stars!! Great help from a wonderful resource.
