Pretrained XLM model with TLM objective generates nonsensical predictions

Hi, I want to use the xlm-mlm-tlm-xnli15-1024 pretrained model, which is the XLM model trained with the auxiliary Translation Language Modeling (TLM) objective.

I want to feed a translation pair to the model, mask some words in one of the two sentences, and then get the model's predictions for the masked words (see the figure for reference).
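For example, XLM's mask token is `<special1>` (i.e. `tokenizer.mask_token`), so masking the word "tomato" in the English source gives:

```python
# Illustration only: XLM's mask token is "<special1>" (tokenizer.mask_token)
mask_token = "<special1>"
src_text = "I love pasta with tomato sauce!".replace("tomato", mask_token)
print(src_text)  # I love pasta with <special1> sauce!
```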

My problem is that the model makes nonsensical predictions, which probably means that I am doing something wrong. Here is a code snippet:

import torch
from transformers import XLMWithLMHeadModel, XLMTokenizer

model_name = "xlm-mlm-tlm-xnli15-1024"
tokenizer = XLMTokenizer.from_pretrained(model_name)
model = XLMWithLMHeadModel.from_pretrained(model_name)
model.eval()

src_lang_id = tokenizer.lang2id["en"] # English
trg_lang_id = tokenizer.lang2id["el"] # Greek

# English source with one word masked, plus its Greek translation
src_text = "I love pasta with tomato sauce!".replace("tomato", tokenizer.mask_token)
trg_text = "Μου αρέσουν τα ζυμαρικά με σάλτσα ντομάτας!"  # Greek for "I love pasta with tomato sauce!"

print(f"{src_text}->{trg_text}")

# get token_ids
src_input_ids = torch.tensor([tokenizer.encode(src_text)])
trg_input_ids = torch.tensor([tokenizer.encode(trg_text)])

src_len = src_input_ids.shape[1]
trg_len = trg_input_ids.shape[1]

# get lang_ids
src_langs = torch.tensor([src_lang_id] * src_len).view(1, -1)
trg_langs = torch.tensor([trg_lang_id] * trg_len).view(1, -1)

# get token_type_ids
src_type = torch.tensor([0] * src_len).view(1, -1)
trg_type = torch.tensor([1] * trg_len).view(1, -1)

input_ids = torch.cat([src_input_ids, trg_input_ids], dim=1)
token_type_ids = torch.cat([src_type, trg_type], dim=1)
lang_ids = torch.cat([src_langs, trg_langs], dim=1)
# TLM-style position ids: restart at 0 for the target sentence
position_ids = torch.cat([torch.arange(src_len), torch.arange(trg_len)])

# encode and predict (no gradients needed at inference time)
with torch.no_grad():
    result = model(input_ids,
                   langs=lang_ids,
                   position_ids=position_ids.view(1, -1),
                   token_type_ids=token_type_ids)

# get predictions for masked token
masked_index = torch.where(input_ids == tokenizer.mask_token_id)[1].tolist()[0]
result = result[0][:, masked_index].topk(5).indices
result = result.tolist()[0]

print("Predictions:", tokenizer.decode(result))
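Note that, following the original XLM TLM setup, I reset the position ids at the start of the target sentence instead of numbering the concatenated pair continuously. For hypothetical sentence lengths of 3 and 4 tokens, the positions look like this:

```python
# Illustration only: TLM-style position ids restart at 0 for the target
# sentence (hypothetical sentence lengths of 3 and 4 tokens)
src_len, trg_len = 3, 4
position_ids = list(range(src_len)) + list(range(trg_len))
print(position_ids)  # [0, 1, 2, 0, 1, 2, 3]
```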

Console output:

I love pasta with <special1> sauce!->Μου αρέσουν τα ζυμαρικά με σάλτσα ντομάτας!
Predictions: with the 'i'my
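For clarity, the last lines of the snippet just take the top-5 token ids at the masked position and decode them together as one string. The same extraction on dummy logits (hypothetical shapes, not the real model output) looks like this:

```python
import torch

# Hypothetical logits of shape (batch, seq_len, vocab_size)
torch.manual_seed(0)
logits = torch.randn(1, 6, 30)
masked_index = 2  # hypothetical position of the mask token
top5 = logits[0, masked_index].topk(5).indices.tolist()
print(top5)  # five candidate token ids, to be decoded with tokenizer.decode
```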

I tried omitting some of the arguments to the model and changing the example sentence pair and the languages, but I always get nonsensical predictions.

What am I doing wrong?

P.S. I had to downgrade to transformers==2.9.0, because in newer versions I get this warning:

Some weights of XLMWithLMHeadModel were not initialized from the model checkpoint at xlm-mlm-tlm-xnli15-1024 and are newly initialized: ['transformer.position_ids']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.

I also noticed that even in that version the predictions are the same, which suggests that something else is going on.