I am trying to randomly initialize a T5-small model; I only care about the translation task. The problem is that the randomly initialized model outputs the decoder's input. I understand that we teacher-force during training, so the model does see the correct tokens up to timestep T-1, but a randomly initialized model should not be able to predict the correct next token regardless. This behavior results in a low cross-entropy loss, which impedes training. Of course, if I try to autoregressively generate outputs, the result is complete nonsense (see the generation snippet after the reproduction code below).
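In case it matters, this is the teacher-forcing training setup I have in mind, as a minimal sketch. It relies on the documented behavior that, when only labels is passed, the model builds decoder_input_ids by shifting the labels one step to the right, so the prediction at step t only conditions on gold tokens before t:

import torch
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer, T5Config

model = AutoModelForSeq2SeqLM.from_config(T5Config.from_pretrained("t5-small"))
tokenizer = AutoTokenizer.from_pretrained("t5-small")

enc = tokenizer("translate English to Romanian: Hello, my dog is cute", return_tensors="pt")
labels = tokenizer("Salut, câinele meu este frumos", return_tensors="pt")["input_ids"]

# Passing labels makes the model shift them right internally to form decoder_input_ids
out = model(input_ids=enc["input_ids"], attention_mask=enc["attention_mask"], labels=labels)
print(out.loss)  # teacher-forced cross-entropy loss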
Snippet to reproduce the problem:
import torch
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer
from transformers.models.t5.modeling_t5 import T5Config

# Randomly initialized T5-small: config only, no pretrained weights
model = AutoModelForSeq2SeqLM.from_config(config=T5Config.from_pretrained("t5-small"))
tokenizer = AutoTokenizer.from_pretrained("t5-small")

sentence = "translate English to Romanian: Hello, my dog is cute"
label = "Salut, câinele meu este frumos"

# Encoder inputs
inputs = tokenizer(sentence, return_tensors="pt")
attention_mask = inputs["attention_mask"]
input_ids = inputs["input_ids"]

# Decoder inputs: the tokenized label is fed directly as decoder_input_ids (teacher forcing)
decoder_input_ids = tokenizer(label, return_tensors="pt")["input_ids"]
decoder_attention_mask = decoder_input_ids.ne(tokenizer.pad_token_id).long()

outputs = model(
    input_ids=input_ids,
    attention_mask=attention_mask,
    decoder_input_ids=decoder_input_ids,
    decoder_attention_mask=decoder_attention_mask,
)
logits = outputs.logits

# Greedy readout of the logits at every decoder position
print(tokenizer.decode(torch.argmax(logits, dim=-1).squeeze(0)))  # Prints "Salut, câinele meu este frumos</s>"
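And this is roughly how I check the autoregressive output mentioned above (a sketch that reuses the model, tokenizer, input_ids, and attention_mask from the snippet; max_new_tokens is an arbitrary choice):

# Autoregressive generation with the same randomly initialized model;
# the decoded output is gibberish, as expected for random weights.
generated = model.generate(input_ids=input_ids, attention_mask=attention_mask, max_new_tokens=40)
print(tokenizer.decode(generated[0], skip_special_tokens=True))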
What am I missing?