EncoderDecoderModel gives the same generation results after finetuning

:question: Questions & Help
Hi everyone. I am using transformers (v4.20.1) and trying to build a seq2seq model for multi-label classification. However, the model always generates the same output after finetuning. I found two related issues on GitHub, but neither seems to have a solution.

Here’s the code. The main logic is in __init__ and training_step.

import logging

import torch
import torch.nn as nn
import pytorch_lightning as pl
from transformers import EncoderDecoderConfig, EncoderDecoderModel, RobertaConfig

logger = logging.getLogger(__name__)


class Model(pl.LightningModule):

    def __init__(self,
                 decoder_tokenizer,
                 lr=1e-4,
                 beam_size=1,
                 num_decoder_layers=12,):
        super().__init__()

        self.pad_id = decoder_tokenizer.pad_token_id
        self.bos_id = decoder_tokenizer.bos_token_id
        self.eos_id = decoder_tokenizer.eos_token_id
        self.lr = lr
        self.beam_size = beam_size
        self.decoder_tokenizer = decoder_tokenizer

        encoder_config = RobertaConfig.from_pretrained('roberta-base')
        decoder_config = RobertaConfig(bos_token_id=self.bos_id,
                                       eos_token_id=self.eos_id,
                                       pad_token_id=self.pad_id)
        decoder_config.num_hidden_layers = num_decoder_layers

        self.config = EncoderDecoderConfig.from_encoder_decoder_configs(
            encoder_config, decoder_config)
        self.model = EncoderDecoderModel(self.config)
        self.decoder = self.model.get_decoder()
        self.decoder.resize_token_embeddings(decoder_tokenizer.vocab_size)
        self.model.config.vocab_size = self.model.config.decoder.vocab_size

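        # resize_token_embeddings() with no argument returns the (already resized)
        # input embedding module; its weights are re-initialized here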
        nn.init.xavier_uniform_(self.decoder.resize_token_embeddings().weight)

        self.model.config.decoder_start_token_id = self.bos_id
        self.model.config.pad_token_id = self.pad_id

    def training_step(self, batch, batch_idx):
        '''batch: a dict containing
        input_ids: token ids for the input sequence
        attention_mask: attention mask for the input sequence
        labels: token ids for the target sequence
        (see the sketch after the class for how such a batch is built)
        '''
        self.model.train()
        loss = self.model(**batch).loss
        self.log("train_loss",
                 loss,
                 on_step=True,
                 on_epoch=True,
                 prog_bar=True,
                 logger=True)
        self.model.eval()
        with torch.no_grad():
            predictions = self.model.generate(
                input_ids=batch['input_ids'],
                attention_mask=batch['attention_mask'],
                num_beams=self.beam_size,
                min_length=7,
                max_length=7,
                no_repeat_ngram_size=1,
                do_sample=False).cpu().numpy().tolist()
            labels = batch['labels'].cpu().numpy().tolist()
            for pred, label in zip(predictions, labels):
                logger.info(f'pred: {pred}, label: {label}')
        return loss

    def configure_optimizers(self):
        optimizer = torch.optim.AdamW(self.parameters(), lr=self.lr)
        return optimizer

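For context, here is a rough sketch of how a batch for training_step is built (the real data pipeline is omitted and works fine; the tokenizer and helper below are assumptions for illustration, but the -100 label padding matches the logs further down):

import torch
from transformers import RobertaTokenizerFast

encoder_tokenizer = RobertaTokenizerFast.from_pretrained('roberta-base')

def collate(texts, label_id_lists, decoder_tokenizer, max_label_len=10):
    # encode the input texts for the RoBERTa encoder
    enc = encoder_tokenizer(texts, padding=True, truncation=True, return_tensors='pt')
    labels = []
    for ids in label_id_lists:
        ids = [decoder_tokenizer.bos_token_id] + ids + [decoder_tokenizer.eos_token_id]
        ids = ids + [-100] * (max_label_len - len(ids))  # -100 positions are ignored by the loss
        labels.append(ids[:max_label_len])
    return {'input_ids': enc['input_ids'],
            'attention_mask': enc['attention_mask'],
            'labels': torch.tensor(labels)}
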
The phenomenon is:

  • At the beginning of training, a sanity check runs first. I sampled some generation results (shown below); the freshly initialized model is able to generate different predictions for different inputs.
 INFO: idx:   0, pred: [101, 1595, 1438, 985, 3304, 3195, 800], label: [101, 122, 153, 174, 1161, 1618, 102, -100, -100, -100]
 INFO: idx:   6, pred: [101, 1595, 1438, 985, 3304, 3195, 800], label: [101, 338, 498, 587, 2905, 102, -100, -100, -100, -100]
 INFO: idx:   7, pred: [101, 1595, 1438, 985, 3304, 3195, 800], label: [101, 109, 112, 143, 164, 278, 973, 102, -100, -100]
 INFO: idx:   9, pred: [101, 1595, 1438, 985, 3304, 3195, 800], label: [101, 109, 112, 116, 137, 174, 260, 102, -100, -100]
 INFO: idx:  10, pred: [101, 1595, 1438, 1135, 3886, 3698, 1406], label: [101, 107, 115, 119, 123, 310, 431, 102, -100, -100]
  • After only one finetuning update, the model starts to generate the same output for every input, both training samples and unseen validation samples, and keeps doing so until the end of training (30 epochs).
INFO: pred: [101, 425, 1385, 348, 2779, 3703, 1902], label: [101, 107, 119, 123, 168, 243, 306, 102, -100]
INFO: pred: [101, 425, 1385, 348, 2779, 3703, 1902], label: [101, 105, 109, 195, 230, 587, 1617, 2375, 102]
INFO: pred: [101, 425, 1385, 348, 2779, 3703, 1902], label: [101, 105, 107, 123, 559, 716, 1376, 102, -100]
INFO: pred: [101, 425, 1385, 348, 2779, 3703, 1902], label: [101, 111, 130, 168, 183, 256, 102, -100, -100]
INFO: pred: [101, 425, 1385, 348, 2779, 3703, 1902], label: [101, 122, 142, 222, 336, 2072, 2248, 102, -100]
INFO: pred: [101, 425, 1385, 348, 2779, 3703, 1902], label: [101, 105, 147, 159, 355, 795, 102, -100, -100]
INFO: pred: [101, 425, 1385, 348, 2779, 3703, 1902], label: [101, 111, 113, 232, 261, 651, 849, 102, -100]
INFO: pred: [101, 425, 1385, 348, 2779, 3703, 1902], label: [101, 149, 150, 730, 1356, 2940, 102, -100, -100]
INFO: pred: [101, 425, 1385, 348, 2779, 3703, 1902], label: [101, 113, 179, 211, 523, 996, 1366, 102, -100]
INFO: pred: [101, 425, 1385, 348, 2779, 3703, 1902], label: [101, 154, 1002, 1040, 102, -100, -100, -100, -100]
INFO: pred: [101, 425, 1385, 348, 2779, 3703, 1902], label: [101, 254, 984, 1238, 102, -100, -100, -100, -100]
INFO: pred: [101, 425, 1385, 348, 2779, 3703, 1902], label: [101, 132, 289, 504, 730, 895, 2450, 102, -100]
INFO: pred: [101, 425, 1385, 348, 2779, 3703, 1902], label: [101, 105, 109, 137, 260, 303, 461, 888, 102]
INFO: pred: [101, 425, 1385, 348, 2779, 3703, 1902], label: [101, 107, 131, 161, 205, 259, 763, 102, -100]

I’ve been stuck on this problem for almost a week without finding a solution. I checked the model architecture and confirmed that the decoder does contain cross-attention layers. I’ve also checked the data format and the related logic; everything there works as expected, so I omitted that part for simplicity. Therefore I think the bug is on the model side, but I haven’t found anything useful in the docs or via Google.
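
For what it’s worth, this is roughly how I checked for the cross-attention layers (the attribute path assumes the RoBERTa decoder that EncoderDecoderModel builds; model is self.model from __init__):

decoder = model.get_decoder()  # RobertaForCausalLM built by EncoderDecoderModel
for i, layer in enumerate(decoder.roberta.encoder.layer):
    print(i, hasattr(layer, 'crossattention'))  # prints True for every layer in my run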

Any kind of help is appreciated. Thanks very much!

I made some more attempts and have some results. The problem is indeed caused by the model-side code.
First, I replaced the multi-label classification task with a simple auto-encoding task, i.e. feeding the model the same sequence as input and output. I also tied the word embeddings between the encoder and decoder. Neither change solved the problem.
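
The tying looked roughly like this (a sketch from memory; it assumes both sides use the same tokenizer in this auto-encoding setup):

# share one embedding module between the encoder and decoder inputs
shared = self.model.encoder.get_input_embeddings()
self.model.decoder.set_input_embeddings(shared)
self.model.decoder.tie_weights()  # re-tie the decoder LM head to the shared matrix
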
Next, I replaced the seq2seq model with a simple prefix LM (RobertaForCausalLM), trained on the same auto-encoding data. As I suspected, the problem vanished and everything works well.
So I believe there is a bug either in my code or in the transformers library.
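
For reference, the replacement looked roughly like this (a sketch; other hyperparameters omitted):

from transformers import RobertaConfig, RobertaForCausalLM

config = RobertaConfig(vocab_size=decoder_tokenizer.vocab_size,
                       is_decoder=True,  # so RobertaForCausalLM applies a causal mask
                       bos_token_id=decoder_tokenizer.bos_token_id,
                       eos_token_id=decoder_tokenizer.eos_token_id,
                       pad_token_id=decoder_tokenizer.pad_token_id)
lm = RobertaForCausalLM(config)  # trained on the same auto-encoding data; no collapse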

:sob:
Does anyone have a possible solution?