Questions & Help
Hi, everyone. I am using transformers (v4.20.1) to build a seq2seq model for multi-label classification. However, after fine-tuning, the model always generates the same output regardless of the input. I found two related issues on GitHub, but neither seems to have a solution.
Here’s the code. The main logic is in the __init__ and training_step methods.
import logging

import pytorch_lightning as pl
import torch
from torch import nn
from transformers import (EncoderDecoderConfig, EncoderDecoderModel,
                          RobertaConfig)

# Assumed setup: the post does not show how `logger` is created.
logger = logging.getLogger(__name__)


class Model(pl.LightningModule):

    def __init__(self,
                 decoder_tokenizer,
                 lr=1e-4,
                 beam_size=1,
                 num_decoder_layers=12):
        super().__init__()
        self.pad_id = decoder_tokenizer.pad_token_id
        self.bos_id = decoder_tokenizer.bos_token_id
        self.eos_id = decoder_tokenizer.eos_token_id
        self.lr = lr
        self.beam_size = beam_size
        self.decoder_tokenizer = decoder_tokenizer
        # Only the encoder *config* comes from roberta-base; the model below
        # is built from configs, so no pretrained weights are loaded.
        encoder_config = RobertaConfig.from_pretrained('roberta-base')
        decoder_config = RobertaConfig(bos_token_id=self.bos_id,
                                       eos_token_id=self.eos_id,
                                       pad_token_id=self.pad_id)
        decoder_config.num_hidden_layers = num_decoder_layers
        self.config = EncoderDecoderConfig.from_encoder_decoder_configs(
            encoder_config, decoder_config)
        self.model = EncoderDecoderModel(self.config)
        self.decoder = self.model.get_decoder()
        self.decoder.resize_token_embeddings(decoder_tokenizer.vocab_size)
        self.model.config.vocab_size = self.model.config.decoder.vocab_size
        # Called without an argument, resize_token_embeddings() returns the
        # current input embedding module without resizing it.
        nn.init.xavier_uniform_(self.decoder.resize_token_embeddings().weight)
        self.model.config.decoder_start_token_id = self.bos_id
        self.model.config.pad_token_id = self.pad_id

    def training_step(self, batch, batch_idx):
        '''batch is a dict containing:
            input_ids: ids for the input sequence
            attention_mask: mask for the input sequence
            labels: ids for the output sequence
        '''
        self.model.train()
        loss = self.model(**batch).loss
        self.log("train_loss",
                 loss,
                 on_step=True,
                 on_epoch=True,
                 prog_bar=True,
                 logger=True)
        # Generate on the same batch to inspect predictions during training.
        self.model.eval()
        with torch.no_grad():
            predictions = self.model.generate(
                input_ids=batch['input_ids'],
                attention_mask=batch['attention_mask'],
                num_beams=self.beam_size,
                min_length=7,
                max_length=7,
                no_repeat_ngram_size=1,
                do_sample=False).cpu().numpy().tolist()
        labels = batch['labels'].cpu().numpy().tolist()
        for pred, label in zip(predictions, labels):
            logger.info(f'pred: {pred}, label: {label}')
        return loss

    def configure_optimizers(self):
        optimizer = torch.optim.AdamW(self.parameters(), lr=self.lr)
        return optimizer
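
A side note on the generate() arguments used in training_step: with min_length=7 and max_length=7 every prediction is forced to exactly 7 ids, and no_repeat_ngram_size=1 forbids any id from appearing twice in a prediction. The sketch below only checks those two properties on a plain list of ids. The helper name is hypothetical and not part of transformers, and the sample is one of the predictions logged further down in this post.

```python
def satisfies_generation_constraints(pred, length=7):
    """Return True if pred has exactly `length` ids and no id occurs twice."""
    return len(pred) == length and len(set(pred)) == len(pred)


# One prediction copied from the training log in this post.
logged_pred = [101, 425, 1385, 348, 2779, 3703, 1902]
print(satisfies_generation_constraints(logged_pred))
```

This also explains why the predictions always have 7 ids while the labels vary in length.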
The symptoms are:
- At the beginning of training, the sanity check runs. I sampled some generation results (shown below); they show that the freshly initialized model is able to generate different predictions.
INFO: idx: 0, pred: [101, 1595, 1438, 985, 3304, 3195, 800], label: [101, 122, 153, 174, 1161, 1618, 102, -100, -100, -100]
INFO: idx: 6, pred: [101, 1595, 1438, 985, 3304, 3195, 800], label: [101, 338, 498, 587, 2905, 102, -100, -100, -100, -100]
INFO: idx: 7, pred: [101, 1595, 1438, 985, 3304, 3195, 800], label: [101, 109, 112, 143, 164, 278, 973, 102, -100, -100]
INFO: idx: 9, pred: [101, 1595, 1438, 985, 3304, 3195, 800], label: [101, 109, 112, 116, 137, 174, 260, 102, -100, -100]
INFO: idx: 10, pred: [101, 1595, 1438, 1135, 3886, 3698, 1406], label: [101, 107, 115, 119, 123, 310, 431, 102, -100, -100]
- After just one update step of fine-tuning, the model starts generating the same result for every input, both training samples and unseen validation samples, and this continues until the end of training (30 epochs).
INFO: pred: [101, 425, 1385, 348, 2779, 3703, 1902], label: [101, 107, 119, 123, 168, 243, 306, 102, -100]
INFO: pred: [101, 425, 1385, 348, 2779, 3703, 1902], label: [101, 105, 109, 195, 230, 587, 1617, 2375, 102]
INFO: pred: [101, 425, 1385, 348, 2779, 3703, 1902], label: [101, 105, 107, 123, 559, 716, 1376, 102, -100]
INFO: pred: [101, 425, 1385, 348, 2779, 3703, 1902], label: [101, 111, 130, 168, 183, 256, 102, -100, -100]
INFO: pred: [101, 425, 1385, 348, 2779, 3703, 1902], label: [101, 122, 142, 222, 336, 2072, 2248, 102, -100]
INFO: pred: [101, 425, 1385, 348, 2779, 3703, 1902], label: [101, 105, 147, 159, 355, 795, 102, -100, -100]
INFO: pred: [101, 425, 1385, 348, 2779, 3703, 1902], label: [101, 111, 113, 232, 261, 651, 849, 102, -100]
INFO: pred: [101, 425, 1385, 348, 2779, 3703, 1902], label: [101, 149, 150, 730, 1356, 2940, 102, -100, -100]
INFO: pred: [101, 425, 1385, 348, 2779, 3703, 1902], label: [101, 113, 179, 211, 523, 996, 1366, 102, -100]
INFO: pred: [101, 425, 1385, 348, 2779, 3703, 1902], label: [101, 154, 1002, 1040, 102, -100, -100, -100, -100]
INFO: pred: [101, 425, 1385, 348, 2779, 3703, 1902], label: [101, 254, 984, 1238, 102, -100, -100, -100, -100]
INFO: pred: [101, 425, 1385, 348, 2779, 3703, 1902], label: [101, 132, 289, 504, 730, 895, 2450, 102, -100]
INFO: pred: [101, 425, 1385, 348, 2779, 3703, 1902], label: [101, 105, 109, 137, 260, 303, 461, 888, 102]
INFO: pred: [101, 425, 1385, 348, 2779, 3703, 1902], label: [101, 107, 131, 161, 205, 259, 763, 102, -100]
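
To put a number on the collapse, here is a minimal sketch that counts how many distinct id sequences occur among the logged predictions. The helper name is hypothetical, and the id lists are copied from the two logs above (the five sanity-check samples and the 14 post-update samples).

```python
from collections import Counter


def count_distinct_predictions(predictions):
    """Map each distinct predicted id sequence to how often it occurs."""
    return Counter(tuple(pred) for pred in predictions)


# Sanity-check samples (idx 0, 6, 7, 9, 10) from the first log.
sanity_preds = [
    [101, 1595, 1438, 985, 3304, 3195, 800],
    [101, 1595, 1438, 985, 3304, 3195, 800],
    [101, 1595, 1438, 985, 3304, 3195, 800],
    [101, 1595, 1438, 985, 3304, 3195, 800],
    [101, 1595, 1438, 1135, 3886, 3698, 1406],
]
# All 14 post-update samples in the second log carry the same prediction.
after_update_preds = [[101, 425, 1385, 348, 2779, 3703, 1902]] * 14

sanity_counts = count_distinct_predictions(sanity_preds)
after_counts = count_distinct_predictions(after_update_preds)
print(len(sanity_counts), len(after_counts))  # prints: 2 1
```

On these samples the sanity check yields 2 distinct sequences out of 5 (four of the five are already identical), and the post-update log yields 1 out of 14. In training_step the same count could be applied to the `predictions` list for each batch.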
I’ve been stuck on this problem for almost a week without finding a solution. I checked the model architecture and confirmed that the decoder has cross-attention layers. I also checked the data format and the related logic; everything works as expected, so I omitted that part for simplicity. I therefore think the bug is on the model side, but I haven’t found anything useful in the docs or through Google.
Any kind of help is appreciated. Thanks very much!