Issue with finetuning a seq-to-seq model

@danyaljj

Were you able to get things working?

Also, the T5 decoder in general seems not to print decoded special tokens even when skip_special_tokens is False. I filed a GitHub issue for this bug.


Sorry, I haven't been able to look into this more (I will next week).
But my guess is that there is a problem with the end-of-sentence special token, so the model fails to generate reasonably-lengthed sentences (potentially related to the issue that you filed).

I'm a little confused about decoder_input_ids, labels, and loss calculations. The T5 examples I'm looking at all do it slightly differently, and the documentation seems a bit unclear.

I want to train a seq2seq task that involves language generation. I have source and target "sentences" that are pre-tokenized (via batch_encode_plus). (Note that batch_encode_plus should be appending the EOS tokens to all of the inputs and targets.) The model gets an input sequence and should generate a full output sequence; I'm finetuning it on these source => target pairs. I am not using any of the extra_id sentinel tokens.
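
For concreteness, roughly the kind of data prep I mean (a minimal sketch; the texts, lengths, and dict keys are illustrative, and on older transformers versions you may need pad_to_max_length=True instead of padding=):

from transformers import T5Tokenizer

tokenizer = T5Tokenizer.from_pretrained("t5-small")

source_texts = ["a fine tale is pleasant"]
target_texts = ["a fine tale is pleasant"]

src = tokenizer.batch_encode_plus(
    source_texts, max_length=64, padding="max_length", truncation=True, return_tensors="pt")
tgt = tokenizer.batch_encode_plus(
    target_texts, max_length=64, padding="max_length", truncation=True, return_tensors="pt")

batch = {
    "source_ids": src["input_ids"],        # ends with EOS (id 1) before the pad tokens
    "source_mask": src["attention_mask"],
    "target_ids": tgt["input_ids"],
}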

First, I am setting up labels as follows:

src_ids = batch["source_ids"].to(device, dtype=torch.long)
src_mask = batch["source_mask"].to(device, dtype=torch.long)
tgt_ids = batch["target_ids"].to(device, dtype=torch.long)

# set up labels
tgt_ids[tgt_ids[:, :] == 0] = -100    # I guess could use masked_fill_()
label_ids = tgt_ids.to(device)        # redundant to send to device twice?
out_dict = model(src_ids, attention_mask=src_mask, labels=label_ids, return_dict=True)

Should I be cloning or detaching here (as seen in the example below)?

Primary question: decoder_input_ids vs labels

– When labels are given (without decoder_input_ids), as I have done above, the model will call _shift_right, which shifts the labels right (by one) and replaces any -100s with the pad token id (0);
– see _shift_right(), where we have this line:
shifted_input_ids.masked_fill_(shifted_input_ids == -100, pad_token_id)
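
A toy illustration of that shift (the token ids are made up; for T5, pad_token_id == decoder_start_token_id == 0):

import torch
from transformers import T5ForConditionalGeneration

model = T5ForConditionalGeneration.from_pretrained("t5-small")

# three content tokens, EOS (1), then pad positions already masked to -100
labels = torch.tensor([[300, 200, 125, 1, -100, -100]])
print(model._shift_right(labels))
# tensor([[  0, 300, 200, 125,   1,   0]])
# -> decoder start token (pad, id 0) prepended, everything shifted right by one,
#    and any -100 that remains after the shift is replaced by pad_token_id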

But decoder_input_ids vs labels seem to be handled differently in various linked examples. Consider

This T5 finetuning for summary generation example seems not to do the initial right shift:

y = data['target_ids'].to(device, dtype = torch.long)
y_ids = y[:, :-1].contiguous()
lm_labels = y[:, 1:].clone().detach()
lm_labels[y[:, 1:] == tokenizer.pad_token_id] = -100
ids = data['source_ids'].to(device, dtype = torch.long)
mask = data['source_mask'].to(device, dtype = torch.long)

outputs = model(input_ids = ids, attention_mask = mask, decoder_input_ids=y_ids, lm_labels=lm_labels)
loss = outputs[0]
  • Is it a problem that they don't do the right shift? (See the toy comparison below.)
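
For reference, here is how the two constructions differ on made-up target ids (assuming model is the T5ForConditionalGeneration from the sketch above):

import torch

y = torch.tensor([[300, 200, 125, 1, 0, 0]])   # target_ids straight from the tokenizer

# summary-example style: slice, no decoder start token
y_ids     = y[:, :-1].contiguous()   # [[300, 200, 125, 1, 0]]   fed to the decoder
lm_labels = y[:, 1:].clone()         # [[200, 125, 1, 0, 0]]     compared against

# _shift_right style: prepend the start token, keep the full target as labels
decoder_input_ids = model._shift_right(y)   # [[  0, 300, 200, 125, 1, 0]]
labels            = y                       # [[300, 200, 125, 1, 0, 0]]

In the sliced version the decoder never sees the start token and the first target token (300 here) is never a prediction target; with _shift_right it is.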

And finetune.py from the transformers library (in _step) does this:

src_ids, src_mask = batch["input_ids"], batch["attention_mask"]
tgt_ids = batch["labels"]
if isinstance(self.model, T5ForConditionalGeneration):
    decoder_input_ids = self.model._shift_right(tgt_ids)
else:  # for bart
    decoder_input_ids = shift_tokens_right(tgt_ids, pad_token_id)
outputs = self(src_ids, attention_mask=src_mask, decoder_input_ids=decoder_input_ids, use_cache=False)
lm_logits = outputs[0]

if self.hparams.label_smoothing == 0:
    # Same behavior as modeling_bart.py, besides ignoring pad_token_id
    ce_loss_fct = torch.nn.CrossEntropyLoss(ignore_index=pad_token_id)

    assert lm_logits.shape[-1] == self.vocab_size
    loss = ce_loss_fct(lm_logits.view(-1, lm_logits.shape[-1]), tgt_ids.view(-1))

Questions:

  • I don't get why we do the _shift_right outside of the call to self(...) rather than just letting the model do it for us (i.e. by passing labels instead of decoder_input_ids). Why do it this way?
  • How does this loss computation differ from the loss the model would compute itself during the outputs = self(...) call? If we passed the labels into self(..., labels=...), would the model's loss output be the same as the one we get here? (See the sketch below.)
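
As far as I can tell, the two should agree as long as the only -100 positions in the labels are the pad positions, since the model's internal loss is a cross-entropy with ignore_index=-100 over the same logits. A minimal sketch of the comparison (toy sentences; call model.eval() so dropout doesn't make the two passes differ):

import torch
from transformers import T5ForConditionalGeneration, T5Tokenizer

tokenizer = T5Tokenizer.from_pretrained("t5-small")
model = T5ForConditionalGeneration.from_pretrained("t5-small").eval()

src = tokenizer.batch_encode_plus(["a fine tale", "another longer example here"],
                                  padding=True, return_tensors="pt")
tgt = tokenizer.batch_encode_plus(["a fine tale", "another longer example here"],
                                  padding=True, return_tensors="pt")
tgt_ids = tgt["input_ids"]

# (a) let the model do it: pass labels with pads masked to -100
labels = tgt_ids.clone()
labels[labels == tokenizer.pad_token_id] = -100
loss_a = model(input_ids=src["input_ids"], attention_mask=src["attention_mask"],
               labels=labels, return_dict=True).loss

# (b) finetune.py style: shift outside the forward call, cross-entropy by hand
decoder_input_ids = model._shift_right(tgt_ids)
logits = model(input_ids=src["input_ids"], attention_mask=src["attention_mask"],
               decoder_input_ids=decoder_input_ids, return_dict=True).logits
# note: logits.shape[1] == tgt_ids.shape[1], i.e. logits[:, i] predicts tgt_ids[:, i]
ce_loss_fct = torch.nn.CrossEntropyLoss(ignore_index=tokenizer.pad_token_id)
loss_b = ce_loss_fct(logits.view(-1, logits.size(-1)), tgt_ids.view(-1))

print(loss_a.item(), loss_b.item())  # should match: same logits, same ignored positions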

Other questions that come to mind in all this:

  1. How do the generated lm_logits line up with the labels? Labels look like [300, 200, 125, 1, 0, 0, ...] (i.e. a sequence of 3 tokens and then EOS = 1 followed by pad = 0). But the decoder_input_ids are right shifted by one ([0, 300, 200, 125, 1, 0, 0, ...]). Does the model during the forward pass generate a <pad> token at index 0, or will it generate the first expected output token? Otherwise when we take the loss, we are comparing the lm_logits (potentially right shifted by one) to the unshifted labels.

  2. When would I want decoder_input_ids and labels to be different?

  3. Causal mask: if I donā€™t change the masks at all (i.e. I just have masking on the pad tokens for the source_ids), will anything else be hidden from the T5 model? In particular, would causal masking get applied, and what would the effect be? When would I want causal masking?

  4. The documentation for T5ForConditionalGeneration says that if decoder_input_ids are not provided then input_ids will be used. But actually the labels (shifted right) will be used?


It looks like the lm_logits do line up with the unshifted labels (which we expect for teacher forcing!). I'm still curious about the other questions.

I did manage to get a T5 model to finetune well on my specific task. I wrote my own trainer. As a very basic first task, I gave the model some input sentences of length N and just had it learn to copy the first word only (and then generate an EOS). It did appropriately learn to generate the EOS after only a few tens of thousands of training examples.
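
For anyone curious, a minimal sketch of that kind of hand-rolled trainer (the toy data, batch size, and learning rate here are placeholders, not the settings I actually used):

import torch
from torch.utils.data import DataLoader
from transformers import T5ForConditionalGeneration, T5Tokenizer

device = "cuda" if torch.cuda.is_available() else "cpu"
tokenizer = T5Tokenizer.from_pretrained("t5-small")
model = T5ForConditionalGeneration.from_pretrained("t5-small").to(device)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

# copy-the-first-word toy task
pairs = [("a fine tale is pleasant", "a"), ("another short example sentence", "another")]

def collate(batch):
    srcs, tgts = zip(*batch)
    src = tokenizer.batch_encode_plus(list(srcs), padding=True, return_tensors="pt")
    tgt = tokenizer.batch_encode_plus(list(tgts), padding=True, return_tensors="pt")
    labels = tgt["input_ids"].clone()
    labels[labels == tokenizer.pad_token_id] = -100
    return src["input_ids"], src["attention_mask"], labels

loader = DataLoader(pairs, batch_size=2, shuffle=True, collate_fn=collate)

model.train()
for epoch in range(3):
    for src_ids, src_mask, labels in loader:
        loss = model(input_ids=src_ids.to(device), attention_mask=src_mask.to(device),
                     labels=labels.to(device), return_dict=True).loss
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()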

Though it's still possible that the decoder or EOS generation is somehow an issue in the finetune.py script, in my example it was not an issue: I made calls only to batch_encode_plus, batch_decode, forward, and generate(), and the model does appropriately learn to generate the EOS tokens as token number 2.

I will try again using the finetune.py script soon with the same datasets and hyperparams and see if I can replicate results.

@jsrozner Thanks for updating the thread! I am also anxious to see if you will be able to replicate the results with finetune.py.

I ran finetune.py on a simple copy of source -> target: 5000 examples with 200 eval / 200 test. (In my own trainer, the model got 95% perfect copies including EOS tokens after 3 epochs.) I ran with the same hyperparams (LR, grad clipping, though I didn't set the Adam epsilon for finetune.py) and also for 3 epochs.

One other note about a difference: in my own test I was running for 100 epochs and observed that the model was nearly perfect after 3 epochs. But here, in order to get test predictions at epoch 3, I set max_epochs=3. So if the LR in finetune.py is, e.g., on a linear schedule with total steps proportional to 3 epochs, then that could also be an issue.

The modelā€™s loss does move toward 0.
But even after setting eval_beams=1, eval_max_gen_length=40, it still continues to generate many more tokens than it should:

For example:
"A fine tale is pleasant" (both source and target) gives
'A fine tale is pleasant. A fine story is pleasant… A Fine tale is enjoyable.!',

This happens for almost all source -> target pairs, though the model almost always gets the first N tokens right, where N is the length of the target output.

One other thing I notice: "summarize" is always being prepended to the inputs (looking at text_batch.json).

It's possible that:

  • epochs throws off the learning rate
  • generate has some other params like temperature that are affecting it
  • the "summarize" task prefix that is prepended is making the model want to fill up the whole output to eval_max_gen_length

Next test would be removing the prepended "summarize". Is there an easy way to do that for the script?

it still continues to generate many more tokens than it should

That was exactly my observation too, which led me to think that somehow the model is not learning the EOS token (hence the generation is not functioning as expected).

Re. prefix:

Looks like the prefix is set here: https://github.com/huggingface/transformers/blob/master/examples/seq2seq/utils.py#L243

which seems like it's passed here:

Where is self.model.config.prefix being picked up from? Not sure.
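
One place to look is the pretrained config itself. A quick way to inspect (and, if the script really does read it from there, override) the prefix; the exact values under task_specific_params may differ by checkpoint:

from transformers import T5ForConditionalGeneration

model = T5ForConditionalGeneration.from_pretrained("t5-small")
print(model.config.prefix)                # generation prefix, if any, in the raw config
print(model.config.task_specific_params)  # e.g. a "summarization" entry with
                                          # prefix "summarize: ", min_length, num_beams, ...

# If the script picks its prefix / generation settings up from the config,
# clearing them before training is one way to experiment without the prefix:
model.config.prefix = ""
model.config.task_specific_params = None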

For anyone else following along, see the GitHub issue notes for:

  • how to fix the issue; it seems to be with min_length in generate (see the sketch below)
  • details about the decoder fix (though it seems not to have been the problem here)
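
For reference, a sketch of a generate call that does not force a minimum length (the parameter values are illustrative, not the script's defaults):

from transformers import T5ForConditionalGeneration, T5Tokenizer

tokenizer = T5Tokenizer.from_pretrained("t5-small")
model = T5ForConditionalGeneration.from_pretrained("t5-small")

enc = tokenizer.batch_encode_plus(["A fine tale is pleasant"], return_tensors="pt")
generated = model.generate(
    input_ids=enc["input_ids"],
    attention_mask=enc["attention_mask"],
    num_beams=1,
    max_length=40,
    min_length=0,   # don't force the decoder to keep going past EOS
)
print(tokenizer.batch_decode(generated, skip_special_tokens=False))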

Here is also a simple working script to finetune T5, with intermediate generations: github


Does this solve it? How to add all standard special tokens to my tokenizer and model?