Issue with finetuning a seq-to-seq model

@danyaljj

Were you able to get things working?

Also, the T5 decoder in general seems not to print decoded special tokens even when skip_special_tokens is False. I filed a GitHub issue for this bug.


Sorry, I haven't been able to look into this more (I will next week).
But my guess is that there is a problem with the end-of-sentence special token, so the model fails to generate reasonably-lengthed sentences (potentially related to the issue that you filed).

I'm a little confused about decoder_input_ids, labels, and loss calculations. The T5 examples I'm looking at all do it slightly differently, and the documentation seems a bit unclear.

I want to train a seq2seq task that involves language generation. I have source and target "sentences" that are pre-tokenized (via batch_encode_plus). (Note that batch_encode_plus should be appending the EOS tokens to all of the inputs and targets.) The model gets an input sequence and should generate a full output sequence; I'm finetuning it on these source => target pairs. I am not using any of the extra_id sentinel tokens.
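
For concreteness, roughly the kind of data prep I mean (a minimal sketch; the texts, lengths, and dict keys are illustrative, and on older transformers versions you may need pad_to_max_length=True instead of padding=):

from transformers import T5Tokenizer

tokenizer = T5Tokenizer.from_pretrained("t5-small")

source_texts = ["a fine tale is pleasant"]
target_texts = ["a fine tale is pleasant"]

src = tokenizer.batch_encode_plus(
    source_texts, max_length=64, padding="max_length", truncation=True, return_tensors="pt")
tgt = tokenizer.batch_encode_plus(
    target_texts, max_length=64, padding="max_length", truncation=True, return_tensors="pt")

batch = {
    "source_ids": src["input_ids"],        # ends with EOS (id 1) before the pad tokens
    "source_mask": src["attention_mask"],
    "target_ids": tgt["input_ids"],
}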

First, I am setting up labels as follows:

src_ids = batch["source_ids"].to(device, dtype=torch.long)
src_mask = batch["source_mask"].to(device, dtype=torch.long)
tgt_ids = batch["target_ids"].to(device, dtype=torch.long)

# set up labels
tgt_ids[tgt_ids[:, :] == 0] = -100    # I guess could use masked_fill_()
label_ids = tgt_ids.to(device)        # redundant to send to device twice?
out_dict = model(src_ids, attention_mask=src_mask, labels=label_ids, return_dict=True)

Should I be cloning or detaching here (as seen in the example below)?

Primary question: decoder_input_ids vs labels

– When labels are given (without decoder_input_ids), as I have done above, the model will call _shift_right, which shifts the labels right (by one) and replaces any -100s with the pad token id (0);
– see _shift_right(), where we have this line:
shifted_input_ids.masked_fill_(shifted_input_ids == -100, pad_token_id)
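
A toy illustration of that shift (the token ids are made up; for T5, pad_token_id == decoder_start_token_id == 0):

import torch
from transformers import T5ForConditionalGeneration

model = T5ForConditionalGeneration.from_pretrained("t5-small")

# three content tokens, EOS (1), then pad positions already masked to -100
labels = torch.tensor([[300, 200, 125, 1, -100, -100]])
print(model._shift_right(labels))
# tensor([[  0, 300, 200, 125,   1,   0]])
# -> decoder start token (pad, id 0) prepended, everything shifted right by one,
#    and any -100 that remains after the shift is replaced by pad_token_id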

But decoder_input_ids vs labels seem to be handled differently in various linked examples. Consider

This T5 finetuning for summary generation example seems not to do the initial right shift:

y = data['target_ids'].to(device, dtype = torch.long)
y_ids = y[:, :-1].contiguous()
lm_labels = y[:, 1:].clone().detach()
lm_labels[y[:, 1:] == tokenizer.pad_token_id] = -100
ids = data['source_ids'].to(device, dtype = torch.long)
mask = data['source_mask'].to(device, dtype = torch.long)

outputs = model(input_ids = ids, attention_mask = mask, decoder_input_ids=y_ids, lm_labels=lm_labels)
loss = outputs[0]
  • Is it a problem that they don't do the right shift? (See the toy comparison below.)
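
For reference, here is how the two constructions differ on made-up target ids (assuming model is the T5ForConditionalGeneration from the sketch above):

import torch

y = torch.tensor([[300, 200, 125, 1, 0, 0]])   # target_ids straight from the tokenizer

# summary-example style: slice, no decoder start token
y_ids     = y[:, :-1].contiguous()   # [[300, 200, 125, 1, 0]]   fed to the decoder
lm_labels = y[:, 1:].clone()         # [[200, 125, 1, 0, 0]]     compared against

# _shift_right style: prepend the start token, keep the full target as labels
decoder_input_ids = model._shift_right(y)   # [[  0, 300, 200, 125, 1, 0]]
labels            = y                       # [[300, 200, 125, 1, 0, 0]]

In the sliced version the decoder never sees the start token and the first target token (300 here) is never a prediction target; with _shift_right it is.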

And finetune.py from the transformers library (in _step) does this:

src_ids, src_mask = batch["input_ids"], batch["attention_mask"]
tgt_ids = batch["labels"]
if isinstance(self.model, T5ForConditionalGeneration):
    decoder_input_ids = self.model._shift_right(tgt_ids)
else:  # for bart
    decoder_input_ids = shift_tokens_right(tgt_ids, pad_token_id)
outputs = self(src_ids, attention_mask=src_mask, decoder_input_ids=decoder_input_ids, use_cache=False)
lm_logits = outputs[0]

if self.hparams.label_smoothing == 0:
    # Same behavior as modeling_bart.py, besides ignoring pad_token_id
    ce_loss_fct = torch.nn.CrossEntropyLoss(ignore_index=pad_token_id)

    assert lm_logits.shape[-1] == self.vocab_size
    loss = ce_loss_fct(lm_logits.view(-1, lm_logits.shape[-1]), tgt_ids.view(-1))

Questions:

  • I don't get why we do the _shift_right outside of the call to self(...) rather than just letting the model do it for us (i.e. by passing labels instead of decoder_input_ids). Why do it this way?
  • How does this loss computation differ from the loss the model would compute itself during the outputs = self(...) call? If we passed the labels into self(..., labels=...), would the model's loss output be the same as the one we get here? (See the sketch below.)
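
As far as I can tell, the two should agree as long as the only -100 positions in the labels are the pad positions, since the model's internal loss is a cross-entropy with ignore_index=-100 over the same logits. A minimal sketch of the comparison (toy sentences; call model.eval() so dropout doesn't make the two passes differ):

import torch
from transformers import T5ForConditionalGeneration, T5Tokenizer

tokenizer = T5Tokenizer.from_pretrained("t5-small")
model = T5ForConditionalGeneration.from_pretrained("t5-small").eval()

src = tokenizer.batch_encode_plus(["a fine tale", "another longer example here"],
                                  padding=True, return_tensors="pt")
tgt = tokenizer.batch_encode_plus(["a fine tale", "another longer example here"],
                                  padding=True, return_tensors="pt")
tgt_ids = tgt["input_ids"]

# (a) let the model do it: pass labels with pads masked to -100
labels = tgt_ids.clone()
labels[labels == tokenizer.pad_token_id] = -100
loss_a = model(input_ids=src["input_ids"], attention_mask=src["attention_mask"],
               labels=labels, return_dict=True).loss

# (b) finetune.py style: shift outside the forward call, cross-entropy by hand
decoder_input_ids = model._shift_right(tgt_ids)
logits = model(input_ids=src["input_ids"], attention_mask=src["attention_mask"],
               decoder_input_ids=decoder_input_ids, return_dict=True).logits
# note: logits.shape[1] == tgt_ids.shape[1], i.e. logits[:, i] predicts tgt_ids[:, i]
ce_loss_fct = torch.nn.CrossEntropyLoss(ignore_index=tokenizer.pad_token_id)
loss_b = ce_loss_fct(logits.view(-1, logits.size(-1)), tgt_ids.view(-1))

print(loss_a.item(), loss_b.item())  # should match: same logits, same ignored positions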

Other questions that come to mind in all this:

  1. How do the generated lm_logits line up with the labels? Labels look like [300, 200, 125, 1, 0, 0, ...] (i.e. a sequence of 3 tokens and then EOS = 1 followed by pad = 0). But the decoder_input_ids are right shifted by one ([0, 300, 200, 125, 1, 0, 0, ...]). Does the model during the forward pass generate a <pad> token at index 0, or will it generate the first expected output token? Otherwise when we take the loss, we are comparing the lm_logits (potentially right shifted by one) to the unshifted labels.

  2. When would I want decoder_input_ids and labels to be different?

  3. Causal mask: if I donā€™t change the masks at all (i.e. I just have masking on the pad tokens for the source_ids), will anything else be hidden from the T5 model? In particular, would causal masking get applied, and what would the effect be? When would I want causal masking?

  4. The documentation for T5ForConditionalGeneration says that if decoder_input_ids are not provided then input_ids will be used. But actually the labels (shifted right) will be used?


It looks like the lm_logits do line up with the unshifted labels (which we expect for teacher forcing!). I'm still curious about the other questions.

I did manage to get a T5 model to finetune well on my specific task. I wrote my own trainer. As a very basic first task, I gave the model some input sentences of length N and just had it learn to copy the first word only (and then generate an EOS). It did appropriately learn to generate the EOS after only a few tens of thousands of training examples.
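
For anyone curious, a minimal sketch of that kind of hand-rolled trainer (the toy data, batch size, and learning rate here are placeholders, not the settings I actually used):

import torch
from torch.utils.data import DataLoader
from transformers import T5ForConditionalGeneration, T5Tokenizer

device = "cuda" if torch.cuda.is_available() else "cpu"
tokenizer = T5Tokenizer.from_pretrained("t5-small")
model = T5ForConditionalGeneration.from_pretrained("t5-small").to(device)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

# copy-the-first-word toy task
pairs = [("a fine tale is pleasant", "a"), ("another short example sentence", "another")]

def collate(batch):
    srcs, tgts = zip(*batch)
    src = tokenizer.batch_encode_plus(list(srcs), padding=True, return_tensors="pt")
    tgt = tokenizer.batch_encode_plus(list(tgts), padding=True, return_tensors="pt")
    labels = tgt["input_ids"].clone()
    labels[labels == tokenizer.pad_token_id] = -100
    return src["input_ids"], src["attention_mask"], labels

loader = DataLoader(pairs, batch_size=2, shuffle=True, collate_fn=collate)

model.train()
for epoch in range(3):
    for src_ids, src_mask, labels in loader:
        loss = model(input_ids=src_ids.to(device), attention_mask=src_mask.to(device),
                     labels=labels.to(device), return_dict=True).loss
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()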

Though it's still possible that the decoder or EOS generation is somehow an issue in the finetune.py script, in my example it was not an issue: I made calls only to batch_encode_plus, batch_decode, forward, and generate(), and the model does appropriately learn to generate the EOS tokens as token number 2.

I will try again using the finetune.py script soon with the same datasets and hyperparams and see if I can replicate results.

@jsrozner Thanks for updating the thread! I am also anxious to see if you will be able to replicate the results with finetune.py.

I ran finetune.py on a simple copy of source -> target: 5000 examples with 200 eval / 200 test. (In my own trainer, the model got 95% perfect copies including EOS tokens after 3 epochs.) I ran with the same hyperparams (LR, grad clipping, though I didn't set the Adam epsilon for finetune.py) and also for 3 epochs.

One other note about a difference: in my own test I was running for 100 epochs and observed that the model was nearly perfect after 3 epochs. But here, in order to get test predictions at epoch 3, I set max_epochs=3. So if the LR in finetune.py is, e.g., on a linear schedule with total steps proportional to 3 epochs, then that could also be an issue.

The modelā€™s loss does move toward 0.
But even after setting eval_beams=1, eval_max_gen_length=40, it still continues to generate many more tokens than it should:

For example:
"A fine tale is pleasant" (both source and target) gives
'A fine tale is pleasant. A fine story is pleasant… A Fine tale is enjoyable.!',

This happens for almost all source -> target pairs, though the model almost always gets the first N tokens right, where N is the length of the target output.

One other thing I notice: "summarize" is always being prepended to the inputs (looking at text_batch.json).

It's possible that:

  • epochs throws off the learning rate
  • generate has some other params like temperature that are affecting it
  • the "summarize" task prefix that is prepended is making the model want to fill up the whole output to eval_max_gen_length

Next test would be removing the prepended "summarize". Is there an easy way to do that for the script?

it still continues to generate many more tokens than it should

That was exactly my observation too, which led me to think that somehow the model is not learning the EOS token (hence the generation is not functioning as expected).

Re. prefix:

Looks like the prefix is set here: https://github.com/huggingface/transformers/blob/master/examples/seq2seq/utils.py#L243

which seems like it's passed here:

Where is self.model.config.prefix being picked up from? Not sure.
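
One place to look is the pretrained config itself. A quick way to inspect (and, if the script really does read it from there, override) the prefix; the exact values under task_specific_params may differ by checkpoint:

from transformers import T5ForConditionalGeneration

model = T5ForConditionalGeneration.from_pretrained("t5-small")
print(model.config.prefix)                # generation prefix, if any, in the raw config
print(model.config.task_specific_params)  # e.g. a "summarization" entry with
                                          # prefix "summarize: ", min_length, num_beams, ...

# If the script picks its prefix / generation settings up from the config,
# clearing them before training is one way to experiment without the prefix:
model.config.prefix = ""
model.config.task_specific_params = None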

For anyone else following along, see the GitHub issue notes for:

  • how to fix the issue; it seems to be with min_length in generate (see the sketch below)
  • details about the decoder fix (though it seems not to have been the problem here)
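
For reference, a sketch of a generate call that does not force a minimum length (the parameter values are illustrative, not the script's defaults):

from transformers import T5ForConditionalGeneration, T5Tokenizer

tokenizer = T5Tokenizer.from_pretrained("t5-small")
model = T5ForConditionalGeneration.from_pretrained("t5-small")

enc = tokenizer.batch_encode_plus(["A fine tale is pleasant"], return_tensors="pt")
generated = model.generate(
    input_ids=enc["input_ids"],
    attention_mask=enc["attention_mask"],
    num_beams=1,
    max_length=40,
    min_length=0,   # don't force the decoder to keep going past EOS
)
print(tokenizer.batch_decode(generated, skip_special_tokens=False))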

Here is also a simple working script to finetune T5, with intermediate generations: github


Does this solve it? How to add all standard special tokens to my tokenizer and model?