What can cause model.generate (BART) output to be gibberish after fine-tuning?

Hi @rgwatwormhill, gradients need to be zeroed for every PyTorch model, otherwise they accumulate across training steps.
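
Something like this in the training loop (here model, optimizer and dataloader are just placeholders for whatever you already have):

for batch in dataloader:
    optimizer.zero_grad()      # clear gradients left over from the previous step
    outputs = model(**batch)   # forward pass; first output is the loss when labels are in the batch
    loss = outputs[0]
    loss.backward()            # compute gradients for this batch only
    optimizer.step()           # update the weights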

Hi @hyura
For BART, or any other seq2seq model, the decoder_input_ids need to be shifted right, i.e. the decoder sequence needs to start with the decoder_start_token_id, which is usually the bos, pad, or eos token. For BART, it's eos.

This means the decoder first takes the decoder_start_token_id and is trained to produce the first token of the labels. If the decoder inputs are not shifted, then at every step the target token is the same as the input token, so the model just learns to copy whatever token it receives, which could be the reason for this weird generation.
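
A toy example of what the shift looks like (0 and 2 are BART's bos/eos ids, the other ids are made up):

labels            = [0, 50, 120, 734, 2]   # target sequence: <s> ... </s>
decoder_input_ids = [2, 0, 50, 120, 734]   # shifted right: eos wraps around to the front
# at step t the decoder reads decoder_input_ids[t] and is trained to predict labels[t]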

If you are on the master version, then there are a few helpers for preparing the data.

  • Use the prepare_seq2seq_batch method; it will return input_ids, attention_mask and labels.
  • Use modeling_bart.shift_tokens_right to prepare the decoder_input_ids.
  • Set the pad tokens in the labels to -100 so they'll be ignored by the cross-entropy loss.
from transformers import BartTokenizer
from transformers.modeling_bart import shift_tokens_right  # on master; later releases moved this module

tokenizer = BartTokenizer.from_pretrained("facebook/bart-large")

input_text = "Some input text"
output_text = "Paraphrase Text"

# enc will contain input_ids, attention_mask and labels
enc = tokenizer.prepare_seq2seq_batch(src_texts=input_text, tgt_texts=output_text, return_tensors="pt")
decoder_input_ids = shift_tokens_right(enc["labels"], tokenizer.pad_token_id)

# set pad tokens in the labels to -100 so the cross entropy loss ignores them
labels = enc["labels"]
labels[labels == tokenizer.pad_token_id] = -100
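
And roughly, the forward pass during fine-tuning would then look like this (just a sketch, assuming a BartForConditionalGeneration model, not your exact training code):

from transformers import BartForConditionalGeneration

model = BartForConditionalGeneration.from_pretrained("facebook/bart-large")
outputs = model(
    input_ids=enc["input_ids"],
    attention_mask=enc["attention_mask"],
    decoder_input_ids=decoder_input_ids,
    labels=labels,
)
loss = outputs[0]  # first element is the loss when labels are passed; the -100 positions are ignored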

Hope this helps.
cc @sshleifer
