Hi @rgwatwormhill, gradients need to be zeroed for every PyTorch model, otherwise they accumulate.
Hi @hyura
For Bart or any other seq2seq model, the decoder_input_ids need to be shifted right, i.e. the decoder sequence needs to start with decoder_start_token_id, which is usually the bos, pad, or eos token. For Bart, it’s eos.
This means the decoder first takes the decoder_start_token_id and produces the first token in the labels. If the inputs are not shifted, the decoder is just copying whatever token it receives at each step, which could be the reason for this weird generation.
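To make the shift concrete, here is a minimal pure-Python sketch. It is not the library helper: `shift_right` is a hypothetical function, the token ids are made up for illustration (with 0 and 2 assumed to be bos and eos), and padding is not handled.

```python
def shift_right(labels, decoder_start_token_id):
    """Prepend the start token and drop the last label (illustrative only)."""
    return [decoder_start_token_id] + labels[:-1]


EOS = 2  # assumed eos id, used here as decoder_start_token_id
labels = [0, 8774, 232, 2]  # made-up ids: bos, two word tokens, eos

decoder_input_ids = shift_right(labels, EOS)

# At each step the decoder sees decoder_input_ids[i] and is trained to
# predict labels[i], i.e. the token one position ahead of its input.
for step, (seen, target) in enumerate(zip(decoder_input_ids, labels)):
    print(f"step {step}: input={seen} -> target={target}")
```

Without the shift, input and target would be identical at every step, so the model could minimise the loss by copying its input.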
If you are on the master version, there are a few helpers for preparing the data.
- Use the prepare_seq2seq_batch method; this will return input_ids, attention_mask and labels.
- Use modeling_bart.shift_tokens_right to prepare the decoder_input_ids.
- Set the pad tokens in the labels to -100 so they’ll be ignored by the cross entropy loss.
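from transformers.modeling_bart import shift_tokens_right

input_text = "Some input text"
output_text = "Paraphrase Text"
# enc will contain input_ids, attention_mask and labels
enc = tokenizer.prepare_seq2seq_batch(src_texts=input_text, tgt_texts=output_text, return_tensors="pt")
# build the decoder inputs from the labels before the pad tokens are masked
decoder_input_ids = shift_tokens_right(enc["labels"], tokenizer.pad_token_id)
labels = enc["labels"]
# replace pad tokens with -100 so the loss ignores them
labels[labels == tokenizer.pad_token_id] = -100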
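As a rough illustration of why the -100 matters, the sketch below recomputes a mean cross entropy by hand and skips positions labelled -100, which is the default ignore_index of PyTorch’s cross entropy loss. Everything here is hypothetical: `mask_pad` and `mean_nll` are not library functions, and the vocabulary, pad id and probabilities are invented.

```python
import math

IGNORE_INDEX = -100  # default ignore_index of PyTorch's cross entropy loss
PAD = 1              # assumed pad id for this toy example


def mask_pad(labels, pad_token_id):
    """Replace pad tokens with IGNORE_INDEX (illustrative helper)."""
    return [IGNORE_INDEX if t == pad_token_id else t for t in labels]


def mean_nll(probs, labels):
    """Mean negative log-likelihood over positions not labelled IGNORE_INDEX."""
    kept = [(p, y) for p, y in zip(probs, labels) if y != IGNORE_INDEX]
    return -sum(math.log(p[y]) for p, y in kept) / len(kept)


# Toy model output: one probability distribution over 4 tokens per position.
probs = [
    [0.1, 0.1, 0.3, 0.5],      # real token, label 3
    [0.1, 0.1, 0.5, 0.3],      # real token, label 2
    [0.25, 0.25, 0.25, 0.25],  # padding position
    [0.25, 0.25, 0.25, 0.25],  # padding position
]
labels = [3, 2, PAD, PAD]

unmasked_loss = mean_nll(probs, labels)               # padding counted
masked_loss = mean_nll(probs, mask_pad(labels, PAD))  # padding ignored
print(unmasked_loss, masked_loss)
```

With masking, the loss reflects only the two real tokens; without it, the padding positions are averaged in and distort the training signal.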
Hope this helps.
cc @sshleifer