Using BART as an example, calling this:
tokenizer.prepare_seq2seq_batch(
    src_texts=['This is a very short text', 'This is shorter'],
    tgt_texts=['very short', 'much shorter than very short'])
returns:
{'input_ids': [[100, 19, 3, 9, 182, 710, 1499, 1], [947, 19, 10951, 1, 0, 0, 0, 0]],
'attention_mask': [[1, 1, 1, 1, 1, 1, 1, 1], [1, 1, 1, 1, 0, 0, 0, 0]],
'labels': [[182, 710, 1, 0, 0, 0], [231, 10951, 145, 182, 710, 1]]}
For fine-tuning, how should we build the decoder_input_ids? And do we also need to shift the labels one position to the left, so that they look like this?
[[710, 1, 0, 0, 0, 0], [10951, 145, 182, 710, 1, 0]]
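From skimming the BART modeling code, my current understanding is that decoder_input_ids are produced by right-shifting the labels, roughly like transformers' shift_tokens_right helper. Here is a minimal sketch of that operation as I read it (the exact decoder_start_token_id depends on the model config, so treat that as an assumption):

import torch

def shift_tokens_right(labels: torch.Tensor, pad_token_id: int,
                       decoder_start_token_id: int) -> torch.Tensor:
    # Sketch of the right-shift BART-style models apply to build
    # decoder_input_ids from labels: prepend the decoder start token
    # and drop the last label token.
    decoder_input_ids = labels.new_zeros(labels.shape)
    decoder_input_ids[:, 1:] = labels[:, :-1].clone()
    decoder_input_ids[:, 0] = decoder_start_token_id
    # Positions masked with -100 in the labels are restored to pad tokens,
    # since -100 is only meaningful to the loss, not to the decoder input.
    decoder_input_ids.masked_fill_(decoder_input_ids == -100, pad_token_id)
    return decoder_input_ids

If that reading is right, the labels themselves would stay untouched, and recent versions of BartForConditionalGeneration appear to build decoder_input_ids from labels this way automatically when decoder_input_ids is not passed. But I would appreciate confirmation that this is the intended setup.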