Is there a way to return the "decoder_input_ids" from "tokenizer.prepare_seq2seq_batch"?

Using BART as an example … this:

tokenizer.prepare_seq2seq_batch(
    src_texts=['This is a very short text', 'This is shorter'],
    tgt_texts=['very short', 'much shorter than very short'])

returns …

{'input_ids': [[100, 19, 3, 9, 182, 710, 1499, 1], [947, 19, 10951, 1, 0, 0, 0, 0]], 
'attention_mask': [[1, 1, 1, 1, 1, 1, 1, 1], [1, 1, 1, 1, 0, 0, 0, 0]], 
'labels': [[182, 710, 1, 0, 0, 0], [231, 10951, 145, 182, 710, 1]]}

For fine-tuning, how should we build the decoder_input_ids? And do we also need to shift the labels to the right so that they look like this?

[[710, 1, 0, 0, 0, 0], [10951, 145, 182, 710, 1, 0]]

Or do we even have to pass decoder_input_ids anymore???

Looking at this example for MT5, it looks like the answer is “no” …

from transformers import MT5ForConditionalGeneration, T5Tokenizer
model = MT5ForConditionalGeneration.from_pretrained("google/mt5-small")
tokenizer = T5Tokenizer.from_pretrained("google/mt5-small")
article = "UN Offizier sagt, dass weiter verhandelt werden muss in Syrien."
summary = "Weiter Verhandlung in Syrien."
batch = tokenizer.prepare_seq2seq_batch(src_texts=[article], tgt_texts=[summary], return_tensors="pt")
outputs = model(**batch)
loss = outputs.loss

It sure would make things easier if all we had to pass in were the “labels”, without having to deal with the decoder_input_ids ourselves when working with the ConditionalGeneration models. Please let me know either way.

Thanks

Hi @wgpubs,

If you just pass labels, the decoder_input_ids are prepared inside the model by shifting the labels to the right. See, for example, the shift_tokens_right helper in BART's modeling code.
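
For reference, here is a minimal sketch of what that shift looks like, assuming a BART-style setup where the sequence starts with decoder_start_token_id and any -100 entries in the labels are mapped back to the pad token. It is an illustration, not the library's own helper, and the ids below are placeholders rather than values from a specific checkpoint:

import torch

def shift_labels_right(labels, pad_token_id, decoder_start_token_id):
    # illustrative re-implementation of the internal shifting, not the library function itself
    decoder_input_ids = labels.new_zeros(labels.shape)
    decoder_input_ids[:, 1:] = labels[:, :-1].clone()
    decoder_input_ids[:, 0] = decoder_start_token_id
    # -100 only matters for the loss, so map it back to the pad token here
    decoder_input_ids.masked_fill_(decoder_input_ids == -100, pad_token_id)
    return decoder_input_ids

labels = torch.tensor([[182, 710, 1, -100, -100, -100]])
print(shift_labels_right(labels, pad_token_id=0, decoder_start_token_id=2))
# tensor([[  2, 182, 710,   1,   0,   0]])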


Ah cool … thanks for confirming.

Are the target tokens (the labels) replaced with the ignore token id somewhere as well? It doesn’t look like it from what I can see … so I’m assuming we need to do that ourselves and pass the label ids with the padding tokens set to -100.

Also, the decoder_input_ids come back in the form <eos> <bos> X ..., but my understanding was always that they should start with <bos>, with the labels shifted so that <bos> predicts X[0] and so forth.

Yes, we should manually replace the pad token with -100 in the labels.
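
A minimal sketch of that replacement, assuming the batch is returned as PyTorch tensors (return_tensors="pt") by the tokenizer from the question:

batch = tokenizer.prepare_seq2seq_batch(
    src_texts=['This is a very short text', 'This is shorter'],
    tgt_texts=['very short', 'much shorter than very short'],
    return_tensors="pt")
labels = batch["labels"]
# padding should not contribute to the loss, so set it to the ignore index
labels[labels == tokenizer.pad_token_id] = -100
batch["labels"] = labels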

Ideally, yes, it should start with the bos token, but in the original fairseq implementation the models are trained with <eos> <bos> X ..., so we have kept it like that for reproducibility.
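
A quick way to see this from the config (using facebook/bart-large as an assumed example):

from transformers import BartConfig

config = BartConfig.from_pretrained("facebook/bart-large")
print(config.bos_token_id, config.eos_token_id, config.decoder_start_token_id)
# 0 2 2  -> the decoder starts from the eos token id, not bos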