Is there a way to return the "decoder_input_ids" from "tokenizer.prepare_seq2seq_batch"?

Using BART as an example … this:

tokenizer.prepare_seq2seq_batch(
    src_texts=['This is a very short text', 'This is shorter'],
    tgt_texts=['very short', 'much shorter than very short'])

returns …

{'input_ids': [[100, 19, 3, 9, 182, 710, 1499, 1], [947, 19, 10951, 1, 0, 0, 0, 0]], 
'attention_mask': [[1, 1, 1, 1, 1, 1, 1, 1], [1, 1, 1, 1, 0, 0, 0, 0]], 
'labels': [[182, 710, 1, 0, 0, 0], [231, 10951, 145, 182, 710, 1]]}

For fine-tuning, how should we build the decoder_input_ids? And do we also need to shift the labels to the right so that they look like this?

[[710, 1, 0, 0, 0, 0], [10951, 145, 182, 710, 1, 0]]

Or do we even have to pass decoder_input_ids anymore???

Looking at this example for MT5, it looks like the answer is “no” …

from transformers import MT5ForConditionalGeneration, T5Tokenizer
model = MT5ForConditionalGeneration.from_pretrained("google/mt5-small")
tokenizer = T5Tokenizer.from_pretrained("google/mt5-small")
article = "UN Offizier sagt, dass weiter verhandelt werden muss in Syrien."
summary = "Weiter Verhandlung in Syrien."
batch = tokenizer.prepare_seq2seq_batch(src_texts=[article], tgt_texts=[summary], return_tensors="pt")
outputs = model(**batch)
loss = outputs.loss

It sure would make things easier if all we had to pass in were the “labels”, without having to deal with the decoder_input_ids ourselves when working with the ConditionalGeneration models. Please let me know either way.

Thanks

Hi @wgpubs,

If you just pass labels, the decoder_input_ids are prepared inside the model by shifting the labels to the right. See, for example, the shift_tokens_right helper in BART's modeling code.
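
For reference, here is a minimal sketch of what that shift looks like, assuming a BART-style setup where the sequence starts with decoder_start_token_id and any -100 entries in the labels are mapped back to the pad token. It is an illustration, not the library's own helper, and the ids below are placeholders rather than values from a specific checkpoint:

import torch

def shift_labels_right(labels, pad_token_id, decoder_start_token_id):
    # illustrative re-implementation of the internal shifting, not the library function itself
    decoder_input_ids = labels.new_zeros(labels.shape)
    decoder_input_ids[:, 1:] = labels[:, :-1].clone()
    decoder_input_ids[:, 0] = decoder_start_token_id
    # -100 only matters for the loss, so map it back to the pad token here
    decoder_input_ids.masked_fill_(decoder_input_ids == -100, pad_token_id)
    return decoder_input_ids

labels = torch.tensor([[182, 710, 1, -100, -100, -100]])
print(shift_labels_right(labels, pad_token_id=0, decoder_start_token_id=2))
# tensor([[  2, 182, 710,   1,   0,   0]])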


Ah cool … thanks for confirming.

Are the target tokens (the labels) replaced with the ignore token id somewhere as well? It doesn’t look like it from what I can see … so I’m assuming we need to do that ourselves and pass the label ids with the padding tokens set to -100.

Also, the decoder_input_ids come back in the form <eos> <bos> X ..., but my understanding was always that they should start with <bos>, with the labels shifted so that <bos> predicts X[0] and so forth.

Yes, we should manually replace the pad token with -100 in the labels.
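
A minimal sketch of that replacement, assuming the batch is returned as PyTorch tensors (return_tensors="pt") by the tokenizer from the question:

batch = tokenizer.prepare_seq2seq_batch(
    src_texts=['This is a very short text', 'This is shorter'],
    tgt_texts=['very short', 'much shorter than very short'],
    return_tensors="pt")
labels = batch["labels"]
# padding should not contribute to the loss, so set it to the ignore index
labels[labels == tokenizer.pad_token_id] = -100
batch["labels"] = labels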

Ideally, yes, it should start with the bos token, but in the original fairseq implementation the models are trained with <eos> <bos> X ..., so we have kept it like that for reproducibility.
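
A quick way to see this from the config (using facebook/bart-large as an assumed example):

from transformers import BartConfig

config = BartConfig.from_pretrained("facebook/bart-large")
print(config.bos_token_id, config.eos_token_id, config.decoder_start_token_id)
# 0 2 2  -> the decoder starts from the eos token id, not bos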