I was wondering why the length of output_scores in the output of generate() is always one shorter than max_length? This seems inconsistent with the documentation: https://huggingface.co/transformers/internal/generation_utils.html#transformers.generation_utils.BeamSampleEncoderDecoderOutput

I found that the scores in the output of the generate() function, when setting output_scores=True, form a (max_length-1,)-shaped tuple (or a shorter one, when an eos_token_id is generated early), with each element of shape (batch_size*num_beams, config.vocab_size).
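For reference, this is roughly the check I ran (a minimal sketch using beam sampling with facebook/bart-large; the all-zeros input_ids is just a placeholder):

import torch
from transformers import AutoModelForSeq2SeqLM

model = AutoModelForSeq2SeqLM.from_pretrained('facebook/bart-large')
input_ids = torch.tensor([10 * [0]])  # dummy input of shape (1, 10)
out = model.generate(input_ids, num_beams=3, do_sample=True,
                     return_dict_in_generate=True, output_scores=True,
                     max_length=10)
print(len(out.scores))       # at most max_length - 1 score tensors
print(out.scores[0].shape)   # (batch_size*num_beams, config.vocab_size)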
Good observation! The length of output_scores should be max_length - 1. The reason is that the first token, the decoder_start_token_id, is not generated, meaning that no scores can be calculated for it. Here is an example:
import torch
from transformers import AutoModelForSeq2SeqLM

model = AutoModelForSeq2SeqLM.from_pretrained('facebook/bart-large')
input_ids = torch.tensor([10 * [0]])  # dummy input_ids of shape (1, 10)
out = model.generate(input_ids, return_dict_in_generate=True, output_scores=True, max_length=10)
print("len scores:", len(out.scores))  # should give 9, i.e. max_length - 1
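To see the off-by-one explicitly, you can compare sequences and scores: sequences starts with the decoder_start_token_id, for which no score exists (a small sketch continuing the example above):

print(out.sequences.shape[-1] - len(out.scores))                   # 1: one more token than scores
print(out.sequences[0, 0] == model.config.decoder_start_token_id)  # tensor(True)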
Would you be interested in correcting the documentation in a PR for Transformers?
Hi @patrickvonplaten,
Thanks for your reply! That makes much more sense now. Sure, I can correct that in a PR.
Any reason that the decoder_start_token_id is concatenated to the beginning of the generated token ids?
Hi @ad26kr, BART uses <bos> at the beginning of the decoder input to indicate the start of decoding. This is how the model was pretrained (see Figure 1(c) of the paper).
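You can check this yourself, and simply slice the token off if you don't want it in the output (a small sketch; the input sentence and checkpoint are arbitrary):

from transformers import BartForConditionalGeneration, BartTokenizer

tokenizer = BartTokenizer.from_pretrained('facebook/bart-large')
model = BartForConditionalGeneration.from_pretrained('facebook/bart-large')

input_ids = tokenizer("Hello world", return_tensors="pt").input_ids
out = model.generate(input_ids, return_dict_in_generate=True, max_length=10)

print(out.sequences[0, 0].item() == model.config.decoder_start_token_id)  # True
generated = out.sequences[:, 1:]  # drop the leading special token
print(tokenizer.batch_decode(generated, skip_special_tokens=True))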
I know that several seq2seq models such as BART and T5 use special tokens as the first input token to let the decoder start decoding. However, I can't understand why the generated tokens from model.generate() include that special token at the beginning of the generated token ids.