Hey @Kylie,
good observation! len(out.scores) should indeed be max_length - 1. The reason is that the first token, the decoder_start_token_id, is not generated by the model, so no scores can be computed for it.
Here is an example:
#!/usr/bin/env python3
import torch
from transformers import AutoModelForSeq2SeqLM

model = AutoModelForSeq2SeqLM.from_pretrained("facebook/bart-large")

# max_length counts the decoder_start_token_id, so only 9 tokens are generated
out = model.generate(
    torch.tensor([10 * [1]]),
    return_dict_in_generate=True,
    output_scores=True,
    max_length=10,
)
print("len scores:", len(out.scores))  # gives 9, i.e. max_length - 1
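More generally, one score tensor is produced per generation step, so len(out.scores) always equals out.sequences.shape[-1] - 1 (the returned sequence additionally contains the decoder_start_token_id). A minimal sketch of that invariant, using a tiny randomly initialized BART so no model download is needed; the config values below are arbitrary small numbers chosen just for illustration:

```python
import torch
from transformers import BartConfig, BartForConditionalGeneration

# tiny, randomly initialized model -- sizes are arbitrary, for illustration only
config = BartConfig(
    vocab_size=32,
    d_model=16,
    encoder_layers=1,
    decoder_layers=1,
    encoder_attention_heads=2,
    decoder_attention_heads=2,
    encoder_ffn_dim=32,
    decoder_ffn_dim=32,
    max_position_embeddings=64,
)
model = BartForConditionalGeneration(config)

out = model.generate(
    torch.tensor([[1, 2, 3]]),
    return_dict_in_generate=True,
    output_scores=True,
    max_length=8,
)

# one score tensor per generated token; the decoder start token has none
assert len(out.scores) == out.sequences.shape[-1] - 1
```

The assertion also holds when generation stops early at an EOS token, since out.sequences is then shorter than max_length by the same amount.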
Would you be interested in correcting the documentation in a PR for Transformers?