BartForConditionalGeneration is erroneous either at .forward or at .generate

System Info

  • transformers version: 4.20.1
  • Platform: Linux-5.4.0-58-generic-x86_64-with-debian-buster-sid
  • Python version: 3.7.12
  • Huggingface_hub version: 0.8.1
  • PyTorch version (GPU?): 1.10.1+cu111 (True)
  • Tensorflow version (GPU?): not installed (NA)
  • Flax version (CPU?/GPU?/TPU?): not installed (NA)
  • Jax version: not installed
  • JaxLib version: not installed
  • Using GPU in script?: True
  • Using distributed or parallel set-up in script?: False

Who can help?

@patrickvonplaten

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, …)
  • My own task or dataset (give details below)

Reproduction

from transformers import AutoModelForSeq2SeqLM, AutoTokenizer
import torch

text = """
Phillip,   Could you please do me a favor?\nI would like  to read your current title policy to see what \
it says about easements.\nYou  should have received a copy during your closing.\nI don't know how many \
pages it will be but let me know how you want to handle getting a copy  made.\nI'll be happy to make the copy,\
or whatever makes it easy for  you.\nThanks,\n
"""

checkpoint = "Aktsvigun/bart-base_aeslc_42"
model = AutoModelForSeq2SeqLM.from_pretrained(checkpoint).cuda()
tokenizer = AutoTokenizer.from_pretrained(checkpoint)

input_ids = tokenizer(text, truncation=True, return_tensors="pt")["input_ids"].to(model.device)

generate_output = model.generate(
    input_ids,
    num_return_sequences=4,
    length_penalty=1.0,
    return_dict_in_generate=True,
    output_scores=True,
    early_stopping=True,
)

# Use the top-scoring generated sequence as labels: drop padding (token id 1)
# and skip the first token, since the initial decoder start token is not needed.
labels = generate_output.sequences[0][generate_output.sequences[0] != 1][None, 1:]
out = model(input_ids, labels=labels)
probas = torch.nn.functional.softmax(out.logits, dim=-1)

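# Reconstructed score: mean log-probability of the label ids over the decoder steps.
sequence_score = probas[0].log().gather(index=labels[0][:, None], dim=-1).sum() / len(labels[0])
# Sanity check: the mean log-probability is the negative of the cross-entropy loss.
assert torch.allclose(-sequence_score, out.loss)
# This is the assert that fails: the score reported by generate is different.
assert torch.allclose(sequence_score, generate_output.sequences_scores[0])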

Expected behavior

The last assert should pass, but the two scores differ (-0.8670 for the reconstructed score vs. -0.8581 from the generate output). What the code does: it first generates a sequence with BART and then tries to reproduce its score by calling .forward, computing the score as the average log-probability of the label ids across the decoder steps.
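To make the score computation concrete, here is a minimal sketch in plain Python on toy logits, so it runs without a model or a GPU. It is only an illustration of the averaging described above, not the library's implementation; the helper names (log_softmax, mean_label_log_prob, mean_cross_entropy) and the toy values are made up for this example. It shows why the first assert in the reproduction is expected to hold: the mean log-probability of the labels is the negative of the mean token-level cross-entropy.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over one list of logits."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def mean_label_log_prob(step_logits, labels):
    """Average log-probability of the label token at each decoder step."""
    total = 0.0
    for logits, label in zip(step_logits, labels):
        total += log_softmax(logits)[label]
    return total / len(labels)


def mean_cross_entropy(step_logits, labels):
    """Token-level cross-entropy, averaged over the decoder steps."""
    losses = [-log_softmax(logits)[label] for logits, label in zip(step_logits, labels)]
    return sum(losses) / len(losses)


# Hypothetical decoder outputs: 3 steps over a vocabulary of 4 tokens.
toy_logits = [
    [2.0, 0.5, -1.0, 0.0],
    [0.1, 0.2, 3.0, -0.5],
    [1.0, 1.0, 1.0, 1.0],
]
toy_labels = [0, 2, 3]

score = mean_label_log_prob(toy_logits, toy_labels)
loss = mean_cross_entropy(toy_logits, toy_labels)

# The reconstructed score and the loss differ only in sign.
assert math.isclose(score, -loss)
```

Since this identity holds by construction, the open question is only whether the score reported by generate is computed and normalized in the same way.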

Why this matters: I found this “sub-bug” while verifying another one. I wrote a function that restores the sequences and their scores from transformers.generation_utils.BeamSearchEncoderDecoderOutput.scores, and its results differ slightly from the ones stored in transformers.generation_utils.BeamSearchEncoderDecoderOutput itself. Namely, some of the sequences I restore have scores higher than transformers.generation_utils.BeamSearchEncoderDecoderOutput.sequences_scores. To check which version (the default one or mine) is correct, I need to pass a sequence through forward and calculate its “intrinsic” score. However, as this example shows, either .forward or .generate returns slightly erroneous results.