Perplexity for BART summaries

Hi, I’m using a BART-large model trained on Gigaword for summarisation and am trying to calculate the perplexity of the output summaries.

I’m doing the following since I’m using beam search:

    import torch
    from datasets import load_dataset
    from transformers import BartTokenizerFast, BartForConditionalGeneration

    model_checkpoint = 'a1noack/bart-large-gigaword'
    tokenizer = BartTokenizerFast.from_pretrained(model_checkpoint)
    model = BartForConditionalGeneration.from_pretrained(model_checkpoint, return_dict=True)
    device = "cuda" if torch.cuda.is_available() else "cpu"
    model = model.to(device)
    model.eval()

    test = load_dataset("gigaword", split='test[:20]')
    encodings = tokenizer(test['document'], return_tensors='pt', padding=True, truncation=True, max_length=1024).to(device)

    number_beams = 8
    # Pass the attention mask so padded positions are ignored during generation.
    result = model.generate(encodings['input_ids'], attention_mask=encodings['attention_mask'],
                            num_beams=number_beams, max_length=model.config.max_length,
                            return_dict_in_generate=True, output_scores=True, output_attentions=True)
    
    log_sent = []

    # result.scores is a tuple with one entry per generation step; each entry
    # has shape (batch_size * num_beams, vocab_size). For every input I take
    # the highest value of the last step's scores across its beams as that
    # sequence's score.
    for batch_num in range(0, result.scores[0].shape[0], number_beams):
        max_score = torch.tensor(-1e6, dtype=torch.float).to(device)
        for beam_num in range(number_beams):
            max_score = torch.max(torch.stack([torch.max(result.scores[-1][batch_num + beam_num]), max_score]))
        log_sent.append(max_score)

    print("Perplexity:", torch.exp(-torch.stack(log_sent).sum() / result.sequences.shape[1]))

This is based on my understanding of the answer by patrickvonplaten in Showing individual token and corresponding score during beam search - #2 by monmanuela, and of Generation Probabilities: How to compute probabilities of output scores for GPT2.

I’m unsure if this is the right way to use the scores output. I’m new to HF and NLP, and I haven’t been able to find a similar issue resolved on the forum, so it would be great if someone could confirm whether this is the right way to compute perplexity.
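
One more thing I came across while searching, though I haven’t verified it myself: newer transformers releases (4.26+, I believe) add model.compute_transition_scores(), which reconstructs the per-token log-probabilities of the returned sequences from scores and beam_indices. A sketch of how it might replace my loop above:

    # Sketch assuming a recent transformers version that provides
    # compute_transition_scores; for beam search, result.beam_indices is
    # returned when output_scores=True.
    transition_scores = model.compute_transition_scores(
        result.sequences, result.scores, result.beam_indices, normalize_logits=False
    )
    # Padded steps after EOS contribute a score of 0, so counting the strictly
    # negative entries recovers each summary's generated length.
    output_lengths = (transition_scores < 0).sum(dim=1)
    print("Perplexity per summary:", torch.exp(-transition_scores.sum(dim=1) / output_lengths))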

In case someone is still looking for a solution to this, here’s some sample code I wrote to get BART perplexity scores: Bart Token Level Perplexity. Note that it is for masked language modeling, not summarization, so it may need to be adapted for that task.
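
For the summarization case specifically, a rough adaptation (an untested sketch of my own, not the code from the link above; summary_perplexity is just a name I made up) would be to score a summary under the seq2seq model with teacher forcing and exponentiate the mean token cross-entropy that the model returns as its loss:

    import torch
    from transformers import BartTokenizerFast, BartForConditionalGeneration

    tokenizer = BartTokenizerFast.from_pretrained("a1noack/bart-large-gigaword")
    model = BartForConditionalGeneration.from_pretrained("a1noack/bart-large-gigaword").eval()

    def summary_perplexity(document, summary):
        # Hypothetical helper: encode the source document and use the summary's
        # token ids as the decoder labels.
        inputs = tokenizer(document, return_tensors="pt", truncation=True, max_length=1024)
        labels = tokenizer(summary, return_tensors="pt", truncation=True).input_ids
        with torch.no_grad():
            # With labels supplied, the model returns the mean cross-entropy
            # over the summary tokens as its loss.
            loss = model(**inputs, labels=labels).loss
        return torch.exp(loss).item()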