BART summarization token probabilities


I want to calculate the probability of each token in a summary, given the summary itself. In other words, I'd like to condition the generation of token t+1 on all the previous tokens up to t. This looks like teacher forcing, but my goal is not to fine-tune the BART summarization model; it is to calculate token probabilities in the way described above using a summarization model such as "facebook/bart-large-cnn". So far I have something like this:

source = "(CNN) -- An American woman died aboard a cruise ship that docked at Rio de Janeiro.."
summary = "Woman on the ship died. Another 86 passengers had fallen ill on the ship"


import torch
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

tokenizer = AutoTokenizer.from_pretrained("facebook/bart-large-cnn")
model = AutoModelForSeq2SeqLM.from_pretrained("facebook/bart-large-cnn")

e_input = tokenizer(source, return_tensors='pt')
d_input = tokenizer(summary, return_tensors='pt')
with torch.no_grad():
    # Passing labels makes BART shift the decoder inputs internally,
    # so logits at position i score the label token at position i.
    output = model(input_ids=e_input["input_ids"], attention_mask=e_input["attention_mask"], labels=d_input["input_ids"])

summary_tokens = d_input["input_ids"].squeeze()
for i, token_logits in enumerate(output.logits[0]):
    probs = torch.nn.functional.softmax(token_logits, dim=0)
    # probability the model assigns to the actual summary token at position i
    summ_token_prob = round(probs[summary_tokens[i]].item(), 2)
    # top-5 tokens the model itself would rank highest at this position
    _, preds = torch.topk(probs, k=5)
    model_preds_topk = tokenizer.decode(preds).split()
    print(f"{tokenizer.decode(summary_tokens[i])}:{summ_token_prob}:{model_preds_topk}")

I get the output as follows:

 on:0.02:['died', 'was', 'dies', "suffered's"]
 the:0.11:['MS', 'cruise', 'the', 'ship', 'same']
 ship:0.1:['MS', 'ship', 'same', 'cruise', 'Ve']
 died:0.17:['suffered', 'died', 'was', 'had', 'dies']
.:0.01:['of', 'on', 'from', 'Tuesday', 'aboard']
 Another:0.0:['The', 'She', '86', 'Ship', 'Doctors']
 86:0.9:['86', 'passenger', 'woman', '87', 'ship']
 passengers:0.67:['passengers', 'people', 'others', 'had', 'fell']
 had:0.19:['fell', 'had', 'were', 'previously', 'have']
 fallen:0.57:['fallen', 'previously', 'been', 'come', 'recently']
 ill:0.91:['ill', 'sick', 'down', 'off', 'out']
 on:0.2:['.', 'on', 'earlier', 'before', 'with']
 the:0.76:['the', 'board', 'previous', 'earlier', 'same']
 ship:0.38:['ship', 'same', 'MS', 'cruise', 'trip']
</s>:0.0:['.', 'before,', 'earlier', 'prior']

Each row shows a token of the summary, the probability the model assigns to it, and the model's top-k=5 predictions at that position for comparison.
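The same per-token probabilities can also be computed in one vectorized pass with log_softmax and gather, which avoids the Python loop and is more numerically stable if you later want whole-sequence log-likelihoods. A minimal sketch with toy tensors standing in for output.logits and d_input["input_ids"] (the shapes and values here are made up for illustration):

```python
import torch

def summary_token_logprobs(logits, labels):
    """Log-probability of each label token under the given logits.

    Assumes logits came from model(..., labels=labels).logits, where
    position t scores the label at position t (BART shifts the decoder
    inputs internally when labels are passed).
    """
    log_probs = torch.log_softmax(logits, dim=-1)                       # (B, T, V)
    return log_probs.gather(-1, labels.unsqueeze(-1)).squeeze(-1)      # (B, T)

# Toy check with random logits: batch 1, 4 positions, vocab size 10
torch.manual_seed(0)
logits = torch.randn(1, 4, 10)
labels = torch.tensor([[2, 5, 1, 9]])
lp = summary_token_logprobs(logits, labels)
probs = lp.exp()  # back to per-token probabilities
print(probs.shape)  # torch.Size([1, 4])
```

Summing lp over the time dimension gives the log-likelihood of the whole summary, which is usually safer than multiplying the rounded per-token probabilities.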

The result looks correct, but I'd like to confirm with others: am I on the right track?