I don’t think this will be the case. Wouldn’t it be easier to just check the logits of the token that the model predicts right after the prompt? Those logits form a distribution over the entire vocabulary of Donut’s decoder, so you could look only at the scores of the class tokens and normalize them to get a categorical distribution over the classes.
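Here is a minimal sketch of what I mean, assuming a Donut checkpoint fine-tuned for classification (I use `naver-clova-ix/donut-base-finetuned-rvlcdip` as an example), that your class names were added to the tokenizer as special tokens, and that the class token is generated directly after the task prompt (if your output format wraps it, e.g. in `<s_class>...</s_class>`, read the logits at the position where the class token itself is predicted instead). The class token strings and the image path are placeholders:

```python
import torch
from PIL import Image
from transformers import DonutProcessor, VisionEncoderDecoderModel

# Example checkpoint; substitute your own fine-tuned model
processor = DonutProcessor.from_pretrained("naver-clova-ix/donut-base-finetuned-rvlcdip")
model = VisionEncoderDecoderModel.from_pretrained("naver-clova-ix/donut-base-finetuned-rvlcdip")

image = Image.open("document.png").convert("RGB")
pixel_values = processor(image, return_tensors="pt").pixel_values

# Task prompt used during fine-tuning; the class token should be predicted right after it
task_prompt = "<s_rvlcdip>"
decoder_input_ids = processor.tokenizer(
    task_prompt, add_special_tokens=False, return_tensors="pt"
).input_ids

with torch.no_grad():
    outputs = model(pixel_values=pixel_values, decoder_input_ids=decoder_input_ids)

# Logits of the next token after the prompt: shape (vocab_size,)
next_token_logits = outputs.logits[0, -1, :]

# Keep only the class tokens and renormalize to get a categorical distribution.
# These token strings are placeholders; use the ones added to your tokenizer.
class_tokens = ["<advertisement/>", "<budget/>", "<email/>"]
class_token_ids = processor.tokenizer.convert_tokens_to_ids(class_tokens)
class_probs = torch.softmax(next_token_logits[class_token_ids], dim=-1)

for name, prob in zip(class_tokens, class_probs.tolist()):
    print(f"{name}: {prob:.3f}")
```

Softmaxing over just the class-token logits is equivalent to conditioning the full next-token distribution on "the next token is one of the classes", which is exactly the categorical distribution you want for a classifier-style confidence score.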