Roberta vs Bart perplexity calculation

I need to create a perplexity score for Bart and I’ve looked at a few examples (e.g. run_mlm.py from https://github.com/huggingface/transformers/tree/master/examples/pytorch/language-modeling for Roberta). Unfortunately this doesn’t work for Bart and gives very high perplexity scores. The root of the issue seems to be the use of -100 ids for all non-masked label tokens. The only way I have found around this is to keep the label ids as tokenized (no -100 masking) and then manually extract the logits for the specific mask locations and do a little math. This gives me a perplexity of about 20, which seems high considering Roberta gives about 6 with the code from the link above. Does anyone know of code to do this calculation for Bart, or have advice on the simple/correct way to do it?
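For reference, the manual workaround described above looks roughly like this (a sketch only; the model, the example text, and the mask positions are made-up placeholders, not the exact code used):

        import math
        import torch
        from transformers import BartForConditionalGeneration, BartTokenizer

        tokenizer = BartTokenizer.from_pretrained("facebook/bart-base")
        model = BartForConditionalGeneration.from_pretrained("facebook/bart-base").eval()

        enc = tokenizer("The quick brown fox jumps over the lazy dog.", return_tensors="pt")
        labels = enc.input_ids.clone()          # keep the full label ids, no -100 masking

        # mask a couple of positions in the encoder input
        mask_positions = torch.tensor([2, 5])
        masked = enc.input_ids.clone()
        masked[0, mask_positions] = tokenizer.mask_token_id

        with torch.no_grad():
            logits = model(input_ids=masked, labels=labels).logits

        # pull out the log-probabilities of the true tokens at the masked positions only
        log_probs = torch.log_softmax(logits[0, mask_positions], dim=-1)
        token_nll = -log_probs.gather(1, labels[0, mask_positions].unsqueeze(1)).squeeze(1)
        perplexity = math.exp(token_nll.mean().item())

(Because the labels here are not -100-masked, Bart can still build its decoder_input_ids from them internally, which is why this workaround runs at all.)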

Looks like the trick is to pass manually created decoder_input_ids to the model. If these aren’t passed in, Bart creates them from the labels, and since most of those are -100, that messes up the decoding process. Also note that I think the run_mlm.py script isn’t correctly placing the bos/eos tokens. To get Bart to score properly I had to tokenize, segment for length, and then manually add these tokens back into each batch sequence.
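For what it’s worth, recent versions of transformers expose the shift_tokens_right helper that Bart uses internally, so a minimal sketch of building those decoder_input_ids from the clean (un-masked) ids could look like this (clean_ids is a made-up placeholder):

        import torch
        from transformers import AutoConfig
        from transformers.models.bart.modeling_bart import shift_tokens_right

        config = AutoConfig.from_pretrained("facebook/bart-base")

        # the original, un-masked token ids with bos/eos included (placeholder values)
        clean_ids = torch.tensor([[config.bos_token_id, 100, 200, 300, config.eos_token_id]])

        # build decoder_input_ids from the clean ids, NOT from the -100-masked labels
        decoder_input_ids = shift_tokens_right(
            clean_ids, config.pad_token_id, config.decoder_start_token_id
        )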

I don’t have particular experience with calculating perplexity by hand for BART. Below is the code snippet I used for GPT-2. Setting all the padded tokens (or any tokens you don’t want to include in the perplexity) to -100 works.

        import torch

        # evidence_inp / claim_inp are BatchEncoding objects from the tokenizer;
        # the claim is the part we actually want to score
        tgt_len = claim_inp.input_ids.size(1)
        input_ids = torch.cat([evidence_inp.input_ids, claim_inp.input_ids], dim=-1).to(device)
        target_ids = input_ids.clone()
        # set the evidence tokens to -100 so they're ignored when calculating the loss/perplexity
        target_ids[:, :-tgt_len] = -100

        with torch.no_grad():
            outputs = model(input_ids, labels=target_ids)
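
For completeness: outputs.loss here is the mean cross-entropy over the target tokens that aren’t set to -100, so the perplexity for the batch is just its exponential:

        perplexity = torch.exp(outputs.loss).item()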

Also, there’s a guide from Hugging Face that I found very useful: the “Perplexity of fixed-length models” page in the Transformers docs.

Thanks for the info. Just FYI and for anyone else interested…
What works for GPT-2 (and the solution in the fixed-length models link) doesn’t work for Bart. With Bart you need to pass in the decoder_input_ids yourself instead of letting the model create them from the labels. The code is something like…

        # bos/eos: tokenizer.bos_token_id / tokenizer.eos_token_id
        # dst: the model's decoder_start_token_id
        input_ids = [[bos] + sample + [eos] for sample in samples]
        decoder_input_ids = [[dst] + iids[:-1] for iids in input_ids]  # shift_tokens_right

where samples are the tokenized and chunked (e.g. length = 512) text without the bos/eos tokens added. You could then manually mask the input_ids and create the labels, but there’s also a collator that will do this for you: DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.15). The output from the DataLoader will then have the randomly masked input_ids and the labels with -100 in the appropriate locations. Pass these and the decoder_input_ids to the model and use perplexity = math.exp(statistics.mean(losses)).
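Putting that together, the evaluation loop looks roughly like this (a sketch under the assumptions above, not the exact notebook code; samples, the model name, and the batch size are placeholders):

        import math
        import statistics
        import torch
        from torch.utils.data import DataLoader
        from transformers import (BartForConditionalGeneration, BartTokenizer,
                                  DataCollatorForLanguageModeling)

        device = "cuda" if torch.cuda.is_available() else "cpu"
        tokenizer = BartTokenizer.from_pretrained("facebook/bart-base")
        model = BartForConditionalGeneration.from_pretrained("facebook/bart-base").to(device).eval()

        bos, eos = tokenizer.bos_token_id, tokenizer.eos_token_id
        dst = model.config.decoder_start_token_id

        # samples: tokenized, fixed-length chunks WITHOUT bos/eos, built elsewhere
        input_ids = [[bos] + sample + [eos] for sample in samples]
        decoder_input_ids = [[dst] + iids[:-1] for iids in input_ids]  # shift_tokens_right

        collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.15)
        batch_size = 8
        loader = DataLoader([{"input_ids": iids} for iids in input_ids],
                            batch_size=batch_size, collate_fn=collator)

        losses = []
        with torch.no_grad():
            for i, batch in enumerate(loader):
                # the collator returns randomly masked input_ids and labels that are -100
                # everywhere except the masked positions; pass the clean decoder_input_ids
                # explicitly (assumes equal-length chunks and an unshuffled DataLoader)
                dec = torch.tensor(decoder_input_ids[i * batch_size:(i + 1) * batch_size]).to(device)
                out = model(input_ids=batch["input_ids"].to(device),
                            decoder_input_ids=dec,
                            labels=batch["labels"].to(device))
                losses.append(out.loss.item())

        perplexity = math.exp(statistics.mean(losses))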

Here’s the full code: Bart Token Level Perplexity