Guide: The best way to calculate the perplexity of fixed-length models

Hey all. Just thought you might be interested in a page I just added to the research docs on the perplexity of fixed-length models.

Perplexity (PPL) is defined as the exponentiated average negative log-likelihood of a sequence. For a t-length sequence X = (x_1, ..., x_t), it is defined as:

\text{PPL}(X) = \exp \left\{ -\frac{1}{t} \sum_{i=1}^{t} \log p_\theta (x_i \mid x_{<i}) \right\}
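
In code this is just the exponentiated mean of the per-token negative log-likelihoods. A tiny sketch, where token_log_probs is a made-up tensor standing in for the model’s log p(x_i|x_{<i}) values:

import torch

# made-up per-token log-probabilities log p(x_i | x_{<i})
token_log_probs = torch.tensor([-2.1, -0.7, -1.3, -3.0])

# PPL(X) = exp(-1/t * sum_i log p(x_i | x_{<i}))
ppl = torch.exp(-token_log_probs.mean())
print(ppl)  # lower is better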

But with fixed-length models (like most transformers), we can’t always condition on the entire preceding subsequence when predicting each token.

The initial instinct for many in dealing with this problem is to break the whole sequence into segments equal to the model’s max input size and calculate the likelihoods of each segment independently. This is not the best approach, however, since it gives the model very little context to use for prediction at the beginning of each segment. I’ll illustrate this with the following gif, where we imagine a model with a max input size of 6 adding up the log-likelihoods for the sentence, “Hugging Face is a startup based in New York City and Paris”.

When the model starts the second segment, it has to try to predict the word “in” without any context, even though we have 5 words before it that the model could be using (since we said the max input size is 6).
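
In code, the naive approach looks roughly like this sketch (model, encodings and device are assumed to be set up as in the guide; with labels=input_ids the model returns the mean negative log-likelihood over the chunk’s predicted tokens):

import torch

max_length = model.config.n_positions  # e.g. 1024 for GPT-2
nlls = []

# score each max_length-sized chunk independently -- no context carries over
for begin_loc in range(0, encodings.input_ids.size(1), max_length):
    end_loc = min(begin_loc + max_length, encodings.input_ids.size(1))
    input_ids = encodings.input_ids[:, begin_loc:end_loc].to(device)

    with torch.no_grad():
        loss = model(input_ids, labels=input_ids)[0]  # mean NLL for this chunk

    nlls.append(loss * (end_loc - begin_loc))  # approximate summed NLL

ppl = torch.exp(torch.stack(nlls).sum() / encodings.input_ids.size(1))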

A better approach is to instead employ a sliding window strategy, where you continually move the context across the sequence, allowing the model to take advantage of the available context.

This is slower to compute, but will typically yield better scores and is actually much closer to the way the sequence probabilities are formally decomposed (e.g. see the equation above).
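
To make the sliding window concrete, here’s a toy sketch (not from the guide) that prints, for a 12-token sequence and a max input size of 6, which tokens each prediction can condition on when the window slides one token at a time:

seq_len, max_input = 12, 6
for t in range(1, seq_len):  # token 0 has nothing to condition on
    context = list(range(max(0, t - (max_input - 1)), t))
    print(f"predict token {t} given tokens {context}")
# e.g. token 6 ("in" above) now gets 5 tokens of context instead of none

In practice, sliding one token at a time means one forward pass per token, so the guide uses a larger stride (512 below) as a middle ground between this and the naive approach.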

In the guide, we show how to do this in a strided way with GPT-2. When using the first, naive approach, GPT-2 gets a PPL of 19.64 on WikiText-2. In contrast, when we use a strided sliding window, the score improves dramatically, dropping to 16.53.

9 Likes

Hi, I have a question about the perplexity calculation from the guide.

Why do we divide by i in the example, see ppl = torch.exp(torch.stack(lls).sum() / i)?

If you have a codebase or paper that exemplifies this behaviour could you please share it?
Thanks!

Hmm yes, you should actually divide by encodings.input_ids.size(1) since i doesn’t account for the length of the last stride.

I also just spotted another bug. When the length of the last segment is less than stride, the log_likelihood calculation is slightly off. The difference in scores won’t be significant, but I’ve updated the guide on master. This should be right:

import torch
from tqdm import tqdm

# model, encodings and device are set up as in the guide (see the sketch below)
max_length = model.config.n_positions
stride = 512

lls = []
for i in tqdm(range(0, encodings.input_ids.size(1), stride)):
    begin_loc = max(i + stride - max_length, 0)
    end_loc = min(i + stride, encodings.input_ids.size(1))
    trg_len = end_loc - i    # may be different from stride on last loop
    input_ids = encodings.input_ids[:, begin_loc:end_loc].to(device)
    target_ids = input_ids.clone()
    target_ids[:, :-trg_len] = -100    # only the last trg_len tokens are scored

    with torch.no_grad():
        outputs = model(input_ids, labels=target_ids)
        # outputs[0] is the mean negative log-likelihood over the target tokens,
        # so scaling by trg_len gives the summed NLL for this window
        log_likelihood = outputs[0] * trg_len

    lls.append(log_likelihood)

ppl = torch.exp(torch.stack(lls).sum() / end_loc)    # end_loc == total token count
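
For reference, a minimal setup for model, encodings and device along the lines of the guide would be something like this (the checkpoint and dataset split here are just example choices):

import torch
from datasets import load_dataset
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

device = "cuda" if torch.cuda.is_available() else "cpu"
model = GPT2LMHeadModel.from_pretrained("gpt2").to(device)
tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")

# WikiText-2 test set, concatenated and tokenized as one long sequence
test = load_dataset("wikitext", "wikitext-2-raw-v1", split="test")
encodings = tokenizer("\n\n".join(test["text"]), return_tensors="pt")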

Does that answer your question?

3 Likes

yep thanks Joe!
I was thinking something similar but wanted to check in case I was missing something

1 Like

Hi @joeddav - the input_ids and target_ids are the same. Shouldn’t target_ids be shifted by one?

Nevermind - just found out that labels are shifted inside the model and the loss for last one gets ignored.

labels (torch.LongTensor of shape (batch_size, sequence_length), optional) – Labels for language modeling. Note that the labels are shifted inside the model, i.e. you can set labels = input_ids. Indices are selected in [-100, 0, ..., config.vocab_size]. All labels set to -100 are ignored (masked); the loss is only computed for labels in [0, ..., config.vocab_size].
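
A quick way to convince yourself of this (standalone toy sketch, using a small GPT-2 checkpoint just for illustration): the loss with labels = input_ids matches a manual shift-by-one cross-entropy.

import torch
import torch.nn.functional as F
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

model = GPT2LMHeadModel.from_pretrained("gpt2")
tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
input_ids = tokenizer("Hugging Face is a startup", return_tensors="pt").input_ids

with torch.no_grad():
    out = model(input_ids, labels=input_ids)
    # manual version: logits at position j predict the token at position j+1
    shift_logits = out.logits[:, :-1, :]
    shift_labels = input_ids[:, 1:]
    manual = F.cross_entropy(shift_logits.reshape(-1, shift_logits.size(-1)),
                             shift_labels.reshape(-1))

print(out.loss, manual)  # the two values agree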

2 Likes

@joeddav I read and read the page several times. Thank you!

What would be the simplest way of accessing a perplexity score for a sentence and its parts? I’m building an application in NodeJS and hoping to access a perplexity score via an API - paid is fine for now. I think I could set up the Python model somewhere and expose it via an API but this hopefully will come later after some MVP testing.

Thank you again!

I am wondering whether this is still correct. So what you do is, for all input sequences:

neg_log_likelihood = outputs[0] * trg_len

Yet the first output of causal LMs is the CrossEntropyLoss, not the NLL. So you can just take the mean CE loss over all sequences and exponentiate it.

EDIT: that is also how it is implemented in the Trainer and run_clm.py script. First gather all losses for all batches in the whole validation set and take the mean.

Then take the exponential.
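
Roughly like this (a sketch of that style of evaluation, not the actual run_clm.py code; eval_dataloader, model and device are assumed to exist, with every batch the same length):

import math
import torch

batch_losses = []
for batch in eval_dataloader:
    input_ids = batch["input_ids"].to(device)
    with torch.no_grad():
        loss = model(input_ids, labels=input_ids).loss
    batch_losses.append(loss.item())

# mean cross-entropy over all batches, then exponentiate
eval_loss = sum(batch_losses) / len(batch_losses)
perplexity = math.exp(eval_loss)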

1 Like

I’d agree with @BramVanroy , any thoughts @joeddav on the above post?

I don’t understand the multiplication by trg_len in this example. Also on my dataset it explodes the perplexity by orders of magnitude above a uniform upper bound of log(|Vocab Size|) 🙂

I think it is correct for the fixed-length perplexity example, since the batch size is 1: outputs[0] is the mean loss over the trg_len target tokens, so multiplying by trg_len just recovers the summed negative log-likelihood for that window before dividing by the total token count.
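
A toy comparison of the two aggregations discussed in this thread (the numbers are made up):

import torch

losses = torch.tensor([3.2, 2.9, 3.5])    # mean NLL per window
trg_lens = torch.tensor([512, 512, 200])  # tokens scored in each window

# guide style: token-weighted average, then exp
weighted = torch.exp((losses * trg_lens).sum() / trg_lens.sum())
# simple style: plain mean over windows, then exp
unweighted = torch.exp(losses.mean())
print(weighted, unweighted)  # only differ when windows score unequal token counts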

BTW, most libraries, like simpletransformers, implement the perplexity calculation as exp(sum_of_loss_in_all_batches / num_of_batches), see simpletransformers/language_modeling_model.py at 254aaaa218635ef68f80ad1917403e7b7e24d710 · ThilinaRajapakse/simpletransformers · GitHub