Guide: The best way to calculate the perplexity of fixed-length models

Hey all. Just thought you might be interested in a page I just added to the research docs on the perplexity of fixed-length models.

Perplexity (PPL) is defined as the exponentiated average negative log-likelihood of a sequence. For a tokenized sequence X = (x_1, x_2, ..., x_t), this is defined as,

\text{PPL}(X) = \exp \left\{ -\frac{1}{t} \sum_{i=1}^{t} \log p_\theta (x_i \mid x_{<i}) \right\}
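As a quick sanity check on the definition, here's a tiny sketch computing PPL directly from per-token conditional probabilities (the probabilities are made up for illustration):

```python
import math

# Hypothetical per-token conditional probabilities p(x_i | x_<i) for a 4-token sequence
probs = [0.2, 0.5, 0.1, 0.4]

# PPL(X) = exp(-(1/t) * sum_i log p(x_i | x_<i))
nll = -sum(math.log(p) for p in probs) / len(probs)
ppl = math.exp(nll)

# Equivalent view: PPL is the geometric mean of the inverse probabilities,
# i.e. the effective "branching factor" the model faces per token
geo_mean_inv = math.prod(1 / p for p in probs) ** (1 / len(probs))
```

The two expressions agree by algebra, which is a handy check when implementing this yourself.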

But with fixed-length models (like most transformers), we can’t always condition on the entire preceding subsequence when predicting each token.

The initial instinct for many in dealing with this problem is to break the whole sequence into segments equal to the model’s max input size and calculate the likelihoods of each segment independently. This is not the best approach, however, since it gives the model very little context to use for prediction at the beginning of each segment. I’ll illustrate this with the following gif, where we imagine a model with a max input size of 6 adding up the log-likelihoods for the sentence, “Hugging Face is a startup based in New York City and Paris”

When the model starts the second segment, it has to try to predict the word “in” without any context, even though we have 5 words before it that the model could be using (since we said the max input size is 6).
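To make that context loss concrete, here's a toy sketch (pure Python, using the sentence above and the hypothetical max input size of 6) counting how many preceding tokens the naive segmentation actually lets the model see for each prediction:

```python
tokens = "Hugging Face is a startup based in New York City and Paris".split()
max_input_size = 6

# Naive approach: split the sequence into disjoint segments of max_input_size
segments = [tokens[i:i + max_input_size] for i in range(0, len(tokens), max_input_size)]

# Context available when predicting each token under the naive scheme:
# only the tokens earlier in the *same* segment count
naive_context = {}
for seg in segments:
    for j, tok in enumerate(seg):
        naive_context[tok] = j  # number of preceding tokens the model sees
```

Here `naive_context["in"]` comes out to 0: the model predicts "in" with no context at all, even though 6 words precede it in the full sentence.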

A better approach is to instead employ a sliding window strategy, where you continually move the context across the sequence, allowing the model to take advantage of the available context.

This is slower to compute, but will typically yield better scores and is actually much closer to the way the sequence probabilities are formally decomposed (e.g. see the equation above).

In the guide, we show how to do this in a strided way with GPT-2. When using the first, naive approach, GPT-2 gets a PPL of 19.64 on WikiText-2. In contrast, when we use a strided sliding window, this score improves dramatically down to 16.53.


Hi, I have a question about the perplexity calculation from the guide.

Why do we divide by `i` in the example, see `ppl = torch.exp(torch.stack(lls).sum() / i)`?

If you have a codebase or paper that exemplifies this behaviour could you please share it?

Hmm yes, you should actually divide by `encodings.input_ids.size(1)` since `i` doesn’t account for the length of the last stride.

I also just spotted another bug. When the length of the last segment is less than `stride`, the log_likelihood calculation is slightly off. The difference in scores won’t be significant, but I’ve updated the guide on master. This should be right:

max_length = model.config.n_positions
stride = 512

lls = []
for i in tqdm(range(0, encodings.input_ids.size(1), stride)):
    begin_loc = max(i + stride - max_length, 0)
    end_loc = min(i + stride, encodings.input_ids.size(1))
    trg_len = end_loc - i    # may be different from stride on last loop
    input_ids = encodings.input_ids[:, begin_loc:end_loc].to(device)
    target_ids = input_ids.clone()
    target_ids[:, :-trg_len] = -100  # only score the last trg_len tokens

    with torch.no_grad():
        outputs = model(input_ids, labels=target_ids)
        # outputs[0] is the average NLL over the trg_len target tokens,
        # so multiply to recover the summed log-likelihood for this window
        log_likelihood = outputs[0] * trg_len

    lls.append(log_likelihood)

ppl = torch.exp(torch.stack(lls).sum() / end_loc)

Does that answer your question?


yep thanks Joe!
I was thinking something similar but wanted to check in case I was missing something


Hi @joeddav - the input_ids and target_ids are the same. Shouldn’t target_ids be shifted by one?

Nevermind - just found out that the labels are shifted inside the model and the loss for the last one gets ignored.

labels (torch.LongTensor of shape (batch_size, sequence_length), optional) – Labels for language modeling. Note that the labels are shifted inside the model, i.e. you can set labels = input_ids. Indices are selected in [-100, 0, ..., config.vocab_size]. All labels set to -100 are ignored (masked); the loss is only computed for labels in [0, ..., config.vocab_size].
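For anyone else double-checking this, here's a minimal sketch (plain PyTorch with toy random logits, not the actual model internals verbatim) of what that shift amounts to when you pass labels = input_ids: the logits at position i are scored against the label at position i+1, and anything set to -100 is skipped:

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
vocab_size = 10
input_ids = torch.tensor([[3, 1, 4, 1, 5]])
logits = torch.randn(1, 5, vocab_size)  # stand-in for the model's output logits

labels = input_ids.clone()
labels[:, :2] = -100  # mask some positions, as with target_ids in the guide

# The shift done before computing cross-entropy:
# predict token i+1 from the logits at position i
shift_logits = logits[..., :-1, :].contiguous()
shift_labels = labels[..., 1:].contiguous()
loss = F.cross_entropy(
    shift_logits.view(-1, vocab_size),
    shift_labels.view(-1),
    ignore_index=-100,  # labels set to -100 contribute nothing to the loss
)
```

So `input_ids` and `target_ids` being identical in the guide is fine: the shift and the -100 masking together ensure only the intended next-token predictions are scored.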