Guide: The best way to calculate the perplexity of fixed-length models

joeddav · July 10, 2020, 5:07pm

Hey all. Just thought you might be interested in a page I just added to the research docs on the perplexity of fixed-length models.

Perplexity (PPL) is defined as the exponentiated average negative log-likelihood of a sequence. For a sequence X of length t, it is defined as

\text{PPL}(X) = \exp \left\{ -\frac{1}{t} \sum_{i=1}^{t} \log p_\theta (x_i|x_{<i}) \right\}
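
As a quick numeric check of the formula, here is a minimal sketch in plain Python. The function name `perplexity` and the toy probabilities are made up for illustration; they are not taken from the guide.

```python
import math

def perplexity(token_log_probs):
    """Exponentiated average negative log-likelihood of one sequence.

    token_log_probs[i] is log p(x_i | x_<i), in nats.
    """
    t = len(token_log_probs)
    return math.exp(-sum(token_log_probs) / t)

# A model that gives every token probability 0.25 is as uncertain as a
# uniform choice among 4 options, so its perplexity is 4 (up to rounding).
uniform_4 = [math.log(0.25)] * 10
print(perplexity(uniform_4))

# With mixed confidences, PPL is the inverse geometric mean of the
# per-token probabilities.
probs = [0.5, 0.1, 0.9, 0.25]
mixed = perplexity([math.log(p) for p in probs])
print(mixed)
```

Lower is better: a perplexity of k roughly means the model is as uncertain as if it were picking uniformly among k tokens at each step.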

But with fixed-length models (like most transformers), we can’t always condition on the entire preceding subsequence when predicting each token.

The initial instinct for many in dealing with this problem is to break the whole sequence into segments equal to the model’s max input size and calculate the likelihoods of each segment independently. This is not the best approach, however, since it gives the model very little context to use for prediction at the beginning of each segment. I’ll illustrate this with the following gif where we imagine a model with a max input size of 6 adding up the log-likelihoods for the sentence, “Hugging Face is a startup based in New York City and Paris”

When the model starts the second segment, it has to try to predict the word “in” without any context, even though we have 5 words before it that the model could be using (since we said the max input size is 6).

A better approach is to instead employ a sliding window strategy, where you continually move the context across the sequence, allowing the model to take advantage of the available context.

This is slower to compute, but will typically yield better scores and is actually much closer to the way the sequence probabilities are formally decomposed (e.g. see the equation above).
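
To make the difference concrete, the sketch below counts how many preceding tokens each word can see under the two strategies, using the example sentence and a max input size of 6. It treats each word as one token, which is a simplification, and the helper names `disjoint_context` and `sliding_context` are hypothetical.

```python
def disjoint_context(n_tokens, max_len):
    """Context size per position when the text is cut into
    non-overlapping segments of max_len tokens."""
    return [i % max_len for i in range(n_tokens)]

def sliding_context(n_tokens, max_len):
    """Context size per position when the window slides one token at a
    time, so every token sees as much context as the window allows."""
    return [min(i, max_len - 1) for i in range(n_tokens)]

words = "Hugging Face is a startup based in New York City and Paris".split()
max_len = 6

naive = disjoint_context(len(words), max_len)
sliding = sliding_context(len(words), max_len)

for word, a, b in zip(words, naive, sliding):
    print(f"{word:>8}  disjoint={a}  sliding={b}")

# Cost of the better context: the first window scores max_len tokens in
# one forward pass, then each remaining token needs its own pass.
naive_passes = -(-len(words) // max_len)          # ceiling division
sliding_passes = 1 + max(0, len(words) - max_len)
print("forward passes:", naive_passes, "vs", sliding_passes)
```

With disjoint segments the word “in” is predicted with no context at all, while the sliding window gives it the 5 preceding words, at the price of more forward passes.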

In the guide, we show how to do this in a strided way with GPT-2. When using the first, naive approach, GPT-2 gets a PPL of 19.64 on WikiText-2. In contrast, when we use a strided sliding window, this score improves dramatically down to 16.53.

adatkins · October 20, 2020, 8:37pm

Hi, I have a question about the perplexity calculation from the guide.

Why do we divide by i in the example, see ppl = torch.exp(torch.stack(lls).sum() / i)?

If you have a codebase or paper that exemplifies this behaviour could you please share it?
Thanks!

joeddav · October 20, 2020, 10:01pm

Hmm yes, you should actually divide by encodings.input_ids.size(1) since i doesn’t account for the length of the last stride.

I also just spotted another bug. When the length of the last segment is less than stride, the log_likelihood calculation is slightly off. The difference in scores won’t be significant, but I’ve updated the guide on master. This should be right:

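import torch
from tqdm import tqdm

# Assumes `model`, `encodings`, and `device` are set up as in the guide.
max_length = model.config.n_positions
stride = 512

lls = []
for i in tqdm(range(0, encodings.input_ids.size(1), stride)):
    begin_loc = max(i + stride - max_length, 0)
    end_loc = min(i + stride, encodings.input_ids.size(1))
    trg_len = end_loc - i    # may be different from stride on last loop
    input_ids = encodings.input_ids[:,begin_loc:end_loc].to(device)
    target_ids = input_ids.clone()
    target_ids[:,:-trg_len] = -100    # score only the last trg_len tokens; the rest is context

    with torch.no_grad():
        outputs = model(input_ids, labels=target_ids)
        # outputs[0] is the mean loss over the scored tokens, so rescale it to a sum
        log_likelihood = outputs[0] * trg_len

    lls.append(log_likelihood)

# after the loop, end_loc equals the total number of tokens
ppl = torch.exp(torch.stack(lls).sum() / end_loc)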

Does that answer your question?
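
For anyone who wants to check the window arithmetic without loading a model, here is a small sketch that reproduces only the index bookkeeping of the loop above. The function name `strided_windows` and the sizes (1300 tokens, max length 1024) are made up for illustration.

```python
def strided_windows(seq_len, max_length, stride):
    """Yield (i, begin_loc, end_loc, trg_len) using the same arithmetic
    as the loop in the post above."""
    for i in range(0, seq_len, stride):
        begin_loc = max(i + stride - max_length, 0)
        end_loc = min(i + stride, seq_len)
        trg_len = end_loc - i  # number of tokens scored in this window
        yield i, begin_loc, end_loc, trg_len

seq_len, max_length, stride = 1300, 1024, 512  # illustrative sizes only
windows = list(strided_windows(seq_len, max_length, stride))
for w in windows:
    print(w)

total_scored = sum(trg_len for _, _, _, trg_len in windows)
last_i = windows[-1][0]
print("tokens scored:", total_scored, "| last loop index i:", last_i)
```

In this example the last window scores only 276 tokens, and the final value of i is 1024 while 1300 tokens were scored in total, which is why the sum should be divided by the sequence length rather than by i.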

adatkins · October 21, 2020, 2:02pm

yep thanks Joe!
I was thinking something similar but wanted to check in case I was missing something

sytelus · March 1, 2021, 10:57pm

Hi @joeddav - the input_ids and target_ids are the same. Shouldn’t target_ids be shifted by one?

sytelus · March 1, 2021, 11:09pm

Nevermind - just found out that labels are shifted inside the model and the loss for last one gets ignored.

labels (torch.LongTensor of shape (batch_size, sequence_length), optional) – Labels for language modeling. Note that the labels are shifted inside the model, i.e. you can set labels = input_ids. Indices are selected in [-100, 0, ..., config.vocab_size]. All labels set to -100 are ignored (masked); the loss is only computed for labels in [0, ..., config.vocab_size].
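
Here is a plain-Python sketch of the behaviour that docstring describes: the prediction made at position t is compared with the label at position t + 1, and labels equal to -100 are skipped. It is an illustration of the idea, not the library's implementation, and `shifted_mean_nll` is a hypothetical helper.

```python
import math

IGNORE_INDEX = -100

def shifted_mean_nll(log_probs, labels):
    """log_probs[t][v] is log p(next token = v | tokens up to position t).

    The prediction at position t is scored against labels[t + 1];
    labels equal to IGNORE_INDEX are skipped.
    Returns (mean NLL over scored positions, number of scored positions).
    """
    total, count = 0.0, 0
    for t in range(len(labels) - 1):
        target = labels[t + 1]
        if target == IGNORE_INDEX:
            continue
        total -= log_probs[t][target]
        count += 1
    return total / count, count

vocab_size = 3
input_ids = [0, 2, 1, 1]
# Toy "model": a uniform distribution over the vocabulary at every position.
log_probs = [[math.log(1 / vocab_size)] * vocab_size for _ in input_ids]

# labels = input_ids: the first token is never a target and the last
# position's prediction has nothing to be compared with.
loss_all, n_all = shifted_mean_nll(log_probs, list(input_ids))

# Mask the first two labels so that only the last two tokens are scored.
loss_tail, n_tail = shifted_mean_nll(log_probs, [IGNORE_INDEX, IGNORE_INDEX, 1, 1])

print(loss_all, n_all)
print(loss_tail, n_tail)
```

So passing labels identical to input_ids is fine: 4 input tokens give 3 scored predictions, and masking with -100 only removes targets from the average.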

wcjord · July 15, 2021, 7:44pm

@joeddav I read and reread the page several times. Thank you!

What would be the simplest way of accessing a perplexity score for a sentence and its parts? I’m building an application in NodeJS and hoping to access a perplexity score via an API - paid is fine for now. I think I could set up the Python model somewhere and expose it via an API but this hopefully will come later after some MVP testing.

Thank you again!

BramVanroy · October 16, 2021, 8:01pm

I am wondering whether this is still correct. So what you do is, for all input sequences:

neg_log_likelihood = outputs[0] * trg_len

Yet the first output of causal LMs is CrossEntropyLoss, not NLLL. So from that you can just get the mean CE loss from all sequences and get the exponential.

EDIT: that is also how it is implemented in the Trainer and run_clm.py script. First gather all losses for all batches in the whole validation set and take the mean.

https://github.com/huggingface/transformers/blob/11c69b80452fae4b13c6d8bc22bdc19f3a752199/src/transformers/trainer.py#L2353-L2354

        if all_losses is not None:
            metrics[f"{metric_key_prefix}_loss"] = all_losses.mean().item()

Then take the exponential.

https://github.com/huggingface/transformers/blob/11c69b80452fae4b13c6d8bc22bdc19f3a752199/examples/pytorch/language-modeling/run_clm.py#L495

    # Evaluation
    if training_args.do_eval:
        logger.info("*** Evaluate ***")

        metrics = trainer.evaluate()

        max_eval_samples = data_args.max_eval_samples if data_args.max_eval_samples is not None else len(eval_dataset)
        metrics["eval_samples"] = min(max_eval_samples, len(eval_dataset))
        try:
            perplexity = math.exp(metrics["eval_loss"])
        except OverflowError:
            perplexity = float("inf")
        metrics["perplexity"] = perplexity

        trainer.log_metrics("eval", metrics)
        trainer.save_metrics("eval", metrics)

    kwargs = {"finetuned_from": model_args.model_name_or_path, "tasks": "text-generation"}
    if data_args.dataset_name is not None:
        kwargs["dataset_tags"] = data_args.dataset_name

Kornel · November 26, 2021, 3:26pm

I’d agree with @BramVanroy , any thoughts @joeddav on the above post?

I don’t understand the multiplication by trg_len in this example. Also, on my dataset it inflates the perplexity by orders of magnitude beyond what a uniform model would give (a loss of log(|Vocab Size|), i.e. a perplexity of |Vocab Size|).

naivebird · December 16, 2021, 3:36am

I think it is correct for Perplexity of fixed-length models since batch size is 1.

By the way, most libraries, like simpletransformers, implement the perplexity calculation as exp(sum_of_loss_in_all_batches / num_of_batch); see simpletransformers/language_modeling_model.py at commit 254aaaa218635ef68f80ad1917403e7b7e24d710 in the ThilinaRajapakse/simpletransformers repository on GitHub.
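
To compare the two conventions discussed in this thread, the sketch below computes perplexity from per-window mean losses in both ways: weighting each mean loss by the number of tokens it covers (the multiplication by trg_len) and taking a plain mean over windows. The loss values and window sizes are invented, and both function names are hypothetical.

```python
import math

def ppl_token_weighted(mean_losses, trg_lens):
    """exp(total NLL / total tokens): each window's mean loss is turned
    back into a sum by multiplying by its number of scored tokens."""
    total_nll = sum(loss * n for loss, n in zip(mean_losses, trg_lens))
    return math.exp(total_nll / sum(trg_lens))

def ppl_mean_of_means(mean_losses):
    """exp(mean of per-window mean losses), ignoring window sizes."""
    return math.exp(sum(mean_losses) / len(mean_losses))

losses = [3.0, 2.0]  # invented per-window mean losses, in nats

# Equal-sized windows: both conventions agree.
equal_weighted = ppl_token_weighted(losses, [512, 512])
equal_plain = ppl_mean_of_means(losses)
print(equal_weighted, equal_plain)

# A short final window: the plain mean gives its 8 tokens the same
# weight as the 512 tokens of the first window.
unequal_weighted = ppl_token_weighted(losses, [512, 8])
unequal_plain = ppl_mean_of_means(losses)
print(unequal_weighted, unequal_plain)
```

The two agree whenever every window scores the same number of tokens; they differ only when window sizes vary, as with a shorter last stride, in which case the token-weighted version matches the per-token definition of perplexity.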
