Stochastic Sampling with Trainer.evaluate() Logits

Hello all, I was recently using the evaluate function of the Hugging Face Trainer to evaluate an older fine-tune of GPT-2, specifically to compare its performance across different decoding methods with a compute_metrics callback.

The issue is that the evaluate function returns the raw logits of the predictions, which necessitates manually implementing decoding strategies. While this is relatively simple for greedy decoding (and has been documented here), I have found little information on decoding the prediction logits via stochastic sampling (specifically top_p and top_k).

Is there an easy way to decode the raw prediction logits utilizing stochastic sampling?

Hi, I am looking into the same issue, but I am also interested in evaluating with beam search. Do you know whether the already computed logits are produced with a default greedy technique? Since GPT-2 is an autoregressive model and the next logits depend on the previous tokens, wouldn't it be pointless to try any sampling technique other than greedy in the compute_metrics function?

Hello, thanks for your reply. Over the last few days, I mostly came to the following conclusion:

TL;DR You have two options with regard to evaluation:

  1. Skipping Trainer.evaluate() altogether, calling model.generate() yourself, and comparing the generations against your dataset’s labels (a minimal sketch of this approach follows the list)
  2. Implementing parts of model.generate() in the decoding process (whether in the compute_metrics or preprocess_logits_for_metrics callbacks, or by injecting support for model.generate() directly into Trainer.prediction_step, similar to how the Seq2SeqTrainer class achieves this)
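
For reference, here is a rough sketch of option 1, i.e. skipping Trainer.evaluate() and calling model.generate() directly, then comparing against the labels. The checkpoint path, prompts, references, and sampling parameters below are just placeholders, not my actual setup:

    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained("my-gpt2-finetune")  # placeholder path
    model = AutoModelForCausalLM.from_pretrained("my-gpt2-finetune").eval()

    prompts = ["Example prompt 1", "Example prompt 2"]           # your eval inputs
    references = ["Example reference 1", "Example reference 2"]  # your dataset's labels

    generations = []
    for prompt in prompts:
        inputs = tokenizer(prompt, return_tensors="pt")
        with torch.no_grad():
            # Stochastic sampling (top-k / top-p) is handled entirely by generate()
            output_ids = model.generate(
                **inputs,
                do_sample=True,
                top_k=50,
                top_p=0.95,
                max_new_tokens=64,
                pad_token_id=tokenizer.eos_token_id,
            )
        generations.append(tokenizer.decode(output_ids[0], skip_special_tokens=True))

    # Compare `generations` against `references` with whatever metric you prefer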

Given the hacky nature of the latter, and that similar classes (e.g. Seq2SeqTrainer) use model.generate() anyway, I chose the former. For anyone still interested in the finer details of the issue, however, my observations are below:

The decoding snippet from the article I linked earlier, np.argmax(logits, axis=-1), seems to choose the most probable token at each position of the input sequence input_ids; for instance, given a 1x1024x50267 tensor (1 batch x 1024-token input sequence x 50267 vocab size), it will return a 1x1024 tensor.
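
To illustrate the shape reduction with dummy data (the sizes below are just the ones from my example):

    import numpy as np

    # Dummy logits with the shape returned for one evaluation batch:
    # (batch, sequence_length, vocab_size)
    logits = np.random.randn(1, 1024, 50267)

    # argmax over the vocab dimension picks the most probable token at each position
    token_ids = np.argmax(logits, axis=-1)
    print(token_ids.shape)  # (1, 1024)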

It looks greedy, but it does not seem autoregressive. I mostly say this because looking through an old version of the generate function (mostly because the current generate function is much more complex) gives a clear example of what autoregressive decoding looks like:

    while cur_len < max_length:
        model_inputs = self.prepare_inputs_for_generation(
            input_ids, past=past, attention_mask=attention_mask, use_cache=use_cache, **model_specific_kwargs
        )

        outputs = self(**model_inputs)
        next_token_logits = outputs[0][:, -1, :]
        # ... (truncated: `scores` below are the postprocessed next_token_logits;
        #      caches, attention masks, and other details are also omitted)
        if do_sample:
            # Temperature (higher temperature => more likely to sample low probability tokens)
            if temperature != 1.0:
                scores = scores / temperature
            # Top-p/top-k filtering
            next_token_logscores = top_k_top_p_filtering(scores, top_k=top_k, top_p=top_p)
            # Sample
            probs = F.softmax(next_token_logscores, dim=-1)
            next_token = torch.multinomial(probs, num_samples=1).squeeze(1)
        else:
            # Greedy decoding
            next_token = torch.argmax(next_token_logits, dim=-1)
        # ... (truncated: next_token is appended to input_ids and cur_len is incremented)

Regardless of the decoding method in model.generate(), each generation step (even in greedy decoding) depends on a gradually growing input_ids tensor, versus taking the argmax of the logits only once. Probably the easiest way to utilize the returned logits from the Trainer without changing the Trainer class would be to implement something similar to the old generate function I showed above. (Sidenote: I believe the lack of a returned attention mask does not matter.)
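
For completeness, here is a rough sketch of what stochastic sampling over the already returned logits could look like, e.g. as a preprocess_logits_for_metrics callback. Keep in mind that this only samples per position from the teacher-forced logits, so it is still not true autoregressive decoding, and the filtering helper is my own simplified re-implementation rather than a transformers API (it also assumes the logits arrive as a single tensor, not a tuple):

    import torch

    def filter_top_k_top_p(logits, top_k=50, top_p=0.95):
        # logits: (batch, seq_len, vocab_size)
        if top_k > 0:
            # Mask everything below the k-th largest logit at each position
            kth_values = torch.topk(logits, top_k, dim=-1).values[..., -1, None]
            logits = logits.masked_fill(logits < kth_values, float("-inf"))
        if top_p < 1.0:
            sorted_logits, sorted_idx = torch.sort(logits, descending=True, dim=-1)
            cumulative_probs = torch.softmax(sorted_logits, dim=-1).cumsum(dim=-1)
            # Remove tokens whose cumulative probability exceeds top_p,
            # but always keep the single most probable token
            remove = cumulative_probs > top_p
            remove[..., 1:] = remove[..., :-1].clone()
            remove[..., 0] = False
            remove = remove.scatter(-1, sorted_idx, remove)
            logits = logits.masked_fill(remove, float("-inf"))
        return logits

    def preprocess_logits_for_metrics(logits, labels):
        # Per-position sampling from the filtered distribution (NOT autoregressive)
        filtered = filter_top_k_top_p(logits, top_k=50, top_p=0.95)
        probs = torch.softmax(filtered, dim=-1)
        batch, seq_len, vocab = probs.shape
        token_ids = torch.multinomial(probs.view(-1, vocab), num_samples=1)
        return token_ids.view(batch, seq_len)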

Interestingly, the Seq2SeqTrainer class does support generating text during evaluation while the Trainer class does not:

    generated_tokens = self.model.generate(**generation_inputs, **gen_kwargs)
    # ... (truncated: padding and adjusting the length of the generations to match the labels, etc.)
    # Note how the loss is not calculated from these generations; it still uses the model's logits
    with torch.no_grad():
        if has_labels:
            with self.compute_loss_context_manager():
                outputs = model(**inputs)
            if self.label_smoother is not None:
                loss = self.label_smoother(outputs, inputs["labels"]).mean().detach()
            else:
                loss = (outputs["loss"] if isinstance(outputs, dict) else outputs[0]).mean().detach()
        else:
            loss = None
    return loss, generated_tokens, labels
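
(For context, this path is enabled through the predict_with_generate flag of Seq2SeqTrainingArguments. Below is a rough usage sketch with a placeholder encoder-decoder model and a dummy one-example eval set; note this applies to seq2seq models, not directly to a decoder-only GPT-2 setup.)

    from datasets import Dataset
    from transformers import (
        AutoModelForSeq2SeqLM,
        AutoTokenizer,
        DataCollatorForSeq2Seq,
        Seq2SeqTrainer,
        Seq2SeqTrainingArguments,
    )

    model_name = "t5-small"  # placeholder seq2seq checkpoint
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

    # Tiny dummy eval set; in practice use your own dataset
    raw = Dataset.from_dict({"source": ["translate English to German: Hello"],
                             "target": ["Hallo"]})

    def tokenize(batch):
        model_inputs = tokenizer(batch["source"], truncation=True)
        # Targets tokenized with the same tokenizer for simplicity
        model_inputs["labels"] = tokenizer(batch["target"], truncation=True)["input_ids"]
        return model_inputs

    eval_dataset = raw.map(tokenize, batched=True, remove_columns=raw.column_names)

    args = Seq2SeqTrainingArguments(output_dir="out", predict_with_generate=True)
    trainer = Seq2SeqTrainer(
        model=model,
        args=args,
        eval_dataset=eval_dataset,
        data_collator=DataCollatorForSeq2Seq(tokenizer, model=model),
    )
    # max_length / num_beams are forwarded to model.generate() during evaluation
    metrics = trainer.evaluate(max_length=32, num_beams=4)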

I am not sure why the Trainer class lacks this feature, though if someone wanted to override the Trainer.prediction_step() function to add model.generate() support, the Seq2SeqTrainer.prediction_step() function would be a good place to start.
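
For anyone who goes that route, here is an untested sketch of what such an override could look like for a causal LM. This is not how Trainer actually implements evaluation, the sampling parameters are placeholders, and it glosses over padding/length alignment of the generations against the labels:

    import torch
    from transformers import Trainer

    class GenerationTrainer(Trainer):
        """Minimal Trainer subclass that returns generated tokens instead of
        raw logits during evaluation, loosely modeled on
        Seq2SeqTrainer.prediction_step."""

        def prediction_step(self, model, inputs, prediction_loss_only, ignore_keys=None):
            if prediction_loss_only:
                return super().prediction_step(
                    model, inputs, prediction_loss_only, ignore_keys=ignore_keys
                )

            inputs = self._prepare_inputs(inputs)
            labels = inputs.get("labels")

            # Stochastic sampling is handled by generate(); values are placeholders
            generated_tokens = model.generate(
                input_ids=inputs["input_ids"],
                attention_mask=inputs.get("attention_mask"),
                do_sample=True,
                top_k=50,
                top_p=0.95,
                max_new_tokens=64,
            )

            # The loss is still computed from a normal (teacher-forced) forward pass
            with torch.no_grad():
                outputs = model(**inputs)
            loss = outputs.loss.detach() if labels is not None else None

            return loss, generated_tokens, labels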

