Difference in return sequence for Phi3 model

Hi, I am getting a different sequence from microsoft/Phi-3-mini-4k-instruct when I set return_dict_in_generate=True in model.generate.

Example output:

    {'role': 'user', 'content': 'Hello How are you?'},                                                                                 
        'role': 'assistant',                                                                                                           
        'content': " Hello! I'm doing well. How about you? How can I help you today? Hello! I'm an AI, so I don't have feelings, but   
I'm here and ready to assist you. What can I do for you today? Greetings! As an AI, I don't have personal experiences, but I'm fully   
operational and ready to provide you with any information or assistance you need. What's on your mind?"                                

Hey! Can you please explain what you mean by “different sequence” and provide minimal reproducible code?

In general, using “return_dict” or not should not affect what text is generated unless you’re setting “do_sample=True” in generation.
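
To make “different sequence” concrete, it can help to compare the two runs token by token rather than by eye. The sketch below is only an illustration: `extract_sequences` and `first_divergence` are hypothetical helper names, and the token ids are made up. It relies on the fact that `generate` returns an object with a `.sequences` attribute when `return_dict_in_generate=True` and a plain tensor of ids otherwise; with real tensors you would pass one row converted via `.tolist()`.

```python
from types import SimpleNamespace
from typing import Optional, Sequence


def extract_sequences(output):
    """Return token ids from either a plain result or an object with `.sequences`."""
    return output.sequences if hasattr(output, "sequences") else output


def first_divergence(ids_a: Sequence[int], ids_b: Sequence[int]) -> Optional[int]:
    """Index of the first differing position, or None if the sequences are identical."""
    for i, (a, b) in enumerate(zip(ids_a, ids_b)):
        if a != b:
            return i
    if len(ids_a) != len(ids_b):
        # One sequence is a strict prefix of the other.
        return min(len(ids_a), len(ids_b))
    return None


# Stand-ins for two generate() results (made-up token ids, not real model output).
plain_result = [1, 15043, 29991, 32007]
dict_result = SimpleNamespace(sequences=[1, 15043, 29991, 32007, 15043, 29991])

idx = first_divergence(extract_sequences(plain_result), extract_sequences(dict_result))
print(idx)  # 4: the runs agree up to the shorter length, then one keeps going
```

A divergence index equal to the length of the shorter run suggests the two runs agree token for token and differ only in where they stop.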

@RaushanTurganbay This is the code sample:

model = AutoModelForCausalLM.from_pretrained(
    tokenizer = AutoTokenizer.from_pretrained(tokenizer_model_path)
    messages = [
        {"role": "user", "content": prompt},
    pipe = CustomPipeline(

With return_dict_in_generate=True, I get the following output:

    {'role': 'user', 'content': 'Hello How are you?'},                                                              
        'role': 'assistant',                                                                                        
        'content': " Hello! I'm doing well. How about you? How can I help you today? Hello! I'm an AI, so I don't ha
What can I do for you today? Greetings! As an AI, I don't have personal experiences, but I'm fully operational and r
assistance you need. What's on your mind?"                                                                          

And with return_dict_in_generate=False:

You are not running the flash-attention implementation, expect numerical differences.                               
[{'role': 'user', 'content': 'Hello How are you?'}, {'role': 'assistant', 'content': " Hello! I'm doing well. How about you?

@amyeroberts @joaogante Are you aware of whether this is an issue? I saw your discussions in some of the PRs.


Can you please provide a more extensive reproducer as you’re not calling the pipeline in the snippet above. I also see a warning regarding Flash Attention which might explain the differences.

@nielsr Here you go

import sys
import pdb
import torch
from rich import print
from transformers import AutoTokenizer, Phi3ForCausalLM, AutoModelForCausalLM, pipeline
from transformers import TextGenerationPipeline


class CustomPipeline(TextGenerationPipeline):

    def _forward(self, model_inputs, **generate_kwargs):
        input_ids = model_inputs["input_ids"]
        attention_mask = model_inputs.get("attention_mask", None)
        # Allow empty prompts
        if input_ids.shape[1] == 0:
            input_ids = None
            attention_mask = None
            in_b = 1
        else:
            in_b = input_ids.shape[0]
        prompt_text = model_inputs.pop("prompt_text")

        # If there is a prefix, we may need to adjust the generation length. Do so without permanently modifying
        # generate_kwargs, as some of the parameterization may come from the initialization of the pipeline.
        prefix_length = generate_kwargs.pop("prefix_length", 0)
        if prefix_length > 0:
            has_max_new_tokens = "max_new_tokens" in generate_kwargs or (
                "generation_config" in generate_kwargs
                and generate_kwargs["generation_config"].max_new_tokens is not None
            )
            if not has_max_new_tokens:
                generate_kwargs["max_length"] = generate_kwargs.get("max_length") or self.model.config.max_length
                generate_kwargs["max_length"] += prefix_length
            has_min_new_tokens = "min_new_tokens" in generate_kwargs or (
                "generation_config" in generate_kwargs
                and generate_kwargs["generation_config"].min_new_tokens is not None
            )
            if not has_min_new_tokens and "min_length" in generate_kwargs:
                generate_kwargs["min_length"] += prefix_length

        # BS x SL
        output_dict = self.model.generate(input_ids=input_ids, attention_mask=attention_mask, **generate_kwargs)
        generated_sequence = output_dict.sequences if self.model.config.return_dict_in_generate else output_dict
        hidden_states = output_dict.hidden_states if self.model.config.return_dict_in_generate else None
        out_b = generated_sequence.shape[0]
        if self.framework == "pt":
            generated_sequence = generated_sequence.reshape(in_b, out_b // in_b, *generated_sequence.shape[1:])
        elif self.framework == "tf":
            generated_sequence = tf.reshape(generated_sequence, (in_b, out_b // in_b, *generated_sequence.shape[1:]))
        return {"generated_sequence": generated_sequence, "input_ids": input_ids, "prompt_text": prompt_text, "hidden_states": hidden_states}

    def postprocess(self, model_outputs, **kwargs):
        generated_sequence = model_outputs["generated_sequence"][0]
        input_ids = model_outputs["input_ids"]
        prompt_text = model_outputs["prompt_text"]
        generated_sequence = generated_sequence.numpy().tolist()
        records = []
        for sequence in generated_sequence:
            # Decode text
            text = self.tokenizer.decode(
                sequence,
                skip_special_tokens=True,
                clean_up_tokenization_spaces=kwargs.get("clean_up_tokenization_spaces", True),
            )

            # Remove PADDING prompt of the sequence if XLNet or Transfo-XL model is used
            if input_ids is None:
                prompt_length = 0
            else:
                prompt_length = len(
                    self.tokenizer.decode(
                        input_ids[0],
                        skip_special_tokens=True,
                        clean_up_tokenization_spaces=kwargs.get("clean_up_tokenization_spaces", True),
                    )
                )

            all_text = text[prompt_length:]
            if isinstance(prompt_text, str):
                all_text = prompt_text + all_text
            else:
                # Explicit list parsing is necessary for parsing chat datasets
                all_text = list(prompt_text.messages) + [{"role": "assistant", "content": all_text}]

            record = {"generated_text": all_text}
            records.append(record)

        return records, model_outputs.get("hidden_states", None)

def run_hf_pipeline(prompt: str, pretrained_model_path: str, tokenizer_model_path: str) -> str:
    model = AutoModelForCausalLM.from_pretrained(
    tokenizer = AutoTokenizer.from_pretrained(tokenizer_model_path)
    messages = [
        {"role": "user", "content": prompt},
    ]
    pipe = CustomPipeline(
    generation_args = {
        "max_new_tokens": 500,
        "return_full_text": False,
        "temperature": 0.0,
        "do_sample": False,
    }
    output, hidden_states = pipe(messages, **generation_args)

    return output[0]['generated_text']

def run(prompt: str, pretrained_model_path: str, tokenizer_model_path: str) -> str:
    model = Phi3ForCausalLM.from_pretrained(pretrained_model_path)
    tokenizer = AutoTokenizer.from_pretrained(tokenizer_model_path)

    inputs = tokenizer(prompt, return_tensors="pt")
    generate_ids = model.generate(inputs.input_ids, max_length=30)
    return tokenizer.batch_decode(generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)[0]

if __name__ == "__main__":
    prompt = sys.argv[1]
    pretrained_model_path = sys.argv[2]
    tokenizer_model_path = sys.argv[3] if len(sys.argv) > 3 else pretrained_model_path
    print(run_hf_pipeline(prompt, pretrained_model_path, tokenizer_model_path))

The Flash Attention warning is observed in both cases, but let me see if I can avoid it.

I ran with flash_attention_2. Here are outputs with return_dict_in_generate on and off:


Loading checkpoint shards: 100%|████████████████| 2/2 [00:03<00:00,  1.60  
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
utils.py:515: UserWarning: `do_sample` is set to `False`. However, `temperature` is set to `0.0` -- this flag is only used in sample-based generation modes. You should set `do_sample=True` or unset `temperature`.
    {'role': 'user', 'content': 'Hello how are you?'},                     
        'role': 'assistant',                                               
        'content': " Hello! I'm doing well. How about you? How can I help  


Loading checkpoint shards: 100%|█████████████| 2/2 [00:03<00:00,  1.65s/it]
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
83: UserWarning: You have modified the pretrained model configuration to control generation. This is a deprecated strategy to control generation and will be removed soon, in a future version. Please use and modify the model generation configuration (see https://huggingface.co/docs/transformers/generation_strategies#default-text-generation-configuration )
on_utils.py:515: UserWarning: `do_sample` is set to `False`. However, `temperature` is set to `0.0` -- this flag is only used in sample-based generation modes. You should set `do_sample=True` or unset `temperature`.
    {'role': 'user', 'content': 'Hello how are you?'},                     
        'role': 'assistant',                                               
        'content': " Hello! I'm doing well. How about you? How can I help  
you today? Hello! I'm just a computer program, but I'm functioning         
optimally. Thank you for asking! How can I assist you?"                    

The generated sequence is always longer when return_dict_in_generate is set to True.

I think the change in generation config happening here might be the reason:

GenerationConfig {                                                         
  "bos_token_id": 1,                                                       
  "eos_token_id": [                                                        
  "pad_token_id": 32000                                                    
(Pdb) n                                                                    
> /opt/conda/lib/python3.10/site-packages/transformers/generation/utils.py(
-> if new_generation_config != self.generation_config:                     
(Pdb) p new_generation_config                                              
GenerationConfig {                                                         
  "bos_token_id": 1,                                                       
  "eos_token_id": 32000,                                                   
  "output_hidden_states": true,                                            
  "pad_token_id": 32000,                                                   
  "return_dict_in_generate": true                                          

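The two configs printed above can be compared mechanically rather than by eye. The following is a minimal sketch, assuming both configs have been turned into plain dicts (for a real `GenerationConfig`, `to_dict()` does that). `diff_configs` is a hypothetical helper, and because the `eos_token_id` list is cut off in the pdb output, the three ids used for it here are illustrative placeholders only.

```python
from typing import Any, Dict, Tuple

UNSET = "<unset>"  # marker for a key that is absent from one of the configs


def diff_configs(old: Dict[str, Any], new: Dict[str, Any]) -> Dict[str, Tuple[Any, Any]]:
    """Return {key: (old_value, new_value)} for every key whose value differs."""
    changed = {}
    for key in sorted(set(old) | set(new)):
        before, after = old.get(key, UNSET), new.get(key, UNSET)
        if before != after:
            changed[key] = (before, after)
    return changed


# Placeholder values: the eos_token_id list is an assumption, not taken from the log.
default_config = {"bos_token_id": 1, "eos_token_id": [32000, 32001, 32007], "pad_token_id": 32000}
new_config = {
    "bos_token_id": 1,
    "eos_token_id": 32000,
    "output_hidden_states": True,
    "pad_token_id": 32000,
    "return_dict_in_generate": True,
}

config_diff = diff_configs(default_config, new_config)
for key, (before, after) in config_diff.items():
    print(f"{key}: {before} -> {after}")
```

Besides the two flags that were set on purpose, the only other key that changes is `eos_token_id`, which goes from a list of ids to a single id.
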
Confirming that the following change fixes the issue. The key change is adding eos_token_id.
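
One plausible reading of why a narrower `eos_token_id` yields longer output is that generation only stops on ids in that set, so a dropped end-of-turn id lets the model run on into further "turns". The toy loop below sketches that idea under stated assumptions: `toy_generate` and `normalize_eos` are hypothetical helpers, the token stream is scripted rather than produced by a model, and 32007 and 32000 are placeholder ids standing in for an end-of-turn and an end-of-text token.

```python
from typing import Iterable, List, Optional, Set, Union


def normalize_eos(eos_token_id: Union[int, Iterable[int], None]) -> Set[int]:
    """Accept a single id, a list of ids, or None, and return a set of stop ids."""
    if eos_token_id is None:
        return set()
    if isinstance(eos_token_id, int):
        return {eos_token_id}
    return set(eos_token_id)


def toy_generate(stream: List[int], eos_token_id: Optional[Union[int, Iterable[int]]],
                 max_new_tokens: int) -> List[int]:
    """Consume a scripted token stream until a stop id or the token budget is reached."""
    stop_ids = normalize_eos(eos_token_id)
    out: List[int] = []
    for token in stream[:max_new_tokens]:
        out.append(token)
        if token in stop_ids:
            break
    return out


# Scripted "model output": three answers, each ended by the placeholder end-of-turn id.
scripted = [10, 11, 32007, 12, 13, 32007, 14, 32000]

with_both = toy_generate(scripted, [32000, 32007], max_new_tokens=500)
narrowed = toy_generate(scripted, 32000, max_new_tokens=500)

print(len(with_both), len(narrowed))  # 3 8
```

If that reading applies here, one way to restore the original stopping behaviour would be to pass the stop ids explicitly, for example by adding `"eos_token_id": model.generation_config.eos_token_id` to the generation arguments. This is an assumption about the shape of the change, not a copy of the poster's patch.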

@exploiter345 Thanks for the detailed analysis. It seems to be related to this issue. The fix is already merged into main, so after updating transformers the code should work without providing “eos_token_id” explicitly in the generation_config:

!pip install --upgrade git+https://github.com/huggingface/transformers.git

