Difference in returned sequence for Phi-3 model depending on return_dict_in_generate

I ran the model with flash_attention_2. Below is a minimal sketch of the setup, followed by the outputs with return_dict_in_generate off and on.
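This is only a sketch of the kind of script I ran; the model ID, dtype, and max_new_tokens below are illustrative assumptions, not necessarily the exact values from my run.

# Minimal reproduction sketch (assumed setup; exact checkpoint and generation
# arguments may differ from my actual run).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "microsoft/Phi-3-mini-4k-instruct"  # assumed Phi-3 checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",
    device_map="auto",
)

messages = [{"role": "user", "content": "Hello how are you?"}]
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

# Toggle return_dict_in_generate between the two runs.
outputs = model.generate(
    input_ids,
    max_new_tokens=128,
    do_sample=False,
    return_dict_in_generate=True,  # set to False for the "Off" run
)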

Off:

Loading checkpoint shards: 100%|████████████████| 2/2 [00:03<00:00,  1.60s/it]
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
/opt/conda/lib/python3.10/site-packages/transformers/generation/configuration_utils.py:515: UserWarning: `do_sample` is set to `False`. However, `temperature` is set to `0.0` -- this flag is only used in sample-based generation modes. You should set `do_sample=True` or unset `temperature`.
  warnings.warn(
[
    {'role': 'user', 'content': 'Hello how are you?'},
    {
        'role': 'assistant',
        'content': " Hello! I'm doing well. How about you? How can I help you today?"
    }
]

On:

Loading checkpoint shards: 100%|█████████████| 2/2 [00:03<00:00,  1.65s/it]
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
/opt/conda/lib/python3.10/site-packages/transformers/generation/utils.py:1283: UserWarning: You have modified the pretrained model configuration to control generation. This is a deprecated strategy to control generation and will be removed soon, in a future version. Please use and modify the model generation configuration (see https://huggingface.co/docs/transformers/generation_strategies#default-text-generation-configuration )
  warnings.warn(
/opt/conda/lib/python3.10/site-packages/transformers/generation/configuration_utils.py:515: UserWarning: `do_sample` is set to `False`. However, `temperature` is set to `0.0` -- this flag is only used in sample-based generation modes. You should set `do_sample=True` or unset `temperature`.
  warnings.warn(
[
    {'role': 'user', 'content': 'Hello how are you?'},
    {
        'role': 'assistant',
        'content': " Hello! I'm doing well. How about you? How can I help you today? Hello! I'm just a computer program, but I'm functioning optimally. Thank you for asking! How can I assist you?"
    }
]

The generated sequence is consistently longer when return_dict_in_generate is set to True.
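For what it's worth, this is how I compare the lengths of the two outputs (a sketch assuming the model and input_ids from the snippet above): with return_dict_in_generate=False, generate returns a plain tensor of token IDs, while with True it returns a generate output object whose .sequences field holds the IDs.

# Length comparison sketch (assumes `model` and `input_ids` from the setup above).
plain_ids = model.generate(
    input_ids, max_new_tokens=128, do_sample=False, return_dict_in_generate=False
)
dict_out = model.generate(
    input_ids, max_new_tokens=128, do_sample=False, return_dict_in_generate=True
)

# plain_ids is a [batch, seq_len] tensor; dict_out.sequences has the same layout,
# so the total token counts can be compared directly.
print("off:", plain_ids.shape[-1], "on:", dict_out.sequences.shape[-1])

I would expect both counts to be identical, but the run with return_dict_in_generate=True keeps producing more tokens, as the outputs above show.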