@RaushanTurganbay This is the code sample:
```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained(
    pretrained_model_path,
    device_map="cuda",
    torch_dtype="auto",
    trust_remote_code=True,
    output_hidden_states=True,
    return_dict_in_generate=True,
)
tokenizer = AutoTokenizer.from_pretrained(tokenizer_model_path)

messages = [
    {"role": "user", "content": prompt},
]

# CustomPipeline is my own Pipeline subclass (defined elsewhere)
pipe = CustomPipeline(
    model=model,
    tokenizer=tokenizer,
)
```
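The call itself isn't shown above; roughly, the chat messages are passed straight to the pipeline (CustomPipeline is my own subclass, so treat this as a sketch rather than the exact call):

```python
# Sketch of the invocation (exact arguments omitted); the outputs below are what
# this call returns with the two settings of return_dict_in_generate.
output = pipe(messages)
print(output)
```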
With `return_dict_in_generate=True`, I get the following output:
```
[
  {'role': 'user', 'content': 'Hello How are you?'},
  {
    'role': 'assistant',
    'content': " Hello! I'm doing well. How about you? How can I help you today? Hello! I'm an AI, so I don't ha
What can I do for you today? Greetings! As an AI, I don't have personal experiences, but I'm fully operational and r
assistance you need. What's on your mind?"
  }
]
```
And when `return_dict_in_generate=False`:
```
You are not running the flash-attention implementation, expect numerical differences.
[{'role': 'user', 'content': 'Hello How are you?'}, {'role': 'assistant', 'content': " Hello! I'm doing well. How about you?
```
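In both cases the decoded chat text is identical. For reference, here is a minimal sketch (assuming the same `model`, `tokenizer`, and `messages` as above, and an arbitrary `max_new_tokens`) of what `return_dict_in_generate=True` is supposed to change when calling `generate()` directly instead of through the pipeline:

```python
# Build the prompt ids from the chat messages.
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

out = model.generate(
    input_ids,
    max_new_tokens=64,              # illustrative value
    return_dict_in_generate=True,   # return a ModelOutput instead of a plain tensor of ids
    output_hidden_states=True,      # populate out.hidden_states
)

print(type(out))            # e.g. GenerateDecoderOnlyOutput
print(out.sequences.shape)  # prompt + generated token ids
# out.hidden_states has one entry per generation step, each a tuple with one tensor per layer
print(len(out.hidden_states), len(out.hidden_states[0]))
print(tokenizer.decode(out.sequences[0], skip_special_tokens=True))
```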