Error while trying to run inference on Gaudi with a fine-tuned Llama 2 model using the Habana repo

Using this repo here

  1. created a container image using the Dockerfile mentioned in the instructions
  2. ran it on two compute nodes: one with 1 HPU, 16 CPUs, and 32 GB of memory, and one with 1 HPU, 50 CPUs, and 200 GB of memory

Hi @gildesh, could you share the command you used to run inference please?

thanks for replying!

Running it as an endpoint with these params:

#model_config={"model_name": "/root/.cache/intel/neural-chat-7b-v2", "tokenizer_name": "/root/.cache/intel/llama/neural-chat-7b-v2", "device": "hpu", "use_hpu_graphs": true, "peft_path": "/input/finetune/output/peft_model"}
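For context, those settings would typically map to something like the following when loading the model (a minimal sketch only, not the actual generate.py from the repo; the paths come from the model_config above, and the HPU-graph wrapping is an assumption based on use_hpu_graphs being true):

```python
import torch
import habana_frameworks.torch as ht  # registers the "hpu" device
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

# Paths taken from the model_config above.
base_model_path = "/root/.cache/intel/neural-chat-7b-v2"
tokenizer_path = "/root/.cache/intel/llama/neural-chat-7b-v2"
peft_path = "/input/finetune/output/peft_model"

tokenizer = AutoTokenizer.from_pretrained(tokenizer_path)
model = AutoModelForCausalLM.from_pretrained(base_model_path, torch_dtype=torch.bfloat16)

# Apply the fine-tuned PEFT adapter on top of the base model ("peft_path").
model = PeftModel.from_pretrained(model, peft_path)

# Move to Gaudi and wrap in HPU graphs ("device": "hpu", "use_hpu_graphs": true).
model = model.eval().to("hpu")
model = ht.hpu.wrap_in_hpu_graph(model)
```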

Could you share the generate.py file that is used in this endpoint?

Thanks!

I don't have access to intel/neural-chat-7b-v2; it doesn't seem to be on the Hugging Face Hub. Do you have a config.json file somewhere? If yes, could you tell me the value of the model_type field, please?
For instance, for Intel/neural-chat-7b-v1-1, I see that the model is based on MPT: config.json · Intel/neural-chat-7b-v1-1 at main
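For reference, a quick way to print that field (a minimal sketch; the path is the one from your model_config above, so adjust it to wherever your config.json actually lives):

```python
from transformers import AutoConfig

# Path taken from the model_config shared earlier in this thread.
config = AutoConfig.from_pretrained("/root/.cache/intel/neural-chat-7b-v2")
print(config.model_type)
```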

```json
"_name_or_path": "/models/llama-v2-latest-20230719/models_hf/Llama-2-7b",
"architectures": [
  "LlamaForCausalLM"
],
"bos_token_id": 1,
"eos_token_id": 2,
"hidden_act": "silu",
"hidden_size": 4096,
"initializer_range": 0.02,
"intermediate_size": 11008,
"max_position_embeddings": 2048,
"model_type": "llama",
```

And which version of Optimum Habana do you use?

Actually, we probably just use the latest version, since we source it from this repo.

We just have this in requirements.txt:
optimum

Could you show me the output of pip show optimum-habana, please?
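If it helps, an equivalent check from Python (this just reads the installed package metadata, so it reports the same version as pip show):

```python
import importlib.metadata

# Prints the installed optimum-habana version; raises PackageNotFoundError if the
# package is not installed at all.
print(importlib.metadata.version("optimum-habana"))
```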