Using this repo here
- Created a container image using the Dockerfile mentioned in the instructions
- Ran it on 2 compute instances: one with 1 HPU, 16 CPUs, and 32 GB of memory, and one with 1 HPU, 50 CPUs, and 200 GB of memory
Hi @gildesh, could you share the command you used to run inference please?
Thanks for replying!
I'm running it as an endpoint with these params:
model_config={"model_name": "/root/.cache/intel/neural-chat-7b-v2", "tokenizer_name": "/root/.cache/intel/llama/neural-chat-7b-v2", "device": "hpu", "use_hpu_graphs": true, "peft_path": "/input/finetune/output/peft_model"}
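For context, here is a rough sketch of how a config like this is typically consumed when loading the model. It is purely illustrative, not the repo's actual generate.py: it assumes a standard transformers + peft loading path, that the local paths above exist, and that the Habana PyTorch plugin is installed so the "hpu" device is available.

# Hypothetical sketch only -- not the actual serving script from the repo.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

model_config = {
    "model_name": "/root/.cache/intel/neural-chat-7b-v2",
    "tokenizer_name": "/root/.cache/intel/llama/neural-chat-7b-v2",
    "device": "hpu",
    "use_hpu_graphs": True,  # normally handled via optimum-habana; omitted in this sketch
    "peft_path": "/input/finetune/output/peft_model",
}

tokenizer = AutoTokenizer.from_pretrained(model_config["tokenizer_name"])
model = AutoModelForCausalLM.from_pretrained(
    model_config["model_name"], torch_dtype=torch.bfloat16
)
# Attach the adapter produced by the fine-tuning step.
model = PeftModel.from_pretrained(model, model_config["peft_path"])
model = model.to(model_config["device"])
model.eval()

inputs = tokenizer("Hello!", return_tensors="pt").to(model_config["device"])
with torch.no_grad():
    output_ids = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))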
Could you share the generate.py file that is used in this endpoint? Thanks!
I don't have access to intel/neural-chat-7b-v2, it doesn't seem to be on the Hugging Face Hub. Do you have a config.json file somewhere? If yes, could you tell me the value of the field model_type please?
For instance, for Intel/neural-chat-7b-v1-1, I see that the model is based on MPT: config.json · Intel/neural-chat-7b-v1-1 at main
_name_or_path": “/models/llama-v2-latest-20230719/models_hf/Llama-2-7b”,
“architectures”: [
“LlamaForCausalLM”
],
“bos_token_id”: 1,
“eos_token_id”: 2,
“hidden_act”: “silu”,
“hidden_size”: 4096,
“initializer_range”: 0.02,
“intermediate_size”: 11008,
“max_position_embeddings”: 2048,
“model_type”: “llama”,
And which version of Optimum Habana do you use?
Actually, we probably just use the latest version, since we source it from this repo.
We just put this in requirements.txt:
optimum
Could you show me the output of pip show optimum-habana please?