I have converted the llama-2-13b-chat-hf model into an ONNX model using the convert_to_onnx.py script from onnxruntime/onnxruntime/python/tools/transformers/models/llama (main branch of microsoft/onnxruntime on GitHub). The only change I had to make to the script was upgrading the ONNX opset version from 13 to 14; the edit looks roughly like the sketch below.
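A minimal, self-contained illustration of that change, using a toy model rather than the actual script (in convert_to_onnx.py the edit is just the opset argument of the export call, and the variable names there differ):
"import torch

# Toy model for illustration only; the real export in convert_to_onnx.py wraps the LLaMA decoder.
toy_model = torch.nn.Linear(4, 4)

torch.onnx.export(
    toy_model,
    torch.randn(1, 4),
    "toy.onnx",
    opset_version=14,  # this argument was 13 before my change
)"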
Now, however, inference fails with this error:
"[E:onnxruntime:, sequential_executor.cc:514 ExecuteKernel] Non-zero status code returned while running RotaryEmbedding node. Name:'RotaryEmbedding_0' Status Message: Input 'x' is expected to have 3 dimensions, got 4
Exception in thread Thread-5 (generate):
Traceback (most recent call last):
  File "/usr/lib/python3.10/threading.py", line 1016, in _bootstrap_inner
    self.run()
  File "/usr/lib/python3.10/threading.py", line 953, in run
    self._target(*self._args, **self._kwargs)
  File "/home/kainat/.local/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/home/kainat/.local/lib/python3.10/site-packages/transformers/generation/utils.py", line 1764, in generate
    return self.sample(
  File "/home/kainat/.local/lib/python3.10/site-packages/transformers/generation/utils.py", line 2861, in sample
    outputs = self(
  File "/home/kainat/.local/lib/python3.10/site-packages/optimum/modeling_base.py", line 90, in __call__
    return self.forward(*args, **kwargs)
  File "/home/kainat/.local/lib/python3.10/site-packages/optimum/onnxruntime/modeling_decoder.py", line 255, in forward
    self.model.run_with_iobinding(io_binding)
  File "/home/kainat/.local/lib/python3.10/site-packages/onnxruntime/capi/onnxruntime_inference_collection.py", line 331, in run_with_iobinding
    self._sess.run_with_iobinding(iobinding._iobinding, run_options)
RuntimeError: Error in execution: Non-zero status code returned while running RotaryEmbedding node. Name:'RotaryEmbedding_0' Status Message: Input 'x' is expected to have 3 dimensions, got 4"
Here is the inference code I am running:
"from transformers import LlamaConfig, LlamaTokenizer
from optimum.onnxruntime import ORTModelForCausalLM
import torch
# User settings
model_name = "riazk/llama2-13b-merged-peft_kk"
onnx_model_dir = "./onnxruntime/onnxruntime/python/tools/transformers/llama2-13b-merged-int4-gpu/"
cache_dir = "./onnxruntime/onnxruntime/python/tools/transformers/model_cache"
device_id = 0
device = torch.device(f"cuda:{device_id}") # Change to torch.device("cpu") if running on CPU
ep = "CUDAExecutionProvider" # change to CPUExecutionProvider if running on CPU
ep_options = {"device_id": device_id}
prompt = ["ONNX Runtime is ", "I want to book a vacation to Hawaii. First, I need to ", "A good workout routine is ", "How are astronauts launched into space? "]
max_length = 64 # max(prompt length + generation length)
config = LlamaConfig.from_pretrained(model_name, use_auth_token=True, cache_dir=cache_dir)
config.save_pretrained(onnx_model_dir) # Save config file in ONNX model directory
tokenizer = LlamaTokenizer.from_pretrained(model_name, use_auth_token=True, cache_dir=cache_dir)
tokenizer.pad_token = "[PAD]"
model = ORTModelForCausalLM.from_pretrained(
onnx_model_dir,
use_auth_token=True,
use_io_binding=True,
provider=ep,
provider_options={"device_id": device_id} # comment out if running on CPU
)
inputs = tokenizer(prompt, return_tensors="pt", padding=True).to(device)  # tokenize the prompts; `inputs` is used by model.generate below
print("-------------")
generate_ids = model.generate(**inputs, do_sample=False, max_length=max_length)
transcription = tokenizer.batch_decode(generate_ids, skip_special_tokens=True)
print(transcription)
print("-------------")"
I have already run the same conversion and inference scripts with the base llama-2-13b-hf model, and they worked perfectly. So I'm wondering: does this error occur because the conversion script is only tuned for the base LLaMA model and not for the LLaMA 2 "chat" model or a fine-tuned LLaMA 2 model? Or is there a known fix for the error I'm seeing?
I would really appreciate help from anyone who has converted a fine-tuned LLaMA 2 model to ONNX and has it working.
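In case it helps with diagnosing this, here is a small check I plan to run on the exported model to see where the extra (4th) dimension on input 'x' might be coming from. The path is a placeholder for the actual .onnx file in my output directory, and this assumes a recent onnx package (for the load_external_data option):
"import onnx

# Placeholder path; replace with the actual .onnx file produced by convert_to_onnx.py.
model_path = "path/to/exported/model.onnx"

# Load only the graph structure (skip the large external weight files).
m = onnx.load(model_path, load_external_data=False)

# Print the declared shapes of the graph inputs.
for inp in m.graph.input:
    dims = [d.dim_param or d.dim_value for d in inp.type.tensor_type.shape.dim]
    print("input:", inp.name, dims)

# Show which tensors feed the failing RotaryEmbedding node.
for node in m.graph.node:
    if node.op_type == "RotaryEmbedding":
        print("node:", node.name, "inputs:", list(node.input))
        break"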