Can I convert a Llama 2 "chat" model into ONNX using the llama/convert_to_onnx.py script?

I have converted the llama-2-13b-chat-hf model into an ONNX model using the convert_to_onnx.py script from onnxruntime/onnxruntime/python/tools/transformers/models/llama (at main · microsoft/onnxruntime · GitHub).

I just had to make one change in the script to upgrade the ONNX opset version from 13 to 14.
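For reference, the change was essentially bumping the opset_version passed to torch.onnx.export. A minimal sketch of what that looks like (convert_to_onnx.py builds the real Llama decoder, inputs, and dynamic axes itself; the dummy module here is only illustrative):

import torch

# Stand-in module just to show the opset bump; the real script exports the Llama decoder.
dummy = torch.nn.Linear(8, 8)
torch.onnx.export(
    dummy,
    torch.randn(1, 8),
    "dummy.onnx",
    opset_version=14,  # the script originally passed 13
)

With that change the export went through, but now I'm seeing this error at inference time: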

[E:onnxruntime:, sequential_executor.cc:514 ExecuteKernel] Non-zero status code returned while running RotaryEmbedding node. Name:'RotaryEmbedding_0' Status Message: Input 'x' is expected to have 3 dimensions, got 4
Exception in thread Thread-5 (generate):
Traceback (most recent call last):
  File "/usr/lib/python3.10/threading.py", line 1016, in _bootstrap_inner
    self.run()
  File "/usr/lib/python3.10/threading.py", line 953, in run
    self._target(*self._args, **self._kwargs)
  File "/home/kainat/.local/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/home/kainat/.local/lib/python3.10/site-packages/transformers/generation/utils.py", line 1764, in generate
    return self.sample(
  File "/home/kainat/.local/lib/python3.10/site-packages/transformers/generation/utils.py", line 2861, in sample
    outputs = self(
  File "/home/kainat/.local/lib/python3.10/site-packages/optimum/modeling_base.py", line 90, in __call__
    return self.forward(*args, **kwargs)
  File "/home/kainat/.local/lib/python3.10/site-packages/optimum/onnxruntime/modeling_decoder.py", line 255, in forward
    self.model.run_with_iobinding(io_binding)
  File "/home/kainat/.local/lib/python3.10/site-packages/onnxruntime/capi/onnxruntime_inference_collection.py", line 331, in run_with_iobinding
    self._sess.run_with_iobinding(iobinding._iobinding, run_options)
RuntimeError: Error in execution: Non-zero status code returned while running RotaryEmbedding node. Name:'RotaryEmbedding_0' Status Message: Input 'x' is expected to have 3 dimensions, got 4

with the following inference code:

"from transformers import LlamaConfig, LlamaTokenizer
from optimum.onnxruntime import ORTModelForCausalLM
import torch

# User settings
model_name = "riazk/llama2-13b-merged-peft_kk"
onnx_model_dir = "./onnxruntime/onnxruntime/python/tools/transformers/llama2-13b-merged-int4-gpu/"
cache_dir = "./onnxruntime/onnxruntime/python/tools/transformers/model_cache"

device_id = 0
device = torch.device(f"cuda:{device_id}")  # Change to torch.device("cpu") if running on CPU

ep = "CUDAExecutionProvider"  # change to CPUExecutionProvider if running on CPU
ep_options = {"device_id": device_id}

prompt = ["ONNX Runtime is ", "I want to book a vacation to Hawaii. First, I need to ", "A good workout routine is ", "How are astronauts launched into space? "]
max_length = 64  # max(prompt length + generation length)

config = LlamaConfig.from_pretrained(model_name, use_auth_token=True, cache_dir=cache_dir)
config.save_pretrained(onnx_model_dir)  # Save config file in ONNX model directory
tokenizer = LlamaTokenizer.from_pretrained(model_name, use_auth_token=True, cache_dir=cache_dir)
tokenizer.pad_token = "[PAD]"

model = ORTModelForCausalLM.from_pretrained(
    onnx_model_dir,
    use_auth_token=True,
    use_io_binding=True,
    provider=ep,
    provider_options={"device_id": device_id}  # comment out if running on CPU
)
inputs = tokenizer(prompt, return_tensors="pt", padding=True).to(device)  # tokenize the prompts and move them to the target device

print("-------------")
generate_ids = model.generate(**inputs, do_sample=False, max_length=max_length)
transcription = tokenizer.batch_decode(generate_ids, skip_special_tokens=True)
print(transcription)
print("-------------")"

I have tried the same conversion and inference scripts with the llama-2-13b-hf model and they worked perfectly. So I was wondering: is this error because the conversion script is only set up for the base Llama model and not the Llama "chat" model or a fine-tuned Llama 2 model? Or is there a solution for the error I'm facing?
I would love any help from someone who has converted a fine-tuned Llama 2 model into an ONNX model and got it working.
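In case it helps with debugging, this is how I inspected what the exported graph expects for each input (a minimal sketch; the .onnx path below is a placeholder for whatever file the conversion script produced):

import onnxruntime as ort

# Open the exported model on CPU just to read its input metadata.
sess = ort.InferenceSession(
    "path/to/exported_llama2.onnx",  # placeholder path
    providers=["CPUExecutionProvider"],
)
for inp in sess.get_inputs():
    # prints e.g. input_ids [batch_size, sequence_length] tensor(int64)
    print(inp.name, inp.shape, inp.type)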

That's surprising, as the chat model has exactly the same architecture, just with different weights.
Can you try to export it with Optimum as described here?
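Something along these lines, for example (an untested sketch; adjust the model id and output directory to your setup):

from optimum.onnxruntime import ORTModelForCausalLM
from transformers import AutoTokenizer

model_id = "meta-llama/Llama-2-13b-chat-hf"
output_dir = "./llama2-13b-chat-onnx"

# export=True converts the PyTorch checkpoint to ONNX on the fly
ort_model = ORTModelForCausalLM.from_pretrained(model_id, export=True)
ort_model.save_pretrained(output_dir)

# keep the tokenizer alongside the exported model
tokenizer = AutoTokenizer.from_pretrained(model_id)
tokenizer.save_pretrained(output_dir)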

I was able to convert it with the Optimum export as well, but the convert_to_onnx.py route felt easier since it can optimize/quantize in one script. Anyway, I found another library, microsoft/Olive, and it is working with my custom QLoRA fine-tuned Llama 2. Thanks!

I’m seeing the same issue trying to run the example from microsoft/Olive (Olive/examples/llama2 at main · microsoft/Olive · GitHub)

When I run python llama2.py --model_name meta-llama/Llama-2-7b-chat-hf from the above directory, I get more or less the same error:

[...]
================ Diagnostic Run torch.onnx.export version 2.0.1 ================
verbose: False, log level: Level.ERROR
======================= 0 NONE 0 NOTE 0 WARNING 0 ERROR ========================


[2024-01-30 14:46:05,952] [INFO] [engine.py:929:_run_pass] Running pass transformers_optimization_fp32:OrtTransformersOptimization



2024-01-30 14:50:02.177478041 [E:onnxruntime:, sequential_executor.cc:514 ExecuteKernel] Non-zero status code returned while running RotaryEmbedding node. Name:'RotaryEmbedding_1' Status Message: Input 'x' is expected to have 3 dimensions, got 4
[2024-01-30 14:50:02,433] [WARNING] [engine.py:436:run_accelerator] Failed to run Olive on cpu-cpu: Error in execution: Non-zero status code returned while running RotaryEmbedding node. Name:'RotaryEmbedding_1' Status Message: Input 'x' is expected to have 3 dimensions, got 4
Traceback (most recent call last):
  File "/root/micromamba/envs/gen-ai/lib/python3.10/site-packages/olive/engine/engine.py", line 425, in run_accelerator
    return self.run_search(
  File "/root/micromamba/envs/gen-ai/lib/python3.10/site-packages/olive/engine/engine.py", line 589, in run_search
    should_prune, signal, model_ids = self._run_passes(
  File "/root/micromamba/envs/gen-ai/lib/python3.10/site-packages/olive/engine/engine.py", line 908, in _run_passes
    signal = self._evaluate_model(model_config, model_id, data_root, evaluator_config, accelerator_spec)
  File "/root/micromamba/envs/gen-ai/lib/python3.10/site-packages/olive/engine/engine.py", line 1095, in _evaluate_model
    signal = self.target.evaluate_model(model_config, data_root, metrics, accelerator_spec)
  File "/root/micromamba/envs/gen-ai/lib/python3.10/site-packages/olive/systems/local.py", line 47, in evaluate_model
    return evaluator.evaluate(model, data_root, metrics, device=device, execution_providers=execution_providers)
  File "/root/micromamba/envs/gen-ai/lib/python3.10/site-packages/olive/evaluator/olive_evaluator.py", line 176, in evaluate
    metrics_res[metric.name] = self._evaluate_latency(
  File "/root/micromamba/envs/gen-ai/lib/python3.10/site-packages/olive/evaluator/olive_evaluator.py", line 711, in _evaluate_latency
    return self._evaluate_onnx_latency(model, metric, dataloader, post_func, device, execution_providers)
  File "/root/micromamba/envs/gen-ai/lib/python3.10/site-packages/olive/evaluator/olive_evaluator.py", line 486, in _evaluate_onnx_latency
    session.run_with_iobinding(io_bind_op)
  File "/root/micromamba/envs/gen-ai/lib/python3.10/site-packages/onnxruntime/capi/onnxruntime_inference_collection.py", line 331, in run_with_iobinding
    self._sess.run_with_iobinding(iobinding._iobinding, run_options)
RuntimeError: Error in execution: Non-zero status code returned while running RotaryEmbedding node. Name:'RotaryEmbedding_1' Status Message: Input 'x' is expected to have 3 dimensions, got 4

Any idea what is causing this?