Can I convert a Llama 2 "Chat" model to ONNX using the llama/convert_to_onnx.py script?

I have converted the llama-2-13b-chat-hf model into an ONNX model using the convert_to_onnx.py script from onnxruntime/onnxruntime/python/tools/transformers/models/llama (main branch of microsoft/onnxruntime on GitHub).

I just had to make one change in the script to upgrade the ONNX opset version from 13 to 14.
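The edit was essentially just raising the opset that gets passed to the exporter. A minimal sketch of that kind of change, assuming the script ultimately goes through torch.onnx.export (the function and variable names below are illustrative, not the script's actual ones):

import torch

# Illustrative only: the real convert_to_onnx.py wraps this differently,
# but the edit amounts to requesting opset 14 instead of 13 here.
def export_decoder(model, example_inputs, output_path):
    torch.onnx.export(
        model,
        example_inputs,
        output_path,
        opset_version=14,  # was 13 in the unmodified script
        do_constant_folding=True,
    )

But now, at inference time, I'm seeing this error: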

[E:onnxruntime:, sequential_executor.cc:514 ExecuteKernel] Non-zero status code returned while running RotaryEmbedding node. Name:'RotaryEmbedding_0' Status Message: Input 'x' is expected to have 3 dimensions, got 4
Exception in thread Thread-5 (generate):
Traceback (most recent call last):
  File "/usr/lib/python3.10/threading.py", line 1016, in _bootstrap_inner
    self.run()
  File "/usr/lib/python3.10/threading.py", line 953, in run
    self._target(*self._args, **self._kwargs)
  File "/home/kainat/.local/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/home/kainat/.local/lib/python3.10/site-packages/transformers/generation/utils.py", line 1764, in generate
    return self.sample(
  File "/home/kainat/.local/lib/python3.10/site-packages/transformers/generation/utils.py", line 2861, in sample
    outputs = self(
  File "/home/kainat/.local/lib/python3.10/site-packages/optimum/modeling_base.py", line 90, in __call__
    return self.forward(*args, **kwargs)
  File "/home/kainat/.local/lib/python3.10/site-packages/optimum/onnxruntime/modeling_decoder.py", line 255, in forward
    self.model.run_with_iobinding(io_binding)
  File "/home/kainat/.local/lib/python3.10/site-packages/onnxruntime/capi/onnxruntime_inference_collection.py", line 331, in run_with_iobinding
    self._sess.run_with_iobinding(iobinding._iobinding, run_options)
RuntimeError: Error in execution: Non-zero status code returned while running RotaryEmbedding node. Name:'RotaryEmbedding_0' Status Message: Input 'x' is expected to have 3 dimensions, got 4

with this code:

"from transformers import LlamaConfig, LlamaTokenizer
from optimum.onnxruntime import ORTModelForCausalLM
import torch

# User settings
model_name = "riazk/llama2-13b-merged-peft_kk"
onnx_model_dir = "./onnxruntime/onnxruntime/python/tools/transformers/llama2-13b-merged-int4-gpu/"
cache_dir = "./onnxruntime/onnxruntime/python/tools/transformers/model_cache"

device_id = 0
device = torch.device(f"cuda:{device_id}")  # Change to torch.device("cpu") if running on CPU

ep = "CUDAExecutionProvider"  # change to CPUExecutionProvider if running on CPU
ep_options = {"device_id": device_id}

prompt = ["ONNX Runtime is ", "I want to book a vacation to Hawaii. First, I need to ", "A good workout routine is ", "How are astronauts launched into space? "]
max_length = 64  # max(prompt length + generation length)

config = LlamaConfig.from_pretrained(model_name, use_auth_token=True, cache_dir=cache_dir)
config.save_pretrained(onnx_model_dir)  # Save config file in ONNX model directory
tokenizer = LlamaTokenizer.from_pretrained(model_name, use_auth_token=True, cache_dir=cache_dir)
tokenizer.pad_token = "[PAD]"

model = ORTModelForCausalLM.from_pretrained(
    onnx_model_dir,
    use_auth_token=True,
    use_io_binding=True,
    provider=ep,
    provider_options={"device_id": device_id}  # comment out if running on CPU
)
# inputs = tokenizer(prompt, return_tensors="pt", padding=True).to(device)

print("-------------")
generate_ids = model.generate(**inputs, do_sample=False, max_length=max_length)
transcription = tokenizer.batch_decode(generate_ids, skip_special_tokens=True)
print(transcription)
print("-------------")"

I have tried the same conversion and inference script with the llama-2-13b-hf model and it worked perfectly, so I was wondering: is this error because the conversion script is only optimized for the base Llama model and not for the Llama "chat" model or a fine-tuned Llama 2 model? Or is there a solution for the error I'm facing?
I would love any help from anyone who has converted a fine-tuned Llama 2 model to ONNX and gotten it working.
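For reference, this is how I sanity-checked what shapes the exported graph actually declares, using the standard onnx / onnxruntime APIs (the model path below is a placeholder for whatever convert_to_onnx.py wrote out):

import onnx
import onnxruntime as ort

# Placeholder path: point this at the .onnx file the conversion script produced.
onnx_path = "./llama2-13b-merged-int4-gpu/model.onnx"

# Declared shape of every graph input (input_ids, attention_mask, past KV, ...).
sess = ort.InferenceSession(onnx_path, providers=["CPUExecutionProvider"])
for inp in sess.get_inputs():
    print(inp.name, inp.shape)

# The RotaryEmbedding nodes and their inputs, since that is the op that fails.
graph = onnx.load(onnx_path).graph
for node in graph.node:
    if node.op_type == "RotaryEmbedding":
        print(node.name, list(node.input))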


That's surprising, as the chat model has exactly the same architecture, just different weights.
Can you try to export it with Optimum as described here?
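Roughly along these lines, a minimal sketch of the Optimum export path (the output directory name is just an example):

from transformers import AutoTokenizer
from optimum.onnxruntime import ORTModelForCausalLM

model_id = "meta-llama/Llama-2-13b-chat-hf"

# export=True makes Optimum run the ONNX export on the fly when loading.
model = ORTModelForCausalLM.from_pretrained(model_id, export=True, use_auth_token=True)
tokenizer = AutoTokenizer.from_pretrained(model_id, use_auth_token=True)

# Save the exported model so it can be reloaded later without re-exporting.
model.save_pretrained("./llama2-13b-chat-onnx")
tokenizer.save_pretrained("./llama2-13b-chat-onnx")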

I was able to convert it with the Optimum export as well, but this way felt easier since it can optimize/quantize in one script. Anyway, I found another library, microsoft/Olive, and it works with my custom QLoRA fine-tuned Llama 2. Thanks!

I’m seeing the same issue trying to run the example from microsoft/Olive (Olive/examples/llama2 at main · microsoft/Olive · GitHub)

When I run python llama2.py --model_name meta-llama/Llama-2-7b-chat-hf from the above directory, I get more or less the same error:

[...]
================ Diagnostic Run torch.onnx.export version 2.0.1 ================
verbose: False, log level: Level.ERROR
======================= 0 NONE 0 NOTE 0 WARNING 0 ERROR ========================


[2024-01-30 14:46:05,952] [INFO] [engine.py:929:_run_pass] Running pass transformers_optimization_fp32:OrtTransformersOptimization



2024-01-30 14:50:02.177478041 [E:onnxruntime:, sequential_executor.cc:514 ExecuteKernel] Non-zero status code returned while running RotaryEmbedding node. Name:'RotaryEmbedding_1' Status Message: Input 'x' is expected to have 3 dimensions, got 4
[2024-01-30 14:50:02,433] [WARNING] [engine.py:436:run_accelerator] Failed to run Olive on cpu-cpu: Error in execution: Non-zero status code returned while running RotaryEmbedding node. Name:'RotaryEmbedding_1' Status Message: Input 'x' is expected to have 3 dimensions, got 4
Traceback (most recent call last):
  File "/root/micromamba/envs/gen-ai/lib/python3.10/site-packages/olive/engine/engine.py", line 425, in run_accelerator
    return self.run_search(
  File "/root/micromamba/envs/gen-ai/lib/python3.10/site-packages/olive/engine/engine.py", line 589, in run_search
    should_prune, signal, model_ids = self._run_passes(
  File "/root/micromamba/envs/gen-ai/lib/python3.10/site-packages/olive/engine/engine.py", line 908, in _run_passes
    signal = self._evaluate_model(model_config, model_id, data_root, evaluator_config, accelerator_spec)
  File "/root/micromamba/envs/gen-ai/lib/python3.10/site-packages/olive/engine/engine.py", line 1095, in _evaluate_model
    signal = self.target.evaluate_model(model_config, data_root, metrics, accelerator_spec)
  File "/root/micromamba/envs/gen-ai/lib/python3.10/site-packages/olive/systems/local.py", line 47, in evaluate_model
    return evaluator.evaluate(model, data_root, metrics, device=device, execution_providers=execution_providers)
  File "/root/micromamba/envs/gen-ai/lib/python3.10/site-packages/olive/evaluator/olive_evaluator.py", line 176, in evaluate
    metrics_res[metric.name] = self._evaluate_latency(
  File "/root/micromamba/envs/gen-ai/lib/python3.10/site-packages/olive/evaluator/olive_evaluator.py", line 711, in _evaluate_latency
    return self._evaluate_onnx_latency(model, metric, dataloader, post_func, device, execution_providers)
  File "/root/micromamba/envs/gen-ai/lib/python3.10/site-packages/olive/evaluator/olive_evaluator.py", line 486, in _evaluate_onnx_latency
    session.run_with_iobinding(io_bind_op)
  File "/root/micromamba/envs/gen-ai/lib/python3.10/site-packages/onnxruntime/capi/onnxruntime_inference_collection.py", line 331, in run_with_iobinding
    self._sess.run_with_iobinding(iobinding._iobinding, run_options)
RuntimeError: Error in execution: Non-zero status code returned while running RotaryEmbedding node. Name:'RotaryEmbedding_1' Status Message: Input 'x' is expected to have 3 dimensions, got 4

Any idea what is causing this?
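In case it helps narrow things down, here is how I checked which onnxruntime build is in use and what the failing node looks like in the model Olive produced (the model path is a placeholder; the actual output location depends on the Olive run):

import onnx
import onnxruntime as ort

print("onnxruntime version:", ort.__version__)

# Placeholder path: point this at the ONNX model from the failing Olive pass.
model = onnx.load("path/to/olive/output/model.onnx")
for node in model.graph.node:
    if node.op_type == "RotaryEmbedding":
        print(node.name, "inputs:", list(node.input),
              "attrs:", [a.name for a in node.attribute])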