ORT CLI vs. Programmatic

Hi,

As an experiment, I am running a conversion/optimization from PyTorch to ONNX with dolly-v2-3b. The CLI approach ran successfully at optimization level O3 on a Dell 7540 (Xeon E-2286M x 16; 64GB memory) and resulted in a 15% size reduction compared to an unoptimized conversion.

When attempting the same via a programmatic approach, the program gets killed by the kernel as it exceeds the available memory resources (100% RAM and swap). This happens when the optimize() method of ORTOptimizer is called; the last output seen is "Optimizing model…". The configuration was generated with AutoOptimizationConfig.O3().

With the CLI execution, memory use never exceeds 80%, with about 25% of swap. Hence, I have a few questions (a minimal sketch of both paths follows below):

Is there a guide on the resources required for programmatic versus CLI ORT Optimum execution?
What makes the API-based optimization require more resources?
Both cases use the same optimization level (O3); are more optimization techniques applied with the programmatic approach than with the CLI approach?
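
For concreteness, here is a minimal sketch of the two paths being compared (simplified from the full script shared further down the thread; `<output_dir>` is a placeholder):

```
# Minimal sketch of the two paths compared (simplified from the full script below).
#
# CLI path:
#   optimum-cli export onnx -m databricks/dolly-v2-3b --task text-generation-with-past \
#       --framework pt --no-post-process --optimize O3 <output_dir>
#
# Programmatic path:
from optimum.onnxruntime import AutoOptimizationConfig, ORTModelForCausalLM, ORTOptimizer

model = ORTModelForCausalLM.from_pretrained("databricks/dolly-v2-3b", export=True)
optimizer = ORTOptimizer.from_pretrained(model)
optimization_config = AutoOptimizationConfig.O3()
#the process is killed by the kernel inside this call
optimizer.optimize(save_dir="<output_dir>", optimization_config=optimization_config)
```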

Thanks,
Borell

Hi Borell! Could you share the CLI commands and the script you used to compare both approaches please?

Hi regisss,

Thanks for the reply; I can certainly send you the code. I wrote a Python module that runs either the CLI or, optionally, the programmatic code. There doesn't seem to be an option to attach files. I can paste the code in a reply if that is OK; it is not very long.

Sorry, it's my first time using this forum.

Borell

No worries, it's even better to share the code directly here, surrounding your code snippets with ``` :slight_smile:

regisss,

Here is the code. To run the CLI path, change cliBool = False to True and adjust the location path model_save. The size comparison is then done manually by checking the size of the files created at model_save.

The ultimate objective is not Dolly but something like a Flan-T5 XXX. I am exploring the feasibility of using ORT. I attempted this earlier with OpenVINO, but their implementation crashed; it is now being evaluated by Intel developers after they confirmed the issue with their own sample code.

```
#!/usr/bin/env python

#standard imports
import os
import subprocess
from pathlib import Path

#third-party imports
from transformers import AutoTokenizer
from optimum.onnxruntime import ORTModelForCausalLM
from optimum.pipelines import pipeline
import torch
#from tensorflow.python.ops.initializers_ns import variables #unused, left over from the IDE
#from optimum.onnxruntime import OptimizationConfig, ORTOptimizer #deferred to optimization

cliBool = False

#getting from huggingface.co
model_remote = 'databricks/dolly-v2-3b'
model_path = Path('~/.cache/huggingface/hub/models--databricks--dolly-v2-3b').expanduser()
if model_path.exists():
    print("************* 1: Getting local PyTorch Model")
    #... from local cache; the subprocess does not expand the user path; also needs to point at the config.json location
    model_path = model_path / 'snapshots/f6c9be08f16fe4d3a719bee0a4a7c7415b5c65df'
    #saving to: /home/jellybean/.cache/huggingface/hub/models--databricks--dolly-v2-3b/snapshots/f6c9be08f16fe4d3a719bee0a4a7c7415b5c65df/optimum/onnx_rt
    model_save = model_path / 'optimum/onnx_rt_optimized'
else:
    print("************* 1: Getting Hugging Face PyTorch Model")
    model_path = model_remote
    #manually move it to the cache snapshot once created
    model_save = Path('~/.cache/huggingface/hub/').expanduser() / 'optimum/onnx_rt'

#GPU does not have enough memory: the following reports before the ONNX conversion starts
#mem = torch.cuda.mem_get_info()
#print(f"CUDA Memory - Available: {mem[0]} Total: {mem[1]}")
#print(f"CUDA Memory - Total: {torch.cuda.get_device_properties(0).total_memory} Reserved: {torch.cuda.memory_reserved(0)} Allocated: {torch.cuda.memory_allocated(0)}")
#print(f"{torch.cuda.memory_summary(device=0, abbreviated=True)}")

#check the file exists so we don't convert again
config = model_save / 'config.json'

'''torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 76.00 MiB...
... (GPU 0; 5.79 GiB total capacity; 5.26 GiB already allocated; 69.50 MiB free;...
... 5.28 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory...
... try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF...
... but os.environ['PYTORCH_CUDA_ALLOC_CONF'] = 'max_split_size_mb:21' still gets OutOfMemoryError
cli = f'optimum-cli export onnx -m {model_path} --cache_dir {model_save} --task text-generation '
cli += f'--framework pt --device cuda --optimize O4 --batch_size 32 {model_save}'
Cause: --device cuda --optimize O4
'''
if not config.exists():
    if cliBool:
        print("************* 2: CLI-ONNX Convert/Optimize")
        #switching to CPU; need to specify --task if not from huggingface; optimizing to level O3...
        #... export dolly to onnx: > optimum-cli export onnx --help
        cli = f'optimum-cli export onnx -m {model_path} --task text-generation-with-past --framework pt --no-post-process --optimize O3 {model_save}'
        with subprocess.Popen([cli], shell=True, stdout=subprocess.PIPE, stdin=subprocess.PIPE) as proc:
            _, _ = proc.communicate()
    else:
        print("************* 2: Programmatic-ONNX Convert/Optimize")
        from optimum.onnxruntime import AutoOptimizationConfig, ORTOptimizer

        print("************* 2.a")
        model = ORTModelForCausalLM.from_pretrained(model_path, export=True)
        print("************* 2.b")
        optimizer = ORTOptimizer.from_pretrained(model)
        print("************* 2.c")
        #optimizing to level O3
        optimization_config = AutoOptimizationConfig.O3()
        print(optimization_config)
        raise SystemExit #debug stop left in from testing; remove to reach step 2.d
        print("************* 2.d")
        optimizer.optimize(save_dir=model_save, optimization_config=optimization_config)

else: print("************* 2: Skipping ONNX Conversion/Optimization")

#inference using the PyTorch-to-ONNX model
print("************* 3: ONNX Model Inference")
tokenizer = AutoTokenizer.from_pretrained(model_save)
model_ort = ORTModelForCausalLM.from_pretrained(model_save)
ort_pipe = pipeline('text-generation', model=model_ort, tokenizer=tokenizer, accelerator='ort', framework='pt', device=-1, model_kwargs={"load_in_8bit": True})
#try 4 different prompts
prompt1 = ['Explain to me what is love.', 'You are an idiot'] #produces a list of two lists with dicts
prompt2 = ['Explain to me what is hate.'] #produces a list of one list with a dict
prompt3 = 'Do you love yourself?.' #produces a list with a dict
prompt4 = 'I hate you!' #produces a list with a dict
prompts = [prompt1, prompt2, prompt3, prompt4]
for prompt in prompts:
    res = ort_pipe(prompt)
    print(f"Prompt: {prompt}")
    for item in res:
        next_item = item
        while isinstance(next_item, list): next_item = next_item[0] #loop until no list (looking for the dict)
        print(f"\n{next_item['generated_text']}")
    else: pass
else: pass
```

Here is the code again, this time without getting mangled.

```
#!/usr/bin/env python
#standard imports
import os
import subprocess
from pathlib import Path

devs = subprocess.Popen('lspci -nnk | grep -iA2 vga', shell=True, stdout=subprocess.PIPE).stdout.read().decode('utf-8').split('\n')
print()
for dev in devs: print('Device:', dev)
print("Info on devices not being used.\n")

#third-party imports
from transformers import AutoTokenizer
from optimum.onnxruntime import ORTModelForCausalLM
from optimum.pipelines import pipeline
import torch
#from tensorflow.python.ops.initializers_ns import variables #unused, left over from the IDE
#from optimum.onnxruntime import OptimizationConfig, ORTOptimizer #deferred to optimization

cliBool = False

#getting from huggingface.co
model_remote = 'databricks/dolly-v2-3b'
model_path = Path('~/.cache/huggingface/hub/models--databricks--dolly-v2-3b').expanduser()
if model_path.exists():
    print("************* 1: Getting local PyTorch Model")
    #... from local cache; the subprocess does not expand the user path; also needs to point at the config.json location
    model_path = model_path / 'snapshots/f6c9be08f16fe4d3a719bee0a4a7c7415b5c65df'
    #saving to: /home/jellybean/.cache/huggingface/hub/models--databricks--dolly-v2-3b/snapshots/f6c9be08f16fe4d3a719bee0a4a7c7415b5c65df/optimum/onnx_rt
    model_save = model_path / 'optimum/onnx_rt_optimized'
else:
    print("************* 1: Getting Hugging Face PyTorch Model")
    model_path = model_remote
    #manually move it to the cache snapshot once created
    model_save = Path('~/.cache/huggingface/hub/').expanduser() / 'optimum/onnx_rt'

#check the file exists so we don't convert again
config = model_save / 'config.json'

if not config.exists():
    if cliBool:
        print("************* 2: CLI-ONNX Convert/Optimize")
        #switching to CPU; need to specify --task if not from huggingface; optimizing to level O3...
        #... export dolly to onnx: > optimum-cli export onnx --help
        cli = f'optimum-cli export onnx -m {model_path} --task text-generation-with-past --framework pt --no-post-process --optimize O3 {model_save}'
        with subprocess.Popen([cli], shell=True, stdout=subprocess.PIPE, stdin=subprocess.PIPE) as proc:
            _, _ = proc.communicate()
    else:
        print("************* 2: Programmatic-ONNX Convert/Optimize")
        from optimum.onnxruntime import AutoOptimizationConfig, ORTOptimizer

        print("************* 2.a")
        model = ORTModelForCausalLM.from_pretrained(model_path, export=True)
        print("************* 2.b")
        optimizer = ORTOptimizer.from_pretrained(model)
        print("************* 2.c")
        #optimizing to level O3
        optimization_config = AutoOptimizationConfig.O3()
        print(optimization_config)
        print("************* 2.d")
        optimizer.optimize(save_dir=model_save, optimization_config=optimization_config)

else: print("************* 2: Skipping ONNX Conversion/Optimization")

#inference using the PyTorch-to-ONNX model
print("************* 3: ONNX Model Inference")
tokenizer = AutoTokenizer.from_pretrained(model_save)
model_ort = ORTModelForCausalLM.from_pretrained(model_save)
ort_pipe = pipeline('text-generation', model=model_ort, tokenizer=tokenizer, accelerator='ort', framework='pt', device=-1, model_kwargs={"load_in_8bit": True})
#try 4 different prompts
prompt1 = ['Explain to me what is love.', 'You are an idiot'] #produces a list of two lists with dicts
prompt2 = ['Explain to me what is hate.'] #produces a list of one list with a dict
prompt3 = 'Do you love yourself?.' #produces a list with a dict
prompt4 = 'I hate you!' #produces a list with a dict
prompts = [prompt1, prompt2, prompt3, prompt4]
for prompt in prompts:
    res = ort_pipe(prompt)
    print(f"Prompt: {prompt}")
    for item in res:
        next_item = item
        while isinstance(next_item, list): next_item = next_item[0] #loop until no list (looking for the dict)
        print(f"\n{next_item['generated_text']}")
    else: pass
else: pass
```


Hi, one difference between the export CLI and ORTModel.from_pretrained(..., export=True) is that, by default, the CLI merges the decoder without KV cache and the decoder with KV cache into a single model.

At the time, this feature was quite experimental, so we did not make it the default in the from_pretrained export.

You can try to pass use_merged=True to from_pretrained to lower the memory usage.
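
Concretely, something like this (a sketch adapting the export line from your script; model_path as defined there):

```
from optimum.onnxruntime import ORTModelForCausalLM

#sketch: export with the merged decoder (decoders without and with KV cache fused into one)
model = ORTModelForCausalLM.from_pretrained(model_path, export=True, use_merged=True)
```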

For reference, see as well the note about the PyTorch version here: Release v1.9: extended ONNX, ONNX Runtime support · huggingface/optimum · GitHub

Hi fxmarty,

Thanks for the advice. I added `use_merged=True`. During the execution of from_pretrained(…), the following error shows:
```
Traceback (most recent call last):
  File "/home/jellybean/workspace/careless_navigator/src/optimum/./dolly_ort_optimum.py", line 87, in <module>
    model = ORTModelForCausalLM.from_pretrained(model_path, export=True, use_merged=True)
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/jellybean/.virtualenvs/careless_navigator/lib/python3.11/site-packages/optimum/onnxruntime/modeling_ort.py", line 646, in from_pretrained
    return super().from_pretrained(
           ^^^^^^^^^^^^^^^^^^^^^^^^
  ...
  File "/home/jellybean/.virtualenvs/careless_navigator/lib64/python3.11/site-packages/torch/onnx/_internal/onnx_proto_utils.py", line 175, in _export_file
    opened_file.write(model_bytes)
OSError: [Errno 28] No space left on device
```

BTW, this happens with use_merged=False as well. Something has changed since the last time I ran the same code; before, I was able to get to the optimizer.optimize(…) step without this error showing. I have moved the model location to a 3TB disk with 1.8TB free, with the same result.
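
In case it is relevant, here is a sketch of how I understand the caches and temporary files could be redirected to the larger disk (an assumption on my part that the export honors the standard HF_HOME and TMPDIR environment variables; the mount point below is hypothetical):

```
import os

#assumption: set these before any Hugging Face / torch imports so cache and temp
#writes land on the larger disk; /mnt/bigdisk is a hypothetical mount point
os.environ["HF_HOME"] = "/mnt/bigdisk/huggingface"
os.environ["TMPDIR"] = "/mnt/bigdisk/tmp"
```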

I verified that the CLI execution still runs correctly and produces an optimized ORT model.

Regardless, when I look at the source code for the ORTOptimizer class, line 101, `if model_or_path.use_merged is True:` would raise NotImplementedError.
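
If I read that source correctly, a merged export would have to be guarded against before building the optimizer, roughly like this (a sketch based on my reading; `model` is the ORTModelForCausalLM loaded in my script):

```
from optimum.onnxruntime import ORTOptimizer

#sketch: a merged decoder is rejected by ORTOptimizer, so check the flag first
if getattr(model, "use_merged", False):
    raise NotImplementedError("ORTOptimizer does not support merged decoders; re-export with use_merged=False")
optimizer = ORTOptimizer.from_pretrained(model)
```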

Best,
Borell

Hi,

I am wondering whether this issue is being looked at by anyone, or whether it should be closed without resolution. I will presume that only the CLI version is a workable solution.

Best,
Borell

Hi Borell, that should not be the case. Can you open an issue on GitHub with a reproduction?

@fxmarty,
Happy to do that. Can you please clarify which GitHub repo to post it in?

@Borell Cool! You can open an issue in the GitHub repo of Optimum: Issues · huggingface/optimum · GitHub

@fxmarty, @regisss,

While trying to replicate the issue for posting on GitHub, I made a change in the code that seems to have resolved the problem. In the following line,

model = ORTModelForCausalLM.from_pretrained(model_path, export=True, use_merged=True)

I changed use_merged to False. It had originally been set to True following the suggestion to lower the memory usage. I don't fully understand the reason, but both the CLI and programmatic executions now work identically.
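
In other words, the programmatic path that now works for me is essentially the following (model_path and model_save as in the script above):

```
from optimum.onnxruntime import AutoOptimizationConfig, ORTModelForCausalLM, ORTOptimizer

#export without the merged decoder, then optimize at level O3
model = ORTModelForCausalLM.from_pretrained(model_path, export=True, use_merged=False)
optimizer = ORTOptimizer.from_pretrained(model)
optimizer.optimize(save_dir=model_save, optimization_config=AutoOptimizationConfig.O3())
```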

The original model store size of 144.4GB was reduced to 22.2GB.

I believe this issue can be closed.

Thanks,
Borell