Convert GPT-j to FP-16 Onnx

pankajdev007 · January 24, 2023, 12:54pm

Hi, I want to convert the GPT-J model to ONNX to improve inference speed. I tried converting the model to ONNX, but it did not fit into RAM, so I need to convert it to fp16. I tried the Optimum optimizer, but it says graph optimization is not supported for GPT-J.

Here is the command I used to convert it:
python -m optimum.exporters.onnx --task causal-lm-with-past --for-ort --model EleutherAI/gpt-j-6B gptj_onnx/

Can anyone help with this?

@fxmarty can you help? I got the idea to convert it to ONNX from your answer on this post:

Any help would be a great push!

fxmarty · January 26, 2023, 3:19pm

@pankajdev007 Thanks for trying it out! I’ll have a look shortly!

pankajdev007 · January 27, 2023, 9:32am

One thing I also noticed while doing that: if the model size is 5GB (e.g. GPT Neo 1.3B), the converted ONNX model takes up to 2.5 times that much VRAM during inference, which is too high. So if I try to run GPT-J, it takes 50-60GB of RAM to run inference. Is there any way around this, or am I doing something wrong?

I want to reduce the latency for GPT-J, as it is currently slow even on GPU when generating 400-500 tokens!

fxmarty · February 6, 2023, 2:39pm

Hi @pankajdev007, right, this is not ideal. Currently the memory is duplicated for decoder models, because there is one ONNX model that does not use the past key/values (for the first decoding iteration) and another ONNX model that does use them.
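As a rough sanity check on the numbers reported above, the weight memory can be estimated from the parameter count and the bytes per parameter. The sketch below is only back-of-the-envelope arithmetic: the parameter counts are rounded assumptions, `weights_gb` is a hypothetical helper, and activations, the key/value cache, and runtime overhead are ignored. Under those assumptions, two float32 copies of GPT-J come to about 48 GB, which is in the same range as the 50-60GB observed, while a single float16 copy would be about 12 GB.

```python
# Back-of-the-envelope weight memory, in GB (1 GB = 1e9 bytes).
# Parameter counts are rounded assumptions, not measured values.
BYTES_PER_PARAM = {"float32": 4, "float16": 2}


def weights_gb(num_params, dtype, copies=1):
    """Memory needed for `copies` full sets of weights stored as `dtype`."""
    return num_params * BYTES_PER_PARAM[dtype] * copies / 1e9


GPT_NEO_PARAMS = 1.3e9  # assumption: roughly 1.3 billion parameters
GPT_J_PARAMS = 6.0e9    # assumption: roughly 6 billion parameters

for name, n in [("GPT-Neo 1.3B", GPT_NEO_PARAMS), ("GPT-J 6B", GPT_J_PARAMS)]:
    single_fp32 = weights_gb(n, "float32")
    duplicated_fp32 = weights_gb(n, "float32", copies=2)  # two decoder ONNX files
    single_fp16 = weights_gb(n, "float16")
    print(
        f"{name}: fp32 x1 = {single_fp32:.1f} GB, "
        f"fp32 x2 = {duplicated_fp32:.1f} GB, "
        f"fp16 x1 = {single_fp16:.1f} GB"
    )
```

The estimate only covers weights, so real usage will be higher, but it shows why loading the weights twice roughly doubles the footprint.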

This PR should be merged soon and should fix the issue: Enable inference with a merged decoder in `ORTModelForCausalLM` by JingyaHuang · Pull Request #647 · huggingface/optimum · GitHub

Additionally, I added support for exporting models directly in float16, by passing --fp16 --device cuda to the ONNX exporter: Support ONNX export on `torch.float16` type by fxmarty · Pull Request #749 · huggingface/optimum · GitHub

Hopefully we will soon have a release that includes these two PRs!
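For illustration, the original export command combined with these flags could be assembled as in the sketch below. This is an assumption-laden example, not a confirmed recipe: `build_export_command` is a hypothetical helper, the flags are simply those quoted in this thread, and whether they all work together depends on having an Optimum version that includes PR #749. The code only builds and prints the command; it does not run the export.

```python
import shlex


def build_export_command(model_id, output_dir, fp16=True):
    """Return the exporter invocation as an argument list (nothing is executed)."""
    cmd = [
        "python", "-m", "optimum.exporters.onnx",
        "--task", "causal-lm-with-past",
        "--for-ort",
        "--model", model_id,
    ]
    if fp16:
        # Flags quoted from this thread; assumed to require an Optimum
        # version that includes PR #749.
        cmd += ["--fp16", "--device", "cuda"]
    cmd.append(output_dir)
    return cmd


cmd = build_export_command("EleutherAI/gpt-j-6B", "gptj_onnx/")
print(shlex.join(cmd))
```

To actually run the export, the list could be passed to `subprocess.run(cmd, check=True)` on a machine with a CUDA GPU.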

silvacarl · March 10, 2023, 11:21pm

Were you able to get this to work?
