Convert GPT-J to FP16 ONNX

Hi, I want to convert the GPT-J model to ONNX to improve inference speed. I converted the model to ONNX, but it does not fit into RAM, so I need to convert it to fp16. I tried the optimum optimizer, but it says graph optimization is not supported for GPT-J.

Here is the command I used to convert it:
python -m optimum.exporters.onnx --task causal-lm-with-past --for-ort --model EleutherAI/gpt-j-6B gptj_onnx/

Can anyone help in this regard?

@fxmarty can you help? I got the idea to convert it to ONNX from your answer on this post:

Any help would be greatly appreciated!

@pankajdev007 Thanks for trying it out! I’ll have a look shortly!


One thing I also noticed while doing this: if the model size is 5 GB (e.g. GPT-Neo 1.3B), the converted ONNX model takes up to 2.5 times that in VRAM during inference, which is too high. So if I try to run GPT-J, it takes 50-60 GB of RAM for inference. Is there a way around this, or am I doing something wrong?

I want to reduce the latency for GPT-J, as it is currently slow even on GPU when generating 400-500 tokens!

Hi @pankajdev007 Right, this is not ideal. Currently, memory is duplicated for decoder models, because there is one ONNX model that does not use the past key/values (for the first decoding iteration) and another ONNX model that does use them.
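You can actually see this duplication on disk after the export. Here is a small sketch that just lists the exported files and their sizes, assuming the exporter wrote files like decoder_model.onnx and decoder_with_past_model.onnx into the output directory (exact file names and layout can vary across optimum versions):

```python
# Rough sketch: list the exported files with their sizes, to show that the
# decoder without past key/values and the decoder with past key/values are
# two separate ONNX models, each carrying the full weights.
# Assumes the export directory from the first post; file names may differ
# across optimum versions.
from pathlib import Path

export_dir = Path("gptj_onnx")
for f in sorted(export_dir.glob("*")):
    print(f"{f.name}: {f.stat().st_size / 1e9:.2f} GB")
```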

This PR should be merged soon and fix the issue: Enable inference with a merged decoder in `ORTModelForCausalLM` by JingyaHuang · Pull Request #647 · huggingface/optimum · GitHub
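For reference, inference with the exported model through ORTModelForCausalLM already looks roughly like the sketch below; the merged decoder from that PR should let this same code path run with a single ONNX file instead of two. This is a minimal sketch, and the directory, prompt and generation settings are placeholders:

```python
# Minimal sketch: load the exported GPT-J decoder with ONNX Runtime via
# optimum and generate a few tokens. Directory, prompt and generation
# settings are placeholders.
from transformers import AutoTokenizer
from optimum.onnxruntime import ORTModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("EleutherAI/gpt-j-6B")
model = ORTModelForCausalLM.from_pretrained("gptj_onnx/", use_cache=True)

inputs = tokenizer("Hello, my name is", return_tensors="pt")
output_ids = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.batch_decode(output_ids, skip_special_tokens=True)[0])
```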

Additionally, I added support for exporting models directly in float16 by passing --fp16 --device cuda to the ONNX exporter: Support ONNX export on `torch.float16` type by fxmarty · Pull Request #749 · huggingface/optimum · GitHub
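Concretely, once that is in a release, the export command from the first post would presumably become something like the following (the --fp16 and --device flags are from the PR; the output directory name is just an example):

python -m optimum.exporters.onnx --task causal-lm-with-past --for-ort --fp16 --device cuda --model EleutherAI/gpt-j-6B gptj_onnx_fp16/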

Hopefully we will soon have a release that includes these two PRs!

Were you able to get this to work?