Problem with ONNX export and usage

Hello everyone, I have been trying to speed up the GPT-Neo 1.3B model using ONNX and have run into significant issues.

I first exported the GPT-Neo 1.3B model using the Causal-LM feature. This created a folder with lots of files and the model.onnx file as well.

After that, I tried running the ONNX model with ONNX Runtime, as shown on this page.

Here is the code I used.

import onnxruntime as rt
from transformers import GPT2Tokenizer

# model_name: the GPT-Neo 1.3B checkpoint name (defined elsewhere)
tokenizer = GPT2Tokenizer.from_pretrained(model_name)
ONNX_PROVIDERS = ['TensorrtExecutionProvider', 'CUDAExecutionProvider', 'CPUExecutionProvider']

session = rt.InferenceSession("onnx/model.onnx", providers=ONNX_PROVIDERS)

inputs = tokenizer("Using gpt-neo with ONNX Runtime and ", return_tensors="np")
outputs = session.run(output_names=["logits"], input_feed=dict(inputs))

I used the %%time magic in the Jupyter cell and the above code took more than 5 minutes to execute.

After that I used a longer sentence and tried to run inference again, but the cell never finished executing (I waited for about an hour).

inputs = tokenizer("Using gpt-neo with ONNX Runtime again and this time with many more words which will put considerable load on the GPU as well as the CPU ", return_tensors="np")
outputs = session.run(output_names=["logits"], input_feed=dict(inputs))

I seem to be missing something, as I am certain this shouldn’t take so long. Could anyone please help me?
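
In case it helps narrow things down: %%time on the whole cell also counts tokenizer loading and session creation, not just the inference call. Below is a minimal sketch of how each step could be timed on its own. The helper name time_call is hypothetical (not part of ONNX Runtime or Transformers), and the workload is a stand-in so the snippet runs without a model or GPU.

```python
import time


def time_call(fn, *args, **kwargs):
    """Call fn(*args, **kwargs) and return (result, elapsed_seconds).

    Hypothetical helper for illustration; it only wraps time.perf_counter().
    """
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    elapsed = time.perf_counter() - start
    return result, elapsed


# Stand-in workload so the sketch runs anywhere; replace it with the
# session creation or session.run call to measure those steps separately.
total, seconds = time_call(sum, range(1_000_000))
print(f"sum took {seconds:.4f} s")
```

With the names from the code above, this might look like `session, load_s = time_call(rt.InferenceSession, "onnx/model.onnx", providers=ONNX_PROVIDERS)` followed by `outputs, run_s = time_call(session.run, ["logits"], dict(inputs))`. Printing `session.get_providers()` should also show which execution providers the session actually enabled.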