Using GPT-Neo-125M with ONNX

Hey there,

I’m currently trying to export GPT-Neo-125M (EleutherAI/gpt-neo-125M on Hugging Face) to run in an ONNX session, since ONNX inference is claimed to be faster.

I’m trying to host a service that takes a prompt and returns a reply, using CPU-only inference.

Initial benchmarks were very promising, so I wanted to move ahead.
However, I can’t find any documentation online on how to get actual generated text out of the model. I spent the rest of the day trying to reverse engineer it from the Hugging Face source and some bits and pieces I found online (which were mostly about other models).

Here are my steps:

I put test_onnx.py together from what I could find, but the tokens it returned were super weird:

awsw-dev@3564dfe571f0:/opt/awsw$ python3 test_onnx.py 
Loaded model
Here be dragons... trososequose trose�agososeactosososose�ag her trose�actosososososose
Test run 1 took 1.0165s...
Here be dragons... trososequose trose�agososeactosososose�ag her trose�actosososososose
Test run 2 took 1.0314s...
Here be dragons... trososequose trose�agososeactosososose�ag her trose�actosososososose
Test run 3 took 0.7820s...
Here be dragons... trososequose trose�agososeactosososose�ag her trose�actosososososose
Test run 4 took 1.1293s...
Here be dragons... trososequose trose�agososeactosososose�ag her trose�actosososososose
Test run 5 took 0.8726s...
Here be dragons... trososequose trose�agososeactosososose�ag her trose�actosososososose
Test run 6 took 0.9794s...
Here be dragons... trososequose trose�agososeactosososose�ag her trose�actosososososose
Test run 7 took 1.0277s...
Here be dragons... trososequose trose�agososeactosososose�ag her trose�actosososososose
Test run 8 took 0.7913s...
Here be dragons... trososequose trose�agososeactosososose�ag her trose�actosososososose
Test run 9 took 1.2652s...
Here be dragons... trososequose trose�agososeactosososose�ag her trose�actosososososose
Test run 10 took 1.1225s...
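
For reference, here is a minimal sketch of the greedy decoding loop as I understand it: run the model, take the scores for the next token only, pick the highest one, append it, and repeat. It is not my actual script. `greedy_decode` and `toy_logits` are hypothetical names, and `toy_logits` is a stand-in for the ONNX session so the loop runs on its own. With a real session, that callable would presumably wrap the session call and return the logits at the last sequence position, and the resulting ids would go through the tokenizer's decode step.

```python
from typing import Callable, List, Sequence


def greedy_decode(
    next_token_logits: Callable[[List[int]], Sequence[float]],
    prompt_ids: List[int],
    max_new_tokens: int,
    eos_id: int,
) -> List[int]:
    """Extend prompt_ids one token at a time, always taking the top-scoring token."""
    ids = list(prompt_ids)
    for _ in range(max_new_tokens):
        # Scores for the NEXT token only (one value per vocabulary entry).
        logits = next_token_logits(ids)
        # Argmax over the vocabulary.
        next_id = max(range(len(logits)), key=lambda i: logits[i])
        if next_id == eos_id:
            break  # stop at end-of-sequence without appending it
        ids.append(next_id)
    return ids


def toy_logits(ids: List[int]) -> List[float]:
    """Hypothetical stand-in for the model: vocabulary of 5, prefers (last id + 1) % 5."""
    scores = [0.0] * 5
    scores[(ids[-1] + 1) % 5] = 1.0
    return scores


if __name__ == "__main__":
    print(greedy_decode(toy_logits, [0], max_new_tokens=10, eos_id=4))
```

If my script differs from this loop, for example by taking the argmax over every position of a single forward pass instead of feeding each new token back in, that might explain the repeated garbage, but that is only a guess.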

If anyone has any idea, please lmk. DDG/Google turned up nothing. All I could find is people using the model as a benchmark, with no decoding of the output tokens back into text. Thanks a lot!