Run split-GPU inference with GPT-NeoX-20B

Hi, is it possible to run inference with GPT-NeoX-20B split across multiple GPUs? I was hoping the following approach, which works for GPT-J-6B, would also work here (via EleutherAI/gpt-j-6B · GPTJForCausalLM hogs memory - inference only):

from transformers import AutoModelForCausalLM, AutoTokenizer
from parallelformers import parallelize

model = AutoModelForCausalLM.from_pretrained("Model Name")
tokenizer = AutoTokenizer.from_pretrained("Model Name")

# Shard the model across two GPUs in fp16
parallelize(model, num_gpus=2, fp16=True, verbose='detail')

I ran into a not-implemented error when trying the same for GPT-NeoX-20B. I tried both AutoModelForCausalLM (as above) and GPTNeoXForCausalLM; the following assertion is raised:

  AssertionError: GPTNeoXForCausalLM is not supported yet

Unfortunately, the parallelformers repo has not seen much activity in quite a while, so I doubt that is the way to go.

Running it on a single GPU would seem to require an 80 GB card, which unfortunately is not yet available on AWS.
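The 80 GB figure can be sanity-checked with back-of-envelope arithmetic (a rough sketch; the 20B parameter count is approximate, and it ignores activation and KV-cache overhead):

```python
# Rough memory estimate for GPT-NeoX-20B weights alone
params = 20e9           # approximate parameter count
bytes_fp16 = 2          # bytes per parameter in float16
bytes_fp32 = 4          # bytes per parameter in float32

weights_fp16_gb = params * bytes_fp16 / 1e9   # ~40 GB in fp16
weights_fp32_gb = params * bytes_fp32 / 1e9   # ~80 GB in fp32
print(weights_fp16_gb, weights_fp32_gb)       # 40.0 80.0
```

Even in fp16, the ~40 GB of weights plus inference overhead will not fit comfortably on a single 40 GB card, hence the 80 GB requirement.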

I just verified for myself that this is very much possible. For example, it can be done with the text-generation-webui package using the following steps (on a 4-GPU machine):

# Convert the model id (e.g. "org/model") into the directory name used on disk
model_dir=$(echo "$model_name" | perl -pe 's@/@_@;')

python download-model.py "$model_name"
python server.py --auto-devices --gpu-memory 14 14 14 14 --model "$model_dir"

Behind the scenes, this uses standard Hugging Face packages, which makes this forum appear less useful than hoped.
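Since it comes down to standard Hugging Face machinery, the equivalent split-GPU load can be sketched directly with transformers plus accelerate, using device_map="auto". This is a sketch, not the webui's exact code; the 14 GiB caps mirror the --gpu-memory 14 14 14 14 flags above, and it needs 4 GPUs plus the accelerate package installed:

```python
import torch
from transformers import AutoModelForCausalLM

# Cap each of the 4 GPUs at 14 GiB (mirrors "--gpu-memory 14 14 14 14")
max_memory = {i: "14GiB" for i in range(4)}

# device_map="auto" lets Accelerate shard the layers across visible GPUs
model = AutoModelForCausalLM.from_pretrained(
    "EleutherAI/gpt-neox-20b",
    torch_dtype=torch.float16,
    device_map="auto",
    max_memory=max_memory,
)
```

Unlike the webui route, this path uses the regular Hugging Face cache for the downloaded weights.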

Note that the model is stored in the current directory rather than in the Hugging Face cache.