Hi, is it possible to run inference with GPT-NeoX-20B in a multi-GPU environment? I was hoping the following approach, suggested for GPT-J-6B (via EleutherAI/gpt-j-6B · GPTJForCausalLM hogs memory - inference only), would also work here.
from transformers import AutoModelForCausalLM, AutoTokenizer
from parallelformers import parallelize

model = AutoModelForCausalLM.from_pretrained("Model Name")
tokenizer = AutoTokenizer.from_pretrained("Model Name")

# Shard the loaded model across two GPUs with fp16 weights
parallelize(model, num_gpus=2, fp16=True, verbose='detail')
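For context, once parallelize succeeds, inference goes through the usual generate API (a minimal sketch; the prompt and generation settings are just placeholders):

# parallelformers handles device placement itself,
# so the inputs are not moved to CUDA manually here
inputs = tokenizer("Hello, my name is", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))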
I ran into a not-implemented error when trying the same for GPT-NeoX-20B. I tried AutoModelForCausalLM as above as well as GPTNeoXForCausalLM; in both cases the following assertion is raised:
AssertionError: GPTNeoXForCausalLM is not supported yet
Unfortunately, the parallelformers repo has not seen much activity in quite a while, so I doubt that is the way to go.
To run it on a single GPU instead, this would seem to require an 80 GB card: the 20B parameters alone take roughly 40 GB in fp16, before activations and overhead. Unfortunately, GPUs that large are not yet available on AWS.
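In case it helps: as an alternative to parallelformers, recent versions of transformers can shard a checkpoint across GPUs via Accelerate's big-model inference (device_map="auto"). A rough sketch of what I have in mind, assuming two GPUs, the EleutherAI/gpt-neox-20b checkpoint on the Hub, and accelerate installed (I haven't verified the exact memory fit):

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "EleutherAI/gpt-neox-20b"
tokenizer = AutoTokenizer.from_pretrained(model_name)

# device_map="auto" lets Accelerate spread the layers across all
# visible GPUs (spilling to CPU if they don't fit);
# requires `pip install accelerate`
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    device_map="auto",
    torch_dtype=torch.float16,
)

# Inputs go to the device holding the first shard (usually cuda:0);
# Accelerate's hooks move activations between devices from there
inputs = tokenizer("Hello, my name is", return_tensors="pt").to("cuda:0")
outputs = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))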