Run split-GPU inference with GPT-NeoX-20B

tohara-pandologic · March 20, 2023, 7:14am

Hi, is it possible to run inference with GPT-NeoX-20B in a split-GPU environment? I was hoping the following approach for GPT-J-6B would work (via EleutherAI/gpt-j-6B · GPTJForCausalLM hogs memory - inference only).

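from transformers import AutoModelForCausalLM, AutoTokenizer
from parallelformers import parallelize

# "Model Name" is a placeholder for the model ID
model = AutoModelForCausalLM.from_pretrained("Model Name")
tokenizer = AutoTokenizer.from_pretrained("Model Name")
# Shard the model across 2 GPUs in half precision
parallelize(model, num_gpus=2, fp16=True, verbose='detail')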

I ran into a not-implemented error when trying the same for GPT-NeoX-20B. I tried both AutoModelForCausalLM, as above, and GPTNeoXForCausalLM. In both cases the following assertion is raised.

  AssertionError: GPTNeoXForCausalLM is not supported yet

Unfortunately, the parallelformers repo has not seen much activity in quite a while, so I doubt that is the way to go.

Running it on a single GPU would seem to require an 80 GB GPU, which unfortunately is not yet available on AWS.
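For what it's worth, here is the back-of-the-envelope arithmetic behind that memory estimate. It is only a rough sketch: it assumes roughly 20 billion parameters (taken from the model name) and counts the weights alone, ignoring activations and the attention cache. The helper name weight_memory_gib is made up for illustration.

```python
def weight_memory_gib(num_params: float, bytes_per_param: int) -> float:
    """Memory needed to hold the weights alone, in GiB."""
    return num_params * bytes_per_param / 2**30


# Assumption: ~20 billion parameters, going by the model name.
NUM_PARAMS = 20e9

fp16_gib = weight_memory_gib(NUM_PARAMS, 2)  # 2 bytes per parameter in fp16
fp32_gib = weight_memory_gib(NUM_PARAMS, 4)  # 4 bytes per parameter in fp32

print(f"fp16 weights: ~{fp16_gib:.1f} GiB")
print(f"fp32 weights: ~{fp32_gib:.1f} GiB")
```

Even in fp16 the weights alone come to roughly 37 GiB under these assumptions, which leaves very little headroom on a 40 GB card once activations are added. Hence the choice between a larger GPU and splitting the model across several smaller ones.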

tohara-pandologic · April 3, 2023, 7:38am

I just verified for myself that this is very much possible. For example, it can be done via the text-generation-webui package using the following steps (for a 4-GPU machine):

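model_name="EleutherAI/gpt-NeoX-20B"
# Directory name used by the download script: first "/" replaced with "_"
model_dir=$(echo "$model_name" | perl -pe 's@/@_@;')

# Download the weights, then start the server with a 14 GiB cap on each of the 4 GPUs
python download-model.py "$model_name"
python server.py --auto-device --gpu-memory 14 14 14 14 --model "$model_dir"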

Behind the scenes, this uses standard Hugging Face packages, which makes this forum appear less useful than hoped.

Note that the model is stored in the current directory rather than in the Hugging Face cache.
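For anyone who wants to skip the web UI, the following is a minimal sketch of what I assume is roughly the equivalent call using transformers directly (with accelerate installed). The device_map and max_memory arguments to from_pretrained are real, but the mapping from the --auto-device and --gpu-memory flags onto them is my assumption, and the helper names build_max_memory, load_sharded, and generate_text are made up for illustration. The heavy imports are placed inside the functions so the memory-map helper can be tried without a GPU.

```python
def build_max_memory(num_gpus: int, gib_per_gpu: int) -> dict:
    """Per-device memory caps in the form accepted by from_pretrained(max_memory=...)."""
    return {gpu: f"{gib_per_gpu}GiB" for gpu in range(num_gpus)}


def load_sharded(model_name: str, num_gpus: int = 4, gib_per_gpu: int = 14):
    """Load a causal LM in fp16, letting accelerate spread the layers over the GPUs."""
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForCausalLM.from_pretrained(
        model_name,
        device_map="auto",  # assumption: this is what --auto-device corresponds to
        max_memory=build_max_memory(num_gpus, gib_per_gpu),
        torch_dtype=torch.float16,
    )
    return model, tokenizer


def generate_text(model, tokenizer, prompt: str, max_new_tokens: int = 20) -> str:
    """Generate a short continuation; inputs go to the device holding the first layers."""
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    output_ids = model.generate(**inputs, max_new_tokens=max_new_tokens)
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)
```

On a 4-GPU machine, something like model, tokenizer = load_sharded("EleutherAI/gpt-neox-20b") followed by generate_text(model, tokenizer, "Hello") should be enough to try it. In this case the weights go to the Hugging Face cache rather than the current directory.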
