GPT-NeoX inference OOM with plenty of available memory

I’ve been struggling with this for quite some time now. I want to test GPT-NeoX-20B on my own server, which has considerably more than 50 GB of total VRAM spread across multiple GPUs.
I load the model like this:
model = GPTNeoXForCausalLM.from_pretrained("EleutherAI/gpt-neox-20b").half().cuda()
I also tried:
GPTNeoXForCausalLM.from_pretrained("EleutherAI/gpt-neox-20b", device_map="balanced_low_0", torch_dtype=torch.float16).to("cuda") (GPU 0 has 16 GB of memory)
and
GPTNeoXForCausalLM.from_pretrained("EleutherAI/gpt-neox-20b", device_map="auto", torch_dtype=torch.float16).to("cuda")

The result is always the same: the weights start loading evenly across the cards, and once GPU 0 reaches its 16 GB limit the next allocation fails, even though the other cards still have plenty of free VRAM (at some point more than half).
Error:
RuntimeError: CUDA out of memory. Tried to allocate 216.00 MiB (GPU 0; 15.99 GiB total capacity; 15.02 GiB already allocated; 0 bytes free; 15.02 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF

Here is the full code:
from transformers import AutoTokenizer, GPTNeoXForCausalLM, GPTNeoXTokenizerFast
import time
import torch
import faulthandler
faulthandler.enable()
model_id = "EleutherAI/gpt-neox-20b"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = GPTNeoXForCausalLM.from_pretrained(model_id, device_map="auto").half().cuda()
print("model is loaded")
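For reference, here is a minimal sketch of how I understand the sharded loading is supposed to look with accelerate's device_map, without calling .half()/.cuda() afterwards (which, as far as I can tell, moves the already-dispatched model onto a single GPU). The max_memory caps are made-up values for illustration, not something from my actual setup:

from transformers import AutoTokenizer, GPTNeoXForCausalLM
import torch

model_id = "EleutherAI/gpt-neox-20b"
tokenizer = AutoTokenizer.from_pretrained(model_id)

model = GPTNeoXForCausalLM.from_pretrained(
    model_id,
    device_map="auto",             # let accelerate shard the layers across the GPUs
    torch_dtype=torch.float16,     # load weights directly in fp16 instead of .half() afterwards
    max_memory={0: "14GiB", 1: "22GiB", 2: "22GiB", 3: "22GiB"},  # hypothetical per-GPU caps
)

inputs = tokenizer("Hello, my name is", return_tensors="pt").to(0)
outputs = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(outputs[0]))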

I get the same error when using 2x RTX 3090.

Probably already solved, but in case anyone stumbles onto this: try setting `eval_accumulation_steps=1` in the Trainer arguments. Otherwise the inference outputs are accumulated on the GPU for the entire eval set instead of being moved to the CPU after each step.
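A minimal sketch of where that flag goes, assuming a standard Trainer evaluation setup; `model` and `eval_dataset` are placeholders, and the other argument values are only illustrative:

from transformers import Trainer, TrainingArguments

training_args = TrainingArguments(
    output_dir="out",              # placeholder output path
    per_device_eval_batch_size=1,
    eval_accumulation_steps=1,     # move predictions to CPU after every eval step
)

trainer = Trainer(
    model=model,                   # e.g. the GPTNeoXForCausalLM instance above
    args=training_args,
    eval_dataset=eval_dataset,     # placeholder eval dataset
)
metrics = trainer.evaluate()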
