GPT-NeoX inference OOM with plenty of available memory

I’ve been struggling with this for quite some time now. I want to test GPT-NeoX-20B on my own server, which has considerably more than 50 GB of total VRAM spread across multiple GPUs.
I load the model like this:
model = GPTNeoXForCausalLM.from_pretrained("EleutherAI/gpt-neox-20b").half().cuda()
I also tried:
GPTNeoXForCausalLM.from_pretrained("EleutherAI/gpt-neox-20b", device_map="balanced_low_0", torch_dtype=torch.float16).to("cuda") (GPU 0 has 16 GB of memory)
and
GPTNeoXForCausalLM.from_pretrained("EleutherAI/gpt-neox-20b", device_map="auto", torch_dtype=torch.float16).to("cuda")

The result is always the same: the weights start loading evenly across the cards, and once GPU 0 reaches its 16 GB limit the next allocation fails, even though the other cards still have plenty of free VRAM (at some point more than half).
Error:
RuntimeError: CUDA out of memory. Tried to allocate 216.00 MiB (GPU 0; 15.99 GiB total capacity; 15.02 GiB already allocated; 0 bytes free; 15.02 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF

Here is the full code:
from transformers import AutoTokenizer, GPTNeoXForCausalLM, GPTNeoXTokenizerFast
import time
import torch
import faulthandler
faulthandler.enable()
model_id = "EleutherAI/gpt-neox-20b"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = GPTNeoXForCausalLM.from_pretrained(model_id, device_map="auto").half().cuda()
print("model is loaded")
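For reference, here is a minimal sketch of how I understand the sharded loading is supposed to look with accelerate's device_map, without calling .half()/.cuda() afterwards (which, as far as I can tell, moves the already-dispatched model onto a single GPU). The max_memory caps are made-up values for illustration, not something from my actual setup:

from transformers import AutoTokenizer, GPTNeoXForCausalLM
import torch

model_id = "EleutherAI/gpt-neox-20b"
tokenizer = AutoTokenizer.from_pretrained(model_id)

model = GPTNeoXForCausalLM.from_pretrained(
    model_id,
    device_map="auto",             # let accelerate shard the layers across the GPUs
    torch_dtype=torch.float16,     # load weights directly in fp16 instead of .half() afterwards
    max_memory={0: "14GiB", 1: "22GiB", 2: "22GiB", 3: "22GiB"},  # hypothetical per-GPU caps
)

inputs = tokenizer("Hello, my name is", return_tensors="pt").to(0)
outputs = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(outputs[0]))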

I get the same error when using 2x RTX 3090.

Probably already solved, but in case anyone stumbles onto this: try setting `eval_accumulation_steps=1` in the Trainer arguments. Otherwise the inference outputs are accumulated on the GPU for the entire eval set instead of being moved to the CPU after each step.
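A minimal sketch of where that flag goes, assuming a standard Trainer evaluation setup; `model` and `eval_dataset` are placeholders, and the other argument values are only illustrative:

from transformers import Trainer, TrainingArguments

training_args = TrainingArguments(
    output_dir="out",              # placeholder output path
    per_device_eval_batch_size=1,
    eval_accumulation_steps=1,     # move predictions to CPU after every eval step
)

trainer = Trainer(
    model=model,                   # e.g. the GPTNeoXForCausalLM instance above
    args=training_args,
    eval_dataset=eval_dataset,     # placeholder eval dataset
)
metrics = trainer.evaluate()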
