The general problem
I am having trouble loading the model “EleutherAI/gpt-neox-20b” with the GPTNeoXForCausalLM.from_pretrained() method. I have two GPUs, each with 31.74 GiB of memory available.
Can someone tell me what I am doing wrong or guide me to some practical documentation that might help? Below I have described what I have already tried.
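For context, here is the rough back-of-envelope sizing I am working from (my own estimate, assuming roughly 20B parameters and counting weights only, ignoring activations, optimizer state and CUDA overhead):

# Rough weight-only sizing sketch; ~20e9 parameters is an assumption on my part.
n_params = 20e9
print(f"fp32: {n_params * 4 / 2**30:.1f} GiB")  # ~74.5 GiB -> does not fit even across both GPUs
print(f"fp16: {n_params * 2 / 2**30:.1f} GiB")  # ~37.3 GiB -> too big for one GPU, should fit split across two

So my expectation is that the model can only work in float16, and only if it is split across both GPUs.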
Basic approach recommended by the docs
This approach is recommended by the documentation here:
from transformers import GPTNeoXForCausalLM
model = GPTNeoXForCausalLM.from_pretrained("EleutherAI/gpt-neox-20b").half().cuda()
This leads to the following error:
CUDA out of memory. Tried to allocate 288.00 MiB (GPU 0; 31.74 GiB total capacity; 30.54 GiB already allocated; 242.81 MiB free; 30.55 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
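A small diagnostic sketch (using torch.cuda.mem_get_info) for checking how much memory is actually free on each GPU before loading, in case the numbers differ from the ones in the error message:

import torch

# Print free/total memory for each visible GPU before loading the model.
for i in range(torch.cuda.device_count()):
    free, total = torch.cuda.mem_get_info(i)
    print(f"GPU {i}: {free / 2**30:.2f} GiB free of {total / 2**30:.2f} GiB")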
Approach from the Eleuther GitHub page
This approach was recommended here:
from transformers import AutoModelForCausalLM
from accelerate import dispatch_model, infer_auto_device_map
from accelerate.utils import get_balanced_memory

model = AutoModelForCausalLM.from_pretrained('EleutherAI/gpt-neox-20b')

# Compute a balanced per-device memory budget.
max_memory = get_balanced_memory(
    model,
    max_memory=None,
    no_split_module_classes=["GPTNeoXLayer"],
    dtype='float16',
    low_zero=False,
)

# Assign modules to devices according to that budget.
device_map = infer_auto_device_map(
    model,
    max_memory=max_memory,
    no_split_module_classes=["GPTNeoXLayer"],
    dtype='float16',
)

# Move the weights to their assigned devices.
model = dispatch_model(model, device_map=device_map)
This leads to the following error:
OutOfMemoryError: CUDA out of memory. Tried to allocate 576.00 MiB (GPU 1; 31.74 GiB total capacity; 30.48 GiB already allocated; 317.12 MiB free; 30.49 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
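My suspicion (I may be misunderstanding the API) is that the dtype argument above only affects the size estimate used to build the device map, while the weights being dispatched are still the float32 copy created by from_pretrained. A sketch of the variant I would try instead, loading in float16 up front and capping each GPU explicitly (the 28GiB caps are my own guess to leave headroom, not a documented value):

import torch
from transformers import AutoModelForCausalLM
from accelerate import dispatch_model, infer_auto_device_map

# Load the weights in float16 up front, then infer a device map with an
# explicit per-GPU cap so some memory stays free for activations.
model = AutoModelForCausalLM.from_pretrained(
    "EleutherAI/gpt-neox-20b", torch_dtype=torch.float16
)
device_map = infer_auto_device_map(
    model,
    max_memory={0: "28GiB", 1: "28GiB"},
    no_split_module_classes=["GPTNeoXLayer"],
)
model = dispatch_model(model, device_map=device_map)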
Personal experimentation
I experimented with some of the suggestions from Hugging Face for loading large models. I have tried several permutations, but they all lead to similar results. One example:
checkpoint = "EleutherAI/gpt-neox-20b"
model = GPTNeoXForCausalLM.from_pretrained(checkpoint, device_map="auto", low_cpu_mem_usage=True, torch_dtype=torch.float16)
This does NOT run into an error. However, when I try to instantiate the trainer, I get the following error: