Running out of memory attempting to load model "EleutherAI/gpt-neox-20b"

The general problem

I have had trouble loading the model "EleutherAI/gpt-neox-20b" using the GPTNeoXForCausalLM.from_pretrained() method. I have two GPUs, each with 31.74 GiB of memory available.
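For context, a rough back-of-the-envelope estimate of what the weights alone require in float16 (ignoring activations, gradients, and optimizer state):

n_params = 20e9        # gpt-neox-20b has roughly 20 billion parameters
bytes_per_param = 2    # float16 uses 2 bytes per parameter
print(f"{n_params * bytes_per_param / 2**30:.1f} GiB")  # ~37.3 GiB

So the fp16 weights alone should not fit on a single 31.74 GiB GPU, but they should fit across both.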

Can someone tell me what I am doing wrong or guide me to some practical documentation that might help? Below I have described what I have already tried.

Basic approach recommended by the docs

This approach is recommended by the documentation here:

from transformers import GPTNeoXForCausalLM

# Load in float32, cast to float16, then move the whole model to the current GPU.
model = GPTNeoXForCausalLM.from_pretrained("EleutherAI/gpt-neox-20b").half().cuda()

This leads to the following error:

CUDA out of memory. Tried to allocate 288.00 MiB (GPU 0; 31.74 GiB total capacity; 30.54 GiB already allocated; 242.81 MiB free; 30.55 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
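As far as I understand, .cuda() with no argument moves the entire model onto the current device (GPU 0), so the second GPU is never used and the ~37 GiB of fp16 weights cannot fit. A quick way to confirm that only GPU 0 fills up:

import torch

# Report free/total memory per visible GPU; mem_get_info returns bytes.
for i in range(torch.cuda.device_count()):
    free, total = torch.cuda.mem_get_info(i)
    print(f"GPU {i}: {free / 2**30:.2f} GiB free of {total / 2**30:.2f} GiB")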

Approach from the EleutherAI GitHub page

This approach was recommended here:

from transformers import AutoModelForCausalLM
from accelerate import dispatch_model, infer_auto_device_map
from accelerate.utils import get_balanced_memory

# Note: no torch_dtype is passed, so the weights load in float32.
model = AutoModelForCausalLM.from_pretrained('EleutherAI/gpt-neox-20b')

# Compute a balanced per-GPU memory budget, sized as if the weights were float16.
max_memory = get_balanced_memory(
    model,
    max_memory=None,
    no_split_module_classes=["GPTNeoXLayer"],
    dtype='float16',
    low_zero=False,
)

# Derive a module-to-device map under that budget, keeping each
# GPTNeoXLayer on a single device.
device_map = infer_auto_device_map(
    model,
    max_memory=max_memory,
    no_split_module_classes=["GPTNeoXLayer"],
    dtype='float16',
)

# Move the model's modules to their assigned devices.
model = dispatch_model(model, device_map=device_map)

This leads to the following error:

OutOfMemoryError: CUDA out of memory. Tried to allocate 576.00 MiB (GPU 1; 31.74 GiB total capacity; 30.48 GiB already allocated; 317.12 MiB free; 30.49 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
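One thing I suspect (unconfirmed): from_pretrained above loads the checkpoint in float32, while get_balanced_memory and infer_auto_device_map were told dtype='float16', so dispatch_model ends up placing roughly twice as many bytes per GPU as the map budgeted for. A quick sanity check:

# If this prints torch.float32, each dispatched tensor is twice the size
# the float16 device map was computed for.
print(next(model.parameters()).dtype)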

Personal experimentation

I experimented with some of the suggestions from Hugging Face for loading large models. I have tried several permutations, but they all lead to similar results. One example:

import torch
from transformers import GPTNeoXForCausalLM

checkpoint = "EleutherAI/gpt-neox-20b"
model = GPTNeoXForCausalLM.from_pretrained(checkpoint, device_map="auto", low_cpu_mem_usage=True, torch_dtype=torch.float16)
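For completeness, a quick smoke test after loading (this assumes the tokenizer matches the checkpoint and that inputs can go on GPU 0, where device_map="auto" usually places the embedding layer):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(checkpoint)
inputs = tokenizer("Hello, my name is", return_tensors="pt").to("cuda:0")
print(tokenizer.decode(model.generate(**inputs, max_new_tokens=20)[0]))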

Loading the model this way does NOT run into an error. However, when I try to instantiate the Trainer, I get the following error: