Can't load huge model onto multiple GPUs

Hey there, so I have a model that takes 45+ GB just to LOAD, and I'm trying to load it onto 4 A100 GPUs with 40 GB VRAM each. I do the following:

model = model.from_pretrained(…, device_map="auto") and it still just gives me this error:

    OutOfMemoryError: CUDA out of memory. Tried to allocate 160.00 MiB (GPU 0; 39.45 GiB total capacity; 38.59 GiB already allocated;
    42.25 MiB free; 38.59 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb
    to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF

I legit have no idea what's going on, because this is documented on Hugging Face as a solution that is supposed to work. Any pointers?

Thank you.

I had trouble splitting models across 2 GPUs until I also set the max_memory parameter, by creating a Python dict object such as the following:

    # Map each GPU index to its memory limit string,
    # e.g. {0: '10000MiB', 1: '11000MiB'}
    limits = {}
    for n in range(len(self._maxGPUMemory)):
        limits[n] = f'{self._maxGPUMemory[str(n)]}MiB'

    params['max_memory'] = limits

where printing limits shows me {0: '10000MiB', 1: '11000MiB'}.

I set the memory limit for each GPU a bit below its actual memory, and then the loading process seemed to work.

Hmm, how does that fix it though? Right now the model is nearly fully loaded onto the first GPU, but it straight up doesn't use the other GPUs. Does that mean that the memory limit on the others is 0?

I'm not sure exactly how memory allocation across GPUs is done. I'm just learning how this works myself, and using the max_memory parameter seemed to help in my case. I'm guessing that without it, the allocation process tries to load onto just one GPU and runs out of memory, while setting a max_memory value for each of your GPUs, 0, 1, 2 and 3, to a lower limit solves this.
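For your four 40 GB A100s, my guess is a max_memory dict along these lines (the 36000MiB per card is just a guess that leaves a few GB of headroom, not a tested number):

    # One entry per GPU index 0-3; keep each limit a bit below the card's
    # 40 GB so there is headroom for the CUDA context and activations.
    max_memory = {0: '36000MiB', 1: '36000MiB', 2: '36000MiB', 3: '36000MiB'}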
This page, Handling big models for inference, also mentions using infer_auto_device_map, which attempts to map model layers to GPUs, instead of using an 'auto' map.

Hmm, alright. Could you show me a snippet of how you set the device memory limit?

I have a Python dict object named 'params' in which I set all the parameters I need to load the model. I keep the memory limit values for both my GPUs in another Python dict object, loaded from a JSON profile file.
I set up 'limits' to contain my memory limit settings as follows:


        # Map each GPU index to its memory limit string
        limits = {}
        for n in range(len(self._maxGPUMemory)):
            limits[n] = f'{self._maxGPUMemory[str(n)]}MiB'

        # Optionally allow offloading onto CPU RAM as well
        if self._useCPU:
            limits['cpu'] = f'{self._maxCPUMemory}MiB'
        params['max_memory'] = limits

After I set up the remaining values in params, I load the model as:

        model = AutoModelForCausalLM.from_pretrained(self._modelPath, **params)
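If it helps, here is roughly the same flow without my class plumbing, all in one place (the model path is a placeholder, the float16 dtype is just my assumption, and the limits are the ones printed above):

        import torch
        from transformers import AutoModelForCausalLM

        params = {
            'device_map': 'auto',
            # A bit below each card's actual VRAM; yours would have an entry per GPU
            'max_memory': {0: '10000MiB', 1: '11000MiB'},
            # Assumption: half precision roughly halves the memory needed for the weights
            'torch_dtype': torch.float16,
        }
        model = AutoModelForCausalLM.from_pretrained('/path/to/model', **params)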

In the case where I want to build a device map instead of using 'auto' as my device_map, I call accelerate's infer_auto_device_map as:

            # requires: from accelerate import infer_auto_device_map
            params['device_map'] = infer_auto_device_map(
                model,
                max_memory=params['max_memory'],
                no_split_module_classes=model._no_split_modules,
                dtype=torchDType,
            )
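One wrinkle: infer_auto_device_map needs a model object to inspect, so the Accelerate docs first instantiate the model with empty weights so you don't pay the memory cost twice. A rough sketch of that pattern (the model path, limits and dtype are placeholders):

        from accelerate import infer_auto_device_map, init_empty_weights
        from transformers import AutoConfig, AutoModelForCausalLM
        import torch

        # Build the model structure without allocating real weights
        config = AutoConfig.from_pretrained('/path/to/model')
        with init_empty_weights():
            empty_model = AutoModelForCausalLM.from_config(config)

        # Work out which layers fit on which device under the given limits
        device_map = infer_auto_device_map(
            empty_model,
            max_memory={0: '36000MiB', 1: '36000MiB', 2: '36000MiB', 3: '36000MiB'},
            no_split_module_classes=empty_model._no_split_modules,
            dtype=torch.float16,
        )

        # Then load for real, using the computed map
        model = AutoModelForCausalLM.from_pretrained(
            '/path/to/model',
            device_map=device_map,
            torch_dtype=torch.float16,
        )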