Can't load huge model onto multiple GPUs

Hey there, so I have a model that takes 45+ GB just to load, and I’m trying to load it onto 4 A100 GPUs with 40 GB of VRAM each. I do the following:

model = model.from_pretrained(…, device_map="auto") and it still just gives me this error:

OutOfMemoryError: CUDA out of memory. Tried to allocate 160.00 MiB (GPU 0; 39.45 GiB total capacity; 38.59 GiB already allocated; 42.25
MiB free; 38.59 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid
fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF

I legit have no idea what’s wrong, because this is documented on HuggingFace too as a solution that is supposed to work. Any pointers?

Thank you.

I had trouble splitting models across 2 GPUs until I also set the max_memory parameter, by creating a Python dict such as the following:

    limits = {}
    for n in range(len(self._maxGPUMemory)):
        limits[n] = f'{self._maxGPUMemory[str(n)]}MiB'

    params['max_memory'] = limits

where printing limits shows me {0: '10000MiB', 1: '11000MiB'}

I set the memory limit for each GPU a bit lower than that GPU’s total memory, and then the loading process seemed to work.
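For the 4 x A100 setup in the original question, the same idea might look roughly like the sketch below. The helper name build_max_memory and the 4096 MiB headroom are my own assumptions for illustration, not anything from transformers or accelerate; the only thing taken from this thread is the shape of the dict (GPU index mapped to an 'NNNNMiB' string, plus an optional 'cpu' key).

```python
def build_max_memory(gpu_total_mib, headroom_mib=2048, cpu_mib=None):
    """Build a max_memory-style dict, leaving some headroom on every GPU.

    gpu_total_mib: total memory of each GPU in MiB, in device order.
    headroom_mib:  amount to hold back per GPU (a guess; tune for your setup).
    cpu_mib:       optional CPU RAM budget for offloading.
    """
    limits = {}
    for index, total in enumerate(gpu_total_mib):
        if total <= headroom_mib:
            raise ValueError(f"GPU {index} has no memory left after headroom")
        limits[index] = f"{total - headroom_mib}MiB"
    if cpu_mib is not None:
        limits["cpu"] = f"{cpu_mib}MiB"
    return limits


# Four 40 GB cards, treated here as 40960 MiB each (hypothetical numbers).
limits = build_max_memory([40960] * 4, headroom_mib=4096)
print(limits)
# {0: '36864MiB', 1: '36864MiB', 2: '36864MiB', 3: '36864MiB'}
```

The resulting dict would then be passed as max_memory alongside device_map, as in the snippets in this thread. How much headroom is actually needed depends on the model and batch size, so treat the number as a starting point.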

Hmm, how does that fix it though? Right now the model is nearly loaded on the first GPU, but it straight up doesn’t recognize the other GPUs. Does that mean that the memory limit on the others is 0?

I’m not sure how memory allocation on GPUs is done. I’m just learning how this works myself, and using the max_memory parameter seemed to help in my case. My guess is that without it, the allocation process tries to load onto just one GPU and runs out of memory, while setting max_memory to a lower limit for each of the GPUs (0, 1, 2, 3) solves this.
The page “Handling big models for inference” also mentions infer_auto_device_map, which attempts to map model layers to GPUs, as an alternative to using an 'auto' map.
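As a rough intuition for why per-device limits change where layers end up, here is a toy greedy placement in plain Python. To be clear, this is not accelerate's actual algorithm, and the function name, layer names, and sizes are all made up for illustration. It only shows the general idea: layers go onto the current device until its limit would be exceeded, then placement moves on to the next device.

```python
def greedy_device_map(layer_sizes_mib, limits_mib):
    """Assign layers, in order, to devices without exceeding each device's limit.

    layer_sizes_mib: dict of layer name -> size in MiB (insertion order is used).
    limits_mib:      dict of device id -> memory budget in MiB.
    """
    devices = list(limits_mib)
    device_map = {}
    current = 0  # position in `devices` of the device being filled
    used = 0     # MiB already placed on the current device
    for name, size in layer_sizes_mib.items():
        # Move on to the next device while this layer does not fit.
        while current < len(devices) and used + size > limits_mib[devices[current]]:
            current += 1
            used = 0
        if current == len(devices):
            raise MemoryError(f"no device has room left for {name}")
        device_map[name] = devices[current]
        used += size
    return device_map


# A hypothetical 45000 MiB model made of five equal blocks.
layers = {f"layer{i}": 9000 for i in range(5)}
print(greedy_device_map(layers, {0: 36000, 1: 36000}))
# {'layer0': 0, 'layer1': 0, 'layer2': 0, 'layer3': 0, 'layer4': 1}
```

In this toy version, a limit close to the card's full capacity packs the first GPU almost completely, which leaves little room for anything else that has to live there. That would be consistent with the error in the first post (38.59 GiB allocated out of 39.45 GiB on GPU 0), though I can't say for sure that it is the cause.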

Hmm alright. Could you show me a snippet of how you set the device memory limit?

I have a Python dict named 'params' in which I set all the parameters I need to load the model. The memory limit values for both my GPUs are in another Python dict, loaded from a JSON profile file.
I set up 'limits' to contain my memory limit settings as

        limits = {}
        for n in range(len(self._maxGPUMemory)):
            limits[n] = f'{self._maxGPUMemory[str(n)]}MiB'

        if self._useCPU:
            limits['cpu'] = f'{self._maxCPUMemory}MiB'
        params['max_memory'] = limits

After I set up the remaining params parameter values, I load the model as

        model = AutoModelForCausalLM.from_pretrained(self._modelPath, **params)

In the case where I want to build a device map instead of using 'auto' as my device_map, I call accelerate's infer_auto_device_map as

            params['device_map'] = infer_auto_device_map(model, max_memory=params['max_memory'], no_split_module_classes=model._no_split_modules, dtype=torchDType)
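Since infer_auto_device_map needs a model object before the real weights are loaded, one way to put the pieces together is sketched below. This is only my guess at a workable flow, not necessarily how the snippets above are wired up: the function name load_with_explicit_map is hypothetical, float16 is an assumed default, and the imports sit inside the function only so the snippet can be pasted into a file without torch installed. It builds the model with empty weights first, so nothing large is allocated while the map is computed.

```python
def load_with_explicit_map(model_path, max_memory, dtype=None):
    """Infer a device map under max_memory, then load the model with it.

    model_path: local path or hub id of a causal LM checkpoint.
    max_memory: dict such as {0: '36864MiB', 1: '36864MiB', 'cpu': '30000MiB'}.
    dtype:      torch dtype used to size the layers (defaults to float16 here).
    """
    import torch
    from accelerate import infer_auto_device_map, init_empty_weights
    from transformers import AutoConfig, AutoModelForCausalLM

    if dtype is None:
        dtype = torch.float16  # assumption: half precision is acceptable

    # Build the architecture without allocating real weights.
    config = AutoConfig.from_pretrained(model_path)
    with init_empty_weights():
        empty_model = AutoModelForCausalLM.from_config(config)

    # Decide which module goes on which device, within the given limits.
    device_map = infer_auto_device_map(
        empty_model,
        max_memory=max_memory,
        no_split_module_classes=empty_model._no_split_modules,
        dtype=dtype,
    )

    # Load the real weights directly onto the mapped devices.
    return AutoModelForCausalLM.from_pretrained(
        model_path, device_map=device_map, torch_dtype=dtype
    )
```

Calling it would look like model = load_with_explicit_map(path, limits), with limits built as in the snippets above. Printing model.hf_device_map afterwards should show whether the layers actually ended up spread across all the GPUs.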