I am trying to run multi-GPU inference for Llama 2 7B. I am running on NVIDIA RTX A6000 GPUs, so the model should fit on a single GPU. I set up the accelerate config file as follows:
Which type of machine are you using?
multi-GPU
How many different machines will you use (use more than 1 for multi-node training)? [1]:
Should distributed operations be checked while running for errors? This can avoid timeout issues but will be slower. [yes/NO]:
Do you wish to optimize your script with torch dynamo?[yes/NO]:
Do you want to use DeepSpeed? [yes/NO]:
Do you want to use FullyShardedDataParallel? [yes/NO]:
Do you want to use Megatron-LM ? [yes/NO]:
How many GPU(s) should be used for distributed training? [1]:2
What GPU(s) (by id) should be used for training on this machine as a comma-seperated list? [all]:4,5,6,7
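As far as I understand, these answers produce a default_config.yaml roughly like the following (reconstructed from memory and trimmed to the relevant fields, so the exact keys may differ slightly):

compute_environment: LOCAL_MACHINE
distributed_type: MULTI_GPU
num_machines: 1
num_processes: 2
gpu_ids: 4,5,6,7
machine_rank: 0
use_cpu: false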
I run the script with accelerate launch and get the following out-of-memory error:
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 25.10 GiB. GPU 0 has a total capacty of 47.54 GiB of which 21.88 GiB is free. Including non-PyTorch memory, this process has 25.65 GiB memory in use. Of the allocated memory 25.23 GiB is allocated by PyTorch, and 12.97 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
triggered by the line
model, dataloader = accelerator.prepare(model, dataloader)
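For context, the relevant part of my script looks roughly like this (the model id, prompts, and generation loop are simplified stand-ins, but the structure is what I run):

import torch
from torch.utils.data import DataLoader
from transformers import AutoModelForCausalLM, AutoTokenizer
from accelerate import Accelerator

accelerator = Accelerator()

model_id = "meta-llama/Llama-2-7b-hf"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

# trivially small dataset: a couple of tokenized prompts
prompts = ["Hello, how are you?", "The capital of France is"]
dataset = [tokenizer(p, return_tensors="pt").input_ids.squeeze(0) for p in prompts]
dataloader = DataLoader(dataset, batch_size=1)

# this is the line that raises the CUDA out-of-memory error above
model, dataloader = accelerator.prepare(model, dataloader)

# the generation loop below is never reached
model.eval()
with torch.no_grad():
    for batch in dataloader:
        outputs = accelerator.unwrap_model(model).generate(
            batch.to(accelerator.device), max_new_tokens=20
        )
        print(tokenizer.decode(outputs[0], skip_special_tokens=True))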
Why is this happening? The dataloader wraps a trivially small dataset, the GPU has a capacity of 49140 MB, and Llama 2 7B is only about 13500 MB, so I should not be running into memory issues.
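For completeness, here is the back-of-the-envelope math behind those numbers (the 13500 MB figure assumes 2-byte/fp16 weights, which is how I arrived at it):

# rough numbers behind the claim above (assumes fp16, i.e. 2 bytes per parameter)
params = 7e9                        # Llama 2 7B
bytes_per_param = 2                 # fp16
weights_mib = params * bytes_per_param / 2**20
print(weights_mib)                  # ~13,350 MiB for the weights

gpu_total_mib = 49140               # RTX A6000 total, as reported by nvidia-smi
print(gpu_total_mib - weights_mib)  # ~35,790 MiB should still be free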