OOM when using torch.nn.parallel.DistributedDataParallel to train LLaMA-7B

I intended to train LLaMA-7B (float16) on 4 NVIDIA 3090 GPUs with DistributedDataParallel, but I ran into an out-of-memory error. The LLaMA-7B weights occupy around 15 GB on each GPU (24 GB total), but as soon as I call torch.nn.parallel.DistributedDataParallel(), the error occurs.

-----the code is like below---------
import torch
from transformers import LlamaForCausalLM

model = LlamaForCausalLM.from_pretrained("xxxx", torch_dtype=torch.float16)
model.to(device)  # running OK here, GPU memory is 15G/24G
model = torch.nn.parallel.DistributedDataParallel(
    model, device_ids=[args.local_rank], output_device=args.local_rank,
    find_unused_parameters=True)  # out of memory happens here

I would like to know why this happens when the model is wrapped in DDP, and why additional memory gets allocated at that point. Does anyone know? I would be very grateful for any help.
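
In case it helps, this is roughly how I check the allocated memory around the wrap. It is only a minimal sketch: the log_cuda_mem helper is my own, and it assumes the process group has already been initialized with torch.distributed.init_process_group and that model/device/args are set up as in the snippet above.

import torch
import torch.distributed as dist

def log_cuda_mem(tag: str, device: torch.device) -> None:
    # memory_allocated() reports the bytes currently held by tensors on this
    # GPU, so it shows how much the DDP wrap adds on top of the fp16 weights.
    gib = torch.cuda.memory_allocated(device) / 1024**3
    print(f"[rank {dist.get_rank()}] {tag}: {gib:.2f} GiB")

# used around the snippet above:
#   log_cuda_mem("after model.to(device)", device)
#   model = torch.nn.parallel.DistributedDataParallel(...)
#   log_cuda_mem("after DDP wrap", device)  # never reached; the OOM is raised during the wrap itself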