I want to finetune some layers of llama2-70b, but I hit an OOM error when I try to load the model with from_pretrained. I suspect I am setting up the model parallelism incorrectly. My server has 8 * A100 GPUs, and my code is as follows:
import os

import torch
import torch.distributed as dist
from transformers import LlamaForCausalLM

dist.init_process_group(backend='nccl')

# torchrun-specific environment variables
local_rank = int(os.environ["LOCAL_RANK"])
rank = int(os.environ["RANK"])
world_size = int(os.environ["WORLD_SIZE"])

if torch.distributed.is_initialized():
    torch.cuda.set_device(rank)  # single node, so rank == local_rank
    setup_environ_flags(rank)    # helper from my training utils

# model_path points to my local llama2-70b checkpoint
model = LlamaForCausalLM.from_pretrained(model_path,
                                         torch_dtype=torch.float16,
                                         device_map="auto")
I launch this script with deepspeed:

deepspeed --num_gpus=8 --master_port $MASTER_PORT main.py --deepspeed deepspeed.json \ ...

Running it with torchrun instead:

torchrun --nnodes 1 --nproc_per_node 8 main.py --deepspeed deepspeed.json \..

hits the same OOM issue.
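For context, here is roughly what I mean by finetuning only some layers. It is just a sketch assuming the model from the snippet above has loaded successfully; the choice of the last four decoder layers plus lm_head is a placeholder, not my final selection:

# Freeze everything first, then unfreeze only the parts I want to train.
# `model` is the LlamaForCausalLM loaded above; llama2-70b has 80 decoder
# layers under model.model.layers.
for param in model.parameters():
    param.requires_grad = False

for layer in model.model.layers[-4:]:
    for param in layer.parameters():
        param.requires_grad = True

for param in model.lm_head.parameters():
    param.requires_grad = True

trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(f"trainable params: {trainable:,}")

I never get this far, though; the OOM happens at the from_pretrained call itself.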
I would really appreciate any suggestions.