I want to finetune some layers of llama2-70b, but an OOM error occurs when I try to load the model with "from_pretrained". I suspect something is wrong with my model parallelism setup. My server has 8 * A100 GPUs, and my code is as follows:
dist.init_process_group(backend='nccl')  # torchrun specific
local_rank = int(os.environ["LOCAL_RANK"])
rank = int(os.environ["RANK"])
world_size = int(os.environ["WORLD_SIZE"])
if torch.distributed.is_initialized():
    torch.cuda.set_device(rank)
    setup_environ_flags(rank)
model = LlamaForCausalLM.from_pretrained(
    model_path,
    torch_dtype=torch.float16,
    device_map="auto",
)
I launch it with DeepSpeed:

deepspeed --num_gpus=8 --master_port $MASTER_PORT main.py --deepspeed deepspeed.json \ ...

Running it with torchrun instead:

torchrun --nnodes 1 --nproc_per_node 8 main.py --deepspeed deepspeed.json \ ...

also hits the same OOM issue.
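For context, here is my rough back-of-envelope memory estimate (my own numbers, not measured) showing why the fp16 weights alone cannot fit on a single 80 GB A100 and must be sharded across the 8 GPUs:

```python
# Back-of-envelope fp16 memory estimate for llama2-70b weights only
# (excludes activations, optimizer state, and gradients).
n_params = 70e9          # ~70B parameters
bytes_per_param = 2      # fp16 = 2 bytes per parameter
weights_gb = n_params * bytes_per_param / 1e9

print(f"weights alone: ~{weights_gb:.0f} GB")        # far more than one 80 GB A100
print(f"sharded over 8 GPUs: ~{weights_gb / 8:.1f} GB each")
```

So loading should only succeed if each process ends up holding a shard rather than a full copy of the weights.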
I would really appreciate any suggestions.