I want to finetune some layers of llama2-70b, but I hit an OOM error when I try to load the model with from_pretrained. I suspect I am setting up the model parallelism incorrectly. My server has 8 * A100 GPUs, and my code is as follows:
import os

import torch
import torch.distributed as dist
from transformers import LlamaForCausalLM

dist.init_process_group(backend='nccl')

# torchrun-specific environment variables
local_rank = int(os.environ["LOCAL_RANK"])
rank = int(os.environ["RANK"])
world_size = int(os.environ["WORLD_SIZE"])

if torch.distributed.is_initialized():
    torch.cuda.set_device(rank)  # single node, so rank == local_rank
    setup_environ_flags(rank)    # helper from my training utils

# model_path points to my local llama2-70b checkpoint
model = LlamaForCausalLM.from_pretrained(model_path,
                                         torch_dtype=torch.float16,
                                         device_map="auto")
I launch this script with deepspeed:

deepspeed --num_gpus=8 --master_port $MASTER_PORT main.py --deepspeed deepspeed.json \ ...

Running it with torchrun instead:

torchrun --nnodes 1 --nproc_per_node 8 main.py --deepspeed deepspeed.json \..

hits the same OOM issue.
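For context, here is roughly what I mean by finetuning only some layers. It is just a sketch assuming the model from the snippet above has loaded successfully; the choice of the last four decoder layers plus lm_head is a placeholder, not my final selection:

# Freeze everything first, then unfreeze only the parts I want to train.
# `model` is the LlamaForCausalLM loaded above; llama2-70b has 80 decoder
# layers under model.model.layers.
for param in model.parameters():
    param.requires_grad = False

for layer in model.model.layers[-4:]:
    for param in layer.parameters():
        param.requires_grad = True

for param in model.lm_head.parameters():
    param.requires_grad = True

trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(f"trainable params: {trainable:,}")

I never get this far, though; the OOM happens at the from_pretrained call itself.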
I would really appreciate any suggestions.