Llama2-70b-chat: CUDA out of memory when loading with from_pretrained

I want to finetune some layers of llama2-70b, but I hit an OOM error when I try to load the model with "from_pretrained". I suspect something is wrong with the model parallelism. My server has 8 x A100 GPUs and my code is as follows:

import os

import torch
import torch.distributed as dist
from transformers import LlamaForCausalLM

dist.init_process_group(backend="nccl")

# torchrun-specific environment variables
local_rank = int(os.environ["LOCAL_RANK"])
rank = int(os.environ["RANK"])
world_size = int(os.environ["WORLD_SIZE"])

if torch.distributed.is_initialized():
    torch.cuda.set_device(local_rank)  # pin each process to its own GPU
    setup_environ_flags(rank)  # helper defined elsewhere in my script

model = LlamaForCausalLM.from_pretrained(
    model_path,  # local path to the llama2-70b-chat checkpoint
    torch_dtype=torch.float16,
    device_map="auto",
)

I launch this script with DeepSpeed:

deepspeed --num_gpus=8  --master_port $MASTER_PORT main.py --deepspeed deepspeed.json \ ...

Launching with torchrun instead:

torchrun --nnodes 1 --nproc_per_node 8 main.py --deepspeed deepspeed.json \ ...

also hits the same OOM error.
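For reference, my understanding is that device_map="auto" on its own should shard the fp16 checkpoint across all visible GPUs, so a plain single-process load would look roughly like this (a minimal sketch, not my training script; the Hub id is a placeholder for my local model_path):

import torch
from transformers import LlamaForCausalLM

# Single-process load: accelerate splits the ~140 GB of fp16 weights
# across every visible GPU, so each A100 should only hold a slice.
model = LlamaForCausalLM.from_pretrained(
    "meta-llama/Llama-2-70b-chat-hf",  # placeholder for my local model_path
    torch_dtype=torch.float16,
    device_map="auto",
)
print(model.hf_device_map)  # shows which layer ended up on which GPU

When I launch with deepspeed or torchrun, each of the 8 processes executes the same from_pretrained call, and I am not sure whether that interacts badly with device_map="auto".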
I would really appreciate any suggestions.