Training the llama2-13b-16k model with PEFT on 3× A100 80GB GPUs still throws CUDA out of memory

Is this normal? The only major parameter that might be having an impact is max_length. I am trying to fine-tune on two datasets, one with an average input size of 4k tokens and another of 8k tokens.

My training script is here: training on multi-gpu · GitHub.

My accelerate config file is:

In which compute environment are you running?
This machine
Which type of machine are you using?
multi-GPU
How many different machines will you use (use more than 1 for multi-node training)? [1]: 1
Should distributed operations be checked while running for errors? This can avoid timeout issues but will be slower. [yes/NO]: no
Do you wish to optimize your script with torch dynamo? [yes/NO]: no
Do you want to use DeepSpeed? [yes/NO]: yes
Do you want to specify a json file to a DeepSpeed config? [yes/NO]: no
What should be your DeepSpeed's ZeRO optimization stage?
3
Where to offload optimizer states?
cpu
Where to offload parameters?
cpu
How many gradient accumulation steps you're passing in your script? [1]: 2
Do you want to use gradient clipping? [yes/NO]: yes
What is the gradient clipping value? [1.0]: 1
Do you want to save 16-bit model weights when using ZeRO Stage-3? [yes/NO]: yes
Do you want to enable `deepspeed.zero.Init` when using ZeRO Stage-3 for constructing massive models? [yes/NO]: yes
How many GPU(s) should be used for distributed training? [1]: 3
Do you wish to use FP16 or BF16 (mixed precision)?
fp16
accelerate configuration saved at /root/.cache/huggingface/accelerate/default_config.yaml
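
For reference, the answers above should produce a default_config.yaml roughly like the following (a sketch; exact key names can vary between accelerate versions, so check the actual saved file):

```yaml
compute_environment: LOCAL_MACHINE
distributed_type: DEEPSPEED
deepspeed_config:
  gradient_accumulation_steps: 2
  gradient_clipping: 1.0
  offload_optimizer_device: cpu
  offload_param_device: cpu
  zero3_init_flag: true
  zero3_save_16bit_model: true
  zero_stage: 3
mixed_precision: fp16
num_machines: 1
num_processes: 3
```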

If this is normal, how much GPU memory would ideally be required to train this?
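
For rough sizing, here is a back-of-envelope estimate of the static memory for a 13B model under the config above (a sketch only; activation memory, which grows with sequence length and is often the real OOM culprit at 8k–16k tokens, comes on top of these figures):

```python
# Back-of-envelope memory estimate for fine-tuning a 13B-parameter model.
# These are approximations; real usage adds activations, temporary buffers,
# and allocator fragmentation.

PARAMS = 13e9      # llama2-13b parameter count (approximate)
GIB = 1024 ** 3

fp16_weights = PARAMS * 2 / GIB    # 2 bytes/param in fp16, ~24 GiB
fp16_grads = PARAMS * 2 / GIB      # gradients, same size, ~24 GiB
# Adam in mixed precision keeps fp32 master weights plus two fp32 moments:
optim_states = PARAMS * 12 / GIB   # ~145 GiB (offloaded to CPU in this config)

# ZeRO Stage-3 partitions weights and gradients across the data-parallel GPUs.
num_gpus = 3
per_gpu_static = (fp16_weights + fp16_grads) / num_gpus

print(f"fp16 weights:     {fp16_weights:6.1f} GiB")
print(f"fp16 gradients:   {fp16_grads:6.1f} GiB")
print(f"optimizer states: {optim_states:6.1f} GiB (on CPU with this config)")
print(f"static per GPU:   {per_gpu_static:6.1f} GiB before activations")
```

Note that with PEFT/LoRA only the adapter parameters carry gradients and optimizer states, so the gradient and optimizer figures above are an upper bound for full fine-tuning; activations still scale with sequence length, though, which is why long-context runs can OOM even with adapters.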