Is this normal? The only major parameter that might be affecting this is max_length. I am trying to fine-tune on two datasets, one with an average input size of 4k and another with an average input size of 8k.
My training script is here: training on multi-gpu · GitHub.
My accelerate config file is:
In which compute environment are you running?
This machine
Which type of machine are you using?
multi-GPU
How many different machines will you use (use more than 1 for multi-node training)? [1]: 1
Should distributed operations be checked while running for errors? This can avoid timeout issues but will be slower. [yes/NO]: no
Do you wish to optimize your script with torch dynamo?[yes/NO]:no
Do you want to use DeepSpeed? [yes/NO]: yes
Do you want to specify a json file to a DeepSpeed config? [yes/NO]: no
What should be your DeepSpeed's ZeRO optimization stage?
3
Where to offload optimizer states?
cpu
Where to offload parameters?
cpu
How many gradient accumulation steps you're passing in your script? [1]: 2
Do you want to use gradient clipping? [yes/NO]: yes
What is the gradient clipping value? [1.0]: 1
Do you want to save 16-bit model weights when using ZeRO Stage-3? [yes/NO]: yes
Do you want to enable `deepspeed.zero.Init` when using ZeRO Stage-3 for constructing massive models? [yes/NO]: yes
How many GPU(s) should be used for distributed training? [1]:3
Do you wish to use FP16 or BF16 (mixed precision)?
fp16
accelerate configuration saved at /root/.cache/huggingface/accelerate/default_config.yaml
If this is normal, how much GPU memory would ideally be required to train this?
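For reference, here is a rough back-of-the-envelope sketch of the per-GPU memory taken by model states under ZeRO stage 3. It assumes mixed-precision Adam with the usual 16 bytes per parameter (2 for fp16 weights, 2 for fp16 gradients, 12 for fp32 optimizer states), split evenly across GPUs. The function name and the parameter count in the example are hypothetical, and gradients are counted as staying on the GPU, which is a simplification. It does not cover activations, which grow with max_length and are probably what matters most here once parameters and optimizer states are offloaded to CPU.

```python
def zero3_model_state_gib(num_params, num_gpus,
                          offload_params=False, offload_optimizer=False):
    """Rough per-GPU memory (GiB) for model states under ZeRO stage 3.

    Hypothetical helper: assumes mixed-precision Adam and an even split
    of all model states across GPUs. Activations are NOT included.
    """
    param_bytes = 2    # fp16 weights
    grad_bytes = 2     # fp16 gradients (assumed to stay on GPU)
    optim_bytes = 12   # fp32 master weights + Adam momentum + Adam variance

    per_param = grad_bytes
    if not offload_params:
        per_param += param_bytes
    if not offload_optimizer:
        per_param += optim_bytes

    # ZeRO-3 partitions every model state across the GPUs.
    return num_params * per_param / num_gpus / 2**30


if __name__ == "__main__":
    n = 7_000_000_000  # hypothetical 7B-parameter model, not from my setup
    print(f"no offload:   {zero3_model_state_gib(n, 3):.1f} GiB per GPU")
    print(f"cpu offload:  {zero3_model_state_gib(n, 3, True, True):.1f} GiB per GPU")
```

Whatever this leaves free on each GPU is what is available for activations, so doubling the input size from 4k to 8k should be expected to need noticeably more memory.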