I have been trying to fine-tune the facebook/opt-13b model using the run_clm.py script in transformers/examples/pytorch/language-modeling. I am using 8 x 80 GB A100s on Paperspace.
The script works well for fine-tuning smaller models.
I keep running into: RuntimeError: CUDA out of memory. Tried to allocate X MiB (GPU 0; 78.0 GiB total capacity; X GiB already allocated; X MiB free; X GiB cached)
This happens before any training has begun.
I have tried setting the batch size to 1, enabling fp16, using a high number of gradient accumulation steps, and using a very low block size, yet training still refuses to start.
I believe I should have enough VRAM to fine-tune this model. Is there anything else I should look into? Would integrating DeepSpeed into the run_clm.py script help?
You won't be able to fine-tune such a large model without some form of sharding for the optimizer states and gradients: with Adam in mixed precision, a 13B-parameter model needs roughly 16 bytes per parameter (around 200 GB) for weights, gradients, and optimizer states, far more than a single 80 GB GPU can hold. You should look into the DeepSpeed integration to use ZeRO-2 at least.
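If it helps, here is a minimal sketch of a ZeRO-2 config for the Trainer integration; the file name ds_config_zero2.json and the exact options are just examples, and the "auto" values are filled in from the Trainer's own arguments:

import json

# Minimal DeepSpeed ZeRO-2 config sketch: shards optimizer states and
# gradients across the 8 GPUs and offloads optimizer states to CPU RAM.
ds_config = {
    "fp16": {"enabled": "auto"},
    "zero_optimization": {
        "stage": 2,
        "offload_optimizer": {"device": "cpu"},
    },
    "train_micro_batch_size_per_gpu": "auto",
    "gradient_accumulation_steps": "auto",
}

with open("ds_config_zero2.json", "w") as f:
    json.dump(ds_config, f, indent=2)

# Then launch the example script with the config, e.g.:
#   deepspeed run_clm.py --deepspeed ds_config_zero2.json \
#       --model_name_or_path facebook/opt-13b --fp16 ...

Note that offloading the optimizer to CPU trades GPU memory for host RAM, so check how much system memory your node actually has.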
Would you be able to have a look at this set up to see if there is anything you would improve because training is very expensive and I want to fix any obvious errors before starting!
Hi @anujn, may I know how much RAM you used? According to DeepSpeed's memory estimator, it needs 581.15 GB per CPU.

from deepspeed.runtime.zero.stage_1_and_2 import estimate_zero2_model_states_mem_needs_all_cold

estimate_zero2_model_states_mem_needs_all_cold(total_params=13e9, num_gpus_per_node=8, num_nodes=1)
Here is the result. The numbers seem a little crazy if I want to train an even bigger OPT model.
Estimated memory needed for params, optim states and gradients for a:
HW: Setup with 1 node, 8 GPUs per node.
SW: Model with 13000M total params.
per CPU | per GPU | Options
581.15GB | 24.21GB | offload_optimizer=cpu
581.15GB | 72.64GB | offload_optimizer=none
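If you want to plan for a bigger OPT model, you can compare against the ZeRO-3 estimator, which shards the parameters themselves as well. A sketch, where largest_layer_params=257e6 is my own estimate for OPT-13b's embedding matrix (50272 vocab x 5120 hidden), not a number from DeepSpeed:

from deepspeed.runtime.zero.stage3 import estimate_zero3_model_states_mem_needs_all_cold

# ZeRO-3 also partitions the fp16 parameters across GPUs;
# largest_layer_params sizes the temporary gather buffers.
estimate_zero3_model_states_mem_needs_all_cold(
    total_params=13e9,
    largest_layer_params=257e6,  # assumed: OPT-13b embedding matrix
    num_gpus_per_node=8,
    num_nodes=1,
)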