I’m trying to figure out whether it’s possible to fine-tune a 16B parameter model (CodeGen-16B-multi) on 2x A6000s (48GB each) and 256GB of RAM, using DeepSpeed to split the weights / gradients / optimizer states across the two GPUs if necessary. I have successfully fine-tuned the 6B version on this hardware, but at 16B I always run out of either CPU RAM or GPU memory, even at FP16 with a batch size of 1. So far I’ve tried:
- ZeRO Stage 2 (CUDA out of memory)
- ZeRO Stage 3 offloading both params and optimizer states to the CPU (runs out of CPU RAM)
- ZeRO Stage 3 offloading only the optimizer states to the CPU (runs out of CPU RAM)
- ZeRO Stage 3 offloading only the params to the CPU (CUDA out of memory)
- Using SGD instead of AdamW (CUDA out of memory)
- Using SGD and gradient checkpointing (CUDA out of memory)
(The last two are not specific to DeepSpeed but are included for completeness; the ZeRO-3 offload config I’ve been using is sketched below.)
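This is roughly the config I pass to run_clm.py via `--deepspeed` — representative rather than my exact file, with the "auto" values left for the HF Trainer integration to fill in:

```json
{
  "fp16": { "enabled": "auto" },
  "zero_optimization": {
    "stage": 3,
    "offload_optimizer": { "device": "cpu", "pin_memory": true },
    "offload_param": { "device": "cpu", "pin_memory": true },
    "overlap_comm": true,
    "contiguous_gradients": true,
    "stage3_max_live_parameters": 1e9,
    "stage3_max_reuse_distance": 1e9
  },
  "gradient_accumulation_steps": "auto",
  "gradient_clipping": "auto",
  "train_micro_batch_size_per_gpu": "auto"
}
```

(For the Stage 2 run and the "offload only params / only optimizer" runs I changed the stage and dropped the corresponding offload block.)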
I’m using the basic run_clm.py script (so the HF Trainer) with a small modification to load a pre-tokenized and chunked version of my data rather than having to do it at the start of training.
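The change itself is tiny — roughly this (the path and variable names are illustrative, not my exact code):

```python
from datasets import load_from_disk

# Hypothetical path: the dataset was tokenized and grouped into block_size
# chunks by a separate preprocessing script, so it already contains the
# input_ids / attention_mask / labels columns that run_clm.py would normally build.
tokenized_dataset_path = "data/codegen16b_tokenized_2048"

lm_datasets = load_from_disk(tokenized_dataset_path)
train_dataset = lm_datasets["train"]
eval_dataset = lm_datasets["validation"] if "validation" in lm_datasets else None
```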
Has anyone managed to train a model this large in HF with DeepSpeed? I saw someone else report that they got it to work with NVMe offload, but I’m hesitant to go down that route because of the much slower throughput and the extra write wear on the NVMe drive.
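If I did try it, my understanding (untested) is that it just means pointing the offload sections of the config at the drive, e.g. with a placeholder path:

```json
"offload_optimizer": { "device": "nvme", "nvme_path": "/local_nvme", "pin_memory": true },
"offload_param": { "device": "nvme", "nvme_path": "/local_nvme", "pin_memory": true }
```

but I’d rather avoid that if a CPU-only offload setup can be made to fit.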
I’m happy to share the exact scripts and command lines if they would be helpful!