Fine-tuning a 16B CodeGen model with 256GB RAM+2xA6000s?

I’m trying to figure out if it’s possible to fine-tune a 16B parameter model (CodeGen-16B-multi) on 2x A6000s (48GB each) and 256GB of RAM, using DeepSpeed to split the weights / gradients / optimizer states across the two GPUs if necessary. I have successfully fine-tuned the 6B version using this hardware setup, but at 16B I always run out of either RAM or GPU memory even at FP16 and batch size = 1. So far I’ve tried:

  • ZeRO Stage 2 (CUDA out of memory)
  • ZeRO Stage 3 offloading both params and optimizer states to the CPU (runs out of CPU RAM)
  • ZeRO Stage 3 offloading only the optimizer states to the CPU (runs out of CPU RAM)
  • ZeRO Stage 3 offloading only the params to the CPU (CUDA out of memory)
  • Using SGD instead of AdamW (CUDA out of memory)
  • Using SGD and gradient checkpointing (CUDA out of memory)

(The last two are not related to DeepSpeed but are included for completeness.)
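For anyone who wants to reproduce the three ZeRO Stage 3 variants above, here is a minimal sketch of how the DeepSpeed config could be generated. This is not my exact config: the helper name `build_zero3_config` is made up for illustration, and the `"auto"` values assume the HF Trainer integration fills them in from the command-line arguments.

```python
import json


def build_zero3_config(offload_params: bool, offload_optimizer: bool) -> dict:
    """Build a ZeRO Stage 3 config dict (hypothetical helper, for illustration)."""
    zero = {
        "stage": 3,
        "overlap_comm": True,
        "contiguous_gradients": True,
        # Gather the sharded 16-bit weights into a single state dict on save.
        "stage3_gather_16bit_weights_on_model_save": True,
    }
    if offload_optimizer:
        zero["offload_optimizer"] = {"device": "cpu", "pin_memory": True}
    if offload_params:
        zero["offload_param"] = {"device": "cpu", "pin_memory": True}
    return {
        # "auto" lets the HF Trainer copy the value from its own arguments.
        "fp16": {"enabled": "auto"},
        "zero_optimization": zero,
        "train_micro_batch_size_per_gpu": "auto",
        "gradient_accumulation_steps": "auto",
        "gradient_clipping": "auto",
    }


# The three Stage 3 variants from the list above.
variants = {
    "params + optimizer on CPU": build_zero3_config(True, True),
    "optimizer only on CPU": build_zero3_config(False, True),
    "params only on CPU": build_zero3_config(True, False),
}

for name, cfg in variants.items():
    print(f"# {name}")
    print(json.dumps(cfg, indent=2))
```

Each printed JSON can be saved to a file (say `ds_config.json`, the name is arbitrary) and passed to run_clm.py with `--deepspeed ds_config.json`, launching with `deepspeed --num_gpus=2 run_clm.py ...`.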

I’m using the basic run_clm.py script (so the HF Trainer) with a small modification to load a pre-tokenized and chunked version of my data rather than having to do it at the start of training.

Has anyone managed to train a model this large in HF with DeepSpeed? I saw someone else report that they got it to work with NVMe offload, but I am hesitant to go this route because of the much slower speeds and the increased wear and tear on the NVMe drive.

I’m happy to share the exact scripts and command lines if they would be helpful!

Based on estimate_zero3_model_states_mem_needs_all_live, I think it is just not possible with only 256GB of RAM without using NVMe offload:

Estimated memory needed for params, optim states and gradients for a:
HW: Setup with 1 node, 2 GPUs per node.
SW: Model with 16032M total params, 314M largest layer params.
  per CPU  |  per GPU |   Options
  403.14GB |   1.17GB | offload_param=cpu , offload_optimizer=cpu , zero_init=1
  403.14GB |   1.17GB | offload_param=cpu , offload_optimizer=cpu , zero_init=0
  358.35GB |  16.10GB | offload_param=none, offload_optimizer=cpu , zero_init=1
  358.35GB |  16.10GB | offload_param=none, offload_optimizer=cpu , zero_init=0
    3.52GB | 135.55GB | offload_param=none, offload_optimizer=none, zero_init=1
  179.17GB | 135.55GB | offload_param=none, offload_optimizer=none, zero_init=0
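
The numbers in the table can be reproduced with some simple arithmetic. The sketch below is my reading of what the estimator does, not an official formula: the per-parameter byte counts (18, 16, 2 and 4) and the 1.5x buffer factor are assumptions that I checked only against the output above, and the function name `zero3_estimate` is made up.

```python
GIB = 2 ** 30


def zero3_estimate(total_params, largest_layer_params,
                   num_gpus_per_node=2, num_nodes=1,
                   offload_optimizer=True, offload_param=True,
                   zero_init=True, buffer_factor=1.5):
    """Return (cpu_bytes, gpu_bytes) per node / per GPU for ZeRO Stage 3.

    Assumed byte counts: 18 bytes/param for weights + grads + Adam states,
    16 bytes/param when the 16-bit weights stay on the GPU.
    """
    total_gpus = num_gpus_per_node * num_nodes
    largest_layer_bytes = 4 * largest_layer_params

    if offload_optimizer:
        per_param = 18 if offload_param else 16
        gpu = largest_layer_bytes
        if not offload_param:
            # 16-bit weights remain on the GPUs, sharded across them.
            gpu += 2 * total_params // total_gpus
        if zero_init:
            cpu = total_params * per_param / num_nodes * buffer_factor
        else:
            # Without zero_init every rank first loads a full copy of the model.
            cpu = total_params * max(4 * num_gpus_per_node,
                                     per_param / num_nodes) * buffer_factor
    else:
        gpu = largest_layer_bytes + 18 * total_params // total_gpus
        loaded = largest_layer_params if zero_init else total_params
        cpu = loaded * 4 * num_gpus_per_node * buffer_factor
    return int(cpu), int(gpu)


# Values from the "SW:" line above (rounded to millions, so expect small drift).
TOTAL, LARGEST = 16_032_000_000, 314_000_000

for opt, par in [(True, True), (True, False), (False, False)]:
    cpu, gpu = zero3_estimate(TOTAL, LARGEST,
                              offload_optimizer=opt, offload_param=par)
    print(f"offload_optimizer={opt}, offload_param={par}: "
          f"{cpu / GIB:.2f}GB CPU, {gpu / GIB:.2f}GB per GPU")
```

With any CPU offload, the estimate is at least 358GB of CPU RAM, well above my 256GB; with no offload, it is about 135GB per GPU, well above the 48GB on each A6000. That matches the failures I saw.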

Hopefully estimate_zero3_model_states_mem_needs_all_live will help someone else estimate what’s possible on their system 🙂