I’m trying to figure out whether it’s possible to fine-tune a 16B parameter model (CodeGen-16B-multi) on 2x A6000s (48GB each) and 256GB of RAM, using DeepSpeed to split the weights / gradients / optimizer states across the two GPUs if necessary. I have successfully fine-tuned the 6B version on this hardware, but at 16B I always run out of either CPU RAM or GPU memory, even at FP16 with a batch size of 1. So far I’ve tried:
- ZeRO Stage 2 (CUDA out of memory)
- ZeRO Stage 3 offloading both params and optimizer states to the CPU (runs out of CPU RAM)
- ZeRO Stage 3 offloading only the optimizer states to the CPU (runs out of CPU RAM)
- ZeRO Stage 3 offloading only the params to the CPU (CUDA out of memory)
- Using SGD instead of AdamW (CUDA out of memory)
- Using SGD and gradient checkpointing (CUDA out of memory)
(The last two are not specific to DeepSpeed but are included for completeness; the ZeRO-3 offload config I’ve been using is sketched below.)
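This is roughly the config I pass to run_clm.py via `--deepspeed` — representative rather than my exact file, with the "auto" values left for the HF Trainer integration to fill in:

```json
{
  "fp16": { "enabled": "auto" },
  "zero_optimization": {
    "stage": 3,
    "offload_optimizer": { "device": "cpu", "pin_memory": true },
    "offload_param": { "device": "cpu", "pin_memory": true },
    "overlap_comm": true,
    "contiguous_gradients": true,
    "stage3_max_live_parameters": 1e9,
    "stage3_max_reuse_distance": 1e9
  },
  "gradient_accumulation_steps": "auto",
  "gradient_clipping": "auto",
  "train_micro_batch_size_per_gpu": "auto"
}
```

(For the Stage 2 run and the "offload only params / only optimizer" runs I changed the stage and dropped the corresponding offload block.)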
I’m using the basic run_clm.py script (so the HF Trainer) with a small modification to load a pre-tokenized and chunked version of my data rather than having to do it at the start of training.
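The change itself is tiny — roughly this (the path and variable names are illustrative, not my exact code):

```python
from datasets import load_from_disk

# Hypothetical path: the dataset was tokenized and grouped into block_size
# chunks by a separate preprocessing script, so it already contains the
# input_ids / attention_mask / labels columns that run_clm.py would normally build.
tokenized_dataset_path = "data/codegen16b_tokenized_2048"

lm_datasets = load_from_disk(tokenized_dataset_path)
train_dataset = lm_datasets["train"]
eval_dataset = lm_datasets["validation"] if "validation" in lm_datasets else None
```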
Has anyone managed to train a model this large in HF with DeepSpeed? I saw someone else report that they got it to work with NVMe offload, but I’m hesitant to go down that route because of the much slower throughput and the extra write wear on the NVMe drive.
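If I did try it, my understanding (untested) is that it just means pointing the offload sections of the config at the drive, e.g. with a placeholder path:

```json
"offload_optimizer": { "device": "nvme", "nvme_path": "/local_nvme", "pin_memory": true },
"offload_param": { "device": "nvme", "nvme_path": "/local_nvme", "pin_memory": true }
```

but I’d rather avoid that if a CPU-only offload setup can be made to fit.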
I’m happy to share the exact scripts and command lines if they would be helpful!