I'm trying to use the `accelerate` library to parallelize my model training, but I'm having trouble using it when training models in fp16. If I load the model with `torch_dtype=torch.float16`, I get `ValueError: Attempting to unscale FP16 gradients.`. But if I don't load the model in half precision, I get a CUDA out of memory error. Below are the details of the problem:
I'm fine-tuning a 2.7B CLM, the `stanford-crfm/BioMedLM` model, on one A100 - 40GB GPU (I will eventually work with a much larger model, but I'm using this one to test my training process and make sure everything works as expected). I initially started with a training script that uses neither `accelerate` nor `Trainer`. I can successfully train the model when I load it in half precision with:
```python
# here device = 'cuda'
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.float16).to(device)
```
I have to load the model in half precision, otherwise I get a CUDA out of memory error. I simplified my script and uploaded it here as a demonstration. When the model is loaded in half precision, training uses about 27GB of the 40GB of GPU memory, so there is plenty of room left.
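For context, here is a minimal sketch of that non-`accelerate` training loop (not my exact script, which is in the Gist; the optimizer settings and `train_dataloader` are placeholders):

```python
import torch
from transformers import AutoModelForCausalLM

model_name = "stanford-crfm/BioMedLM"
device = "cuda"

# load directly in half precision so the 2.7B model fits in 40GB
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.float16).to(device)
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)  # placeholder hyperparameters

model.train()
for batch in train_dataloader:  # train_dataloader is built elsewhere in the script
    batch = {k: v.to(device) for k, v in batch.items()}
    loss = model(**batch).loss
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
```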
Now I want to use the `accelerate` library (potentially with `deepspeed` for larger models) in my training script. I made the following changes:
```python
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.float16)
accelerator = Accelerator(cpu=False, mixed_precision='fp16')
...
model, optimizer, train_dataloader, eval_dataloader = accelerator.prepare(
    model, optimizer, train_dataloader, eval_dataloader
)
...
# in the training loop, I updated `loss.backward()` to:
accelerator.backward(loss)
```
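Put together, the training loop then looks roughly like this (again a sketch with placeholder names, not the exact script):

```python
import torch
from accelerate import Accelerator
from transformers import AutoModelForCausalLM

accelerator = Accelerator(cpu=False, mixed_precision='fp16')

model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.float16)
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)  # placeholder hyperparameters

model, optimizer, train_dataloader, eval_dataloader = accelerator.prepare(
    model, optimizer, train_dataloader, eval_dataloader
)

model.train()
for batch in train_dataloader:
    loss = model(**batch).loss    # accelerate moves the batch to the right device
    accelerator.backward(loss)    # replaces loss.backward()
    optimizer.step()
    optimizer.zero_grad()
```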
Here is the updated script. I also ran `accelerate config`; the resulting `default_config.yaml` can be found in the same Gist.
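For reference, a single-GPU fp16 config generated by `accelerate config` typically contains something like the following (the exact file I'm using is in the Gist):

```yaml
compute_environment: LOCAL_MACHINE
distributed_type: 'NO'
mixed_precision: fp16
num_machines: 1
num_processes: 1
use_cpu: false
```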
Now when I launch the script on the same machine with `accelerate launch --fp16 <script_path>`, I get the error `ValueError: Attempting to unscale FP16 gradients.`. So I removed `torch_dtype=torch.float16` from the model loading and relied on `accelerate` to downcast the model weights to half precision, but then I get a CUDA out of memory error.
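In other words, the variant that runs out of memory only changes the loading line, leaving the half-precision handling to the `mixed_precision='fp16'` setting above:

```python
# no torch_dtype argument, so the weights are loaded in fp32
model = AutoModelForCausalLM.from_pretrained(model_name)
```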
To summarize:

- I can train the model successfully when loading it with `torch_dtype=torch.float16` and not using `accelerate`.
- When using `accelerate`, I cannot load the model with `torch_dtype=torch.float16`. It gives `ValueError: Attempting to unscale FP16 gradients.`.
- If I don't load the model with `torch_dtype=torch.float16` when using `accelerate`, I get a CUDA out of memory error.
So my question is: how can I train the model on a single A100 - 40GB GPU with `accelerate` and fp16 mixed precision?