Hi there,
I am trying to use Accelerate to distribute my training across multiple GPUs, and I noticed that the training loss wasn't decreasing as fast as it should. To debug, I ran `accelerate config` again and created the most basic configuration possible, without distributed training. Here it is:
```yaml
compute_environment: LOCAL_MACHINE
debug: false
distributed_type: 'NO'
downcast_bf16: 'no'
gpu_ids: '3'
machine_rank: 0
main_training_function: main
mixed_precision: 'no'
num_machines: 1
num_processes: 1
rdzv_backend: static
same_network: true
tpu_env: []
tpu_use_cluster: false
tpu_use_sudo: false
use_cpu: false
```
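For completeness, the Accelerator and training objects are set up roughly like this (a simplified sketch: the optimizer and learning rate shown here are placeholders, and passing `args.acc` as `gradient_accumulation_steps` reflects how the rest of this post uses those two values):

```python
import torch
from accelerate import Accelerator

# args.acc is the number of gradient accumulation steps referenced throughout this post
accelerator = Accelerator(gradient_accumulation_steps=args.acc)
device = accelerator.device

# model and llavar_dataloader are built earlier in the script
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)  # placeholder optimizer / learning rate
model, optimizer, llavar_dataloader = accelerator.prepare(model, optimizer, llavar_dataloader)
```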
I still seem to be facing the same problem: the training loss decreases much more slowly than in a version of my script that uses only PyTorch. Here is the first version of the training loop I used with Accelerate:
```python
for epoch in range(1):
    model.train()
    dropout_modules = [module for module in model.lang_encoder.modules() if isinstance(module, torch.nn.Dropout)]
    [module.eval() for module in dropout_modules]  # disable dropout
    accelerator.print(f"Disabled {len(dropout_modules)} Dropout modules")
    for i, batch in enumerate(tqdm(llavar_dataloader, disable=not accelerator.is_local_main_process)):
        with accelerator.accumulate(model):
            outputs = model(vision_x=batch["vision_x"], lang_x=batch["lang_x"].to(device), labels=batch["labels"].to(device), device=device)
            loss = outputs.loss
            if accelerator.is_local_main_process:
                total_loss += float(loss.item())
            accelerator.backward(loss)
            if accelerator.sync_gradients:
                accelerator.clip_grad_norm_(model.parameters(), 1.0)
            optimizer.step()
            model.zero_grad()
        if ((i + 1) % accelerator.gradient_accumulation_steps == 0 or i + 1 == len(llavar_dataloader)) and accelerator.is_local_main_process:
            with open("llavar_" + args.file, "a") as f:
                f.write(f"Epoch: {epoch}, Loss: {total_loss/args.acc}\n")
            total_loss = 0  # Reset accumulated loss
```
Here is the problem: the training loss of both versions, Accelerate and PyTorch, is exactly the same on the first iteration. After that, the loss decreases much faster in the pure PyTorch script. Note that I am seeding both runs, which is why the first iteration matches exactly…
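For reference, both scripts fix the RNG state at the top, roughly like this (the seed value and the exact calls are placeholders; the point is just that both runs are seeded):

```python
import random
import numpy as np
import torch
from accelerate.utils import set_seed

SEED = 42  # placeholder value

# Accelerate script: one helper seeds the python, numpy and torch RNGs
set_seed(SEED)

# pure PyTorch script: seed the same RNGs by hand
random.seed(SEED)
np.random.seed(SEED)
torch.manual_seed(SEED)
torch.cuda.manual_seed_all(SEED)
```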
To debug, I decided to remove the `accelerator.accumulate` call and build my own gradient accumulation flow, like this:
```python
for epoch in range(1):
    model.train()
    dropout_modules = [module for module in model.lang_encoder.modules() if isinstance(module, torch.nn.Dropout)]
    [module.eval() for module in dropout_modules]  # disable dropout
    accelerator.print(f"Disabled {len(dropout_modules)} Dropout modules")
    for i, batch in enumerate(tqdm(llavar_dataloader, disable=not accelerator.is_local_main_process)):
        outputs = model(vision_x=batch["vision_x"], lang_x=batch["lang_x"].to(device), labels=batch["labels"].to(device), device=device)
        loss = outputs.loss
        if accelerator.is_local_main_process:
            total_loss += float(loss.item())
        accelerator.backward(loss / args.acc)
        if ((i + 1) % accelerator.gradient_accumulation_steps == 0 or i + 1 == len(llavar_dataloader)) and accelerator.is_local_main_process:
            with open("llavar_" + args.file, "a") as f:
                f.write(f"Epoch: {epoch}, Loss: {total_loss/args.acc}\n")
            total_loss = 0  # Reset accumulated loss
            accelerator.clip_grad_norm_(model.parameters(), 1.0)
            optimizer.step()
            model.zero_grad()
```
The results were very similar to those of my PyTorch script when I removed gradient clipping from it. I also tested commenting out the `accelerator.clip_grad_norm_(model.parameters(), 1.0)` line, and the results were exactly the same. Note that adding gradient clipping in my PyTorch script does change the results a bit, and there I also apply it right before calling `optimizer.step()`.
It looks like this issue has something to do with how gradients are accumulated/clipped, but I can’t seem to pinpoint the problem…
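To make that concrete, one check I'm considering (just a sketch, not something in either loop above) is logging the total gradient norm right before the clipping call in both scripts and comparing:

```python
import torch

def total_grad_norm(model):
    # Same quantity that clip_grad_norm_ computes (and returns) before scaling:
    # the global L2 norm over all parameter gradients.
    norms = [p.grad.detach().norm(2) for p in model.parameters() if p.grad is not None]
    return torch.norm(torch.stack(norms), 2).item()

# e.g. right before clip_grad_norm_ / optimizer.step():
# print(f"step {i}: grad norm = {total_grad_norm(model):.4f}")
```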
UPDATE: switching `accelerator.clip_grad_norm_` for the regular `torch.nn.utils.clip_grad_norm_` does nothing to the training loss. Any form of gradient clipping (or none at all) yields exactly the same results under Accelerate, and that training loss matches my pure PyTorch script without gradient clipping; adding gradient clipping to the PyTorch script, however, further improves the training loss…
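The swap I tested looks like this (in the same spot where the clipping call sits in the loops above, right before `optimizer.step()`):

```python
# accelerator.clip_grad_norm_(model.parameters(), 1.0)   # original call
torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)  # swapped in: the loss curve is unchanged
```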
Also, here is my pure PyTorch training loop/script:
```python
for epoch in range(1):
    model.train()
    dropout_modules = [module for module in model.lang_encoder.modules() if isinstance(module, torch.nn.Dropout)]
    [module.eval() for module in dropout_modules]  # disable dropout
    print(f"Disabled {len(dropout_modules)} Dropout modules")
    for i, batch in enumerate(tqdm(coco_dataloader)):
        # Assume batch['image'] and batch['text'] are image and text tensors respectively
        outputs = model(vision_x=batch["vision_x"], lang_x=batch["lang_x"].to(f"cuda:{args.cuda}"), labels=batch["labels"].to(f"cuda:{args.cuda}"), device=f"cuda:{args.cuda}")
        loss = outputs.loss / num_accumulation_steps
        loss.backward()
        if (i + 1) % num_accumulation_steps != 0:
            total_loss = loss.item() + total_loss
        # done once every num_accumulation_steps (5) steps
        elif (i + 1) % num_accumulation_steps == 0:
            # Normalize loss
            total_loss = total_loss + loss.item()
            # parameters updated
            torch.nn.utils.clip_grad_norm_(model.parameters(), 1)
            optimizer.step()
            # Reset gradient tensors
            model.zero_grad()
            with open("llavar_" + args.file, "a") as f:
                f.write(f"Epoch: {epoch}, Loss: {total_loss}\n")
            total_loss = 0
```