I am a bit confused about distributed training right now.
I enabled 2 GPUs in a previous training run with the training config below (a quick GPU-visibility check is sketched right after it):
from transformers import TrainingArguments

output_dir = model_output_dir
per_device_train_batch_size = 6
gradient_accumulation_steps = 8
optim = "paged_adamw_32bit"
save_steps = 50
save_total_limit = 3
logging_steps = 10
learning_rate = 2e-4
max_grad_norm = 0.3
warmup_ratio = 0.03
lr_scheduler_type = "cosine_with_restarts"
max_steps = 8000
group_by_length = True

training_arguments = TrainingArguments(
    output_dir=output_dir,
    per_device_train_batch_size=per_device_train_batch_size,
    gradient_accumulation_steps=gradient_accumulation_steps,
    optim=optim,
    save_steps=save_steps,
    save_total_limit=save_total_limit,
    logging_steps=logging_steps,
    learning_rate=learning_rate,
    fp16=True,
    max_grad_norm=max_grad_norm,
    max_steps=max_steps,
    warmup_ratio=warmup_ratio,
    # group_by_length=group_by_length,
    lr_scheduler_type=lr_scheduler_type,
    report_to="tensorboard",
)
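This is roughly how I am confirming that both GPUs are visible to the process before training; just a minimal sketch, assuming torch.cuda.device_count() and the CUDA_VISIBLE_DEVICES environment variable are the right things to look at.

import os
import torch

# How many CUDA devices this process can actually see
print("visible GPUs:", torch.cuda.device_count())
# Whether visibility is being restricted by the environment (may be unset)
print("CUDA_VISIBLE_DEVICES:", os.environ.get("CUDA_VISIBLE_DEVICES"))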
Now, when I resume training from checkpoint-350, the following stats are printed:
Loading model from ../data/models/falcon_7b_v0/checkpoint-350.
***** Running training *****
Num examples = 522,319
Num Epochs = 1
Instantaneous batch size per device = 6
Total train batch size (w. parallel, distributed & accumulation) = 48
Gradient Accumulation steps = 8
Total optimization steps = 8,000
Number of trainable parameters = 130,547,712
Continuing training from checkpoint, will skip to saved global_step
Continuing training from epoch 0
Continuing training from global step 350
Will skip the first 0 epochs then the first 2800 batches in the first epoch.
My confusion: the printed "Total train batch size (w. parallel, distributed & accumulation) = 48" should, as I understand it, be 6 (per-device batch size) * 8 (accumulation steps) * 2 (number of GPUs) = 96, and the skipped-batch count ("Will skip the first 0 epochs then the first 2800 batches in the first epoch.") should be 350 (steps) * 8 (accumulation) * 2 (GPUs) = 5,600 rather than 2,800.
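To make the arithmetic explicit, this is the calculation I am assuming (plain Python, numbers copied from the config above):

per_device_batch_size = 6
grad_accum_steps = 8
num_gpus = 2  # what I believe I enabled

# Effective batch size I expected to see in the log
expected_total_batch = per_device_batch_size * grad_accum_steps * num_gpus  # 96
# Value actually printed, which would correspond to a single process
printed_total_batch = per_device_batch_size * grad_accum_steps * 1          # 48

# Batches I expected to be skipped when resuming from step 350
resumed_step = 350
expected_skipped = resumed_step * grad_accum_steps * num_gpus  # 5600
printed_skipped = resumed_step * grad_accum_steps              # 2800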
I am using the TRL SFTTrainer below to train the model.
from trl import SFTTrainer

max_seq_length = 1024

trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    args=training_arguments,
    peft_config=peft_config,
    train_dataset=math_qa_dataset["train"],
    formatting_func=dataset_formatting_func,
    max_seq_length=max_seq_length,
    data_collator=math_qa_data_collator,
    packing=True,
)
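For completeness, I could also print what the trainer itself has resolved and compare it with the log line above. This is only a sketch, assuming the n_gpu, world_size and train_batch_size properties on the args are the right things to inspect:

# Compare the trainer's own view of parallelism with the printed log
args = trainer.args
print("n_gpu:", args.n_gpu)                        # GPUs used by this single process
print("world_size:", args.world_size)              # number of distributed processes
print("per-process train batch size:", args.train_batch_size)
print("effective batch size:",
      args.train_batch_size * args.gradient_accumulation_steps * args.world_size)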
Could someone check whether these statistics are simply being printed incorrectly, or whether I am not actually utilising both GPUs and distributed training here?