Skipped batches do not consider distributed training

I am confused about how distributed training is reflected in the Trainer logs.

I enabled 2 GPUs for my previous training run, with the training config below:

from transformers import TrainingArguments

output_dir = model_output_dir
per_device_train_batch_size = 6
gradient_accumulation_steps = 8
optim = "paged_adamw_32bit"
save_steps = 50
save_total_limit = 3
logging_steps = 10
learning_rate = 2e-4
max_grad_norm = 0.3
warmup_ratio = 0.03
lr_scheduler_type = "cosine_with_restarts"
max_steps = 8000
group_by_length = True

training_arguments = TrainingArguments(
    output_dir=output_dir,
    per_device_train_batch_size=per_device_train_batch_size,
    gradient_accumulation_steps=gradient_accumulation_steps,
    optim=optim,
    save_steps=save_steps,
    save_total_limit=save_total_limit,
    logging_steps=logging_steps,
    learning_rate=learning_rate,
    fp16=True,
    max_grad_norm=max_grad_norm,
    max_steps=max_steps,
    warmup_ratio=warmup_ratio,
    # group_by_length=group_by_length,
    lr_scheduler_type=lr_scheduler_type,
    report_to="tensorboard",
)

When I resumed training from checkpoint-350, the following stats were printed:

Loading model from ../data/models/falcon_7b_v0/checkpoint-350.
***** Running training *****
  Num examples = 522,319
  Num Epochs = 1
  Instantaneous batch size per device = 6
  Total train batch size (w. parallel, distributed & accumulation) = 48
  Gradient Accumulation steps = 8
  Total optimization steps = 8,000
  Number of trainable parameters = 130,547,712
  Continuing training from checkpoint, will skip to saved global_step
  Continuing training from epoch 0
  Continuing training from global step 350
  Will skip the first 0 epochs then the first 2800 batches in the first epoch.

My confusion is twofold. First, I expected "Total train batch size (w. parallel, distributed & accumulation)" to be 6 (batch size) * 8 (accumulation steps) * 2 (number of GPUs) = 96, but it is reported as 48. Second, I expected the skipped-batch count ("Will skip the first 0 epochs then the first 2800 batches in the first epoch.") to be 350 (steps) * 8 (accumulation) * 2 (GPUs) = 5600 instead of 2800.
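To make the arithmetic explicit, here is a minimal sketch of the two calculations for one and two processes. The helper names are my own and not part of transformers, and the idea that the skipped-batch count scales with the number of GPUs is my assumption, not something I have confirmed from the Trainer source.

```python
# Hypothetical helpers (not a transformers API) that reproduce the
# arithmetic behind the two numbers in the resume log.

def total_train_batch_size(per_device: int, accumulation: int, world_size: int) -> int:
    """Samples consumed per optimizer step across all processes."""
    return per_device * accumulation * world_size


def skipped_batches(global_step: int, accumulation: int, world_size: int = 1) -> int:
    """Batches skipped on resume, ASSUMING the count scales with world_size."""
    return global_step * accumulation * world_size


per_device = 6      # per_device_train_batch_size
accumulation = 8    # gradient_accumulation_steps
global_step = 350   # checkpoint-350

for world_size in (1, 2):
    print(
        f"world_size={world_size}: "
        f"total batch size={total_train_batch_size(per_device, accumulation, world_size)}, "
        f"skipped batches={skipped_batches(global_step, accumulation, world_size)}"
    )
```

The logged values (48 and 2800) match the world_size=1 row, while I expected the world_size=2 row (96 and 5600).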

I am using the TRL trainer below to train the model:

from trl import SFTTrainer

max_seq_length = 1024

trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    args=training_arguments,
    peft_config=peft_config,
    train_dataset=math_qa_dataset["train"],
    formatting_func=dataset_formatting_func,
    max_seq_length=max_seq_length,
    data_collator=math_qa_data_collator,
    packing=True,
)

Could someone check whether the printed statistics are wrong, or whether I am not actually using multiple GPUs and distributed training here?
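In case it helps with diagnosing this, below is a small check that could be run at the top of the training script. It is only a sketch: it assumes the job is started with a launcher such as torchrun, which sets WORLD_SIZE, RANK and LOCAL_RANK for each process, and describe_launch is a hypothetical helper name of mine, not a library function.

```python
import os


def describe_launch(env=None):
    """Summarise the distributed-launch environment variables.

    Assumes a torchrun-style launcher; when the variables are absent the
    script was most likely started as a single plain Python process.
    """
    env = os.environ if env is None else env
    world_size = int(env.get("WORLD_SIZE", "1"))
    return {
        "world_size": world_size,
        "rank": int(env.get("RANK", "0")),
        "local_rank": int(env.get("LOCAL_RANK", "-1")),
        # None means the variable is unset, so all GPUs are visible.
        "cuda_visible_devices": env.get("CUDA_VISIBLE_DEVICES"),
        "launched_distributed": world_size > 1,
    }


if __name__ == "__main__":
    print(describe_launch())
```

Printing training_arguments.n_gpu and training_arguments.world_size in the same script should additionally show how many GPUs and processes the Trainer itself believes it has.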