Is the Trainer slower than customised loops?

Hello, I have written two scripts that attempt to implement the same training loop: one customised loop and one using the Trainer. The customised loop finished in approximately 6 hours, whereas the Trainer loop has been running for 16 hours and the progress bar only reports 21% progress. Both were executed via accelerate launch --gpu_ids 0,1,2,3 train_t5.py config_dict.json. Since I am self-taught and new to these libraries and concepts, I am likely missing some configuration in the Trainer that would make the loop run as fast as (if not faster than) the customised version. If not, what extra work does the Trainer do that causes it to be slower, and how can I turn it off? Thanks for any help or improvements you can suggest to either of the two scripts.

The customised version is:

def prepare_for_multi_train(model, tokenizer, train_data, valid_data, accelerator, batch_size=8):
    # Dataloaders
    data_collator = DataCollatorForSeq2Seq(tokenizer=tokenizer, model=model, padding="max_length", max_length=model.config.n_positions)
    train_dataloader = DataLoader(train_data, batch_size=batch_size, shuffle=False, collate_fn=data_collator)
    valid_dataloader = DataLoader(valid_data, batch_size=batch_size, shuffle=False, collate_fn=data_collator)

    # Optimizer and Scheduler
    optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5, weight_decay=0.01)
    lr_scheduler = get_scheduler("constant", optimizer=optimizer)

    # Accelerate them
    train_dataloader = accelerator.prepare(train_dataloader)
    valid_dataloader = accelerator.prepare(valid_dataloader)
    log_dataloading(train_dataloader, accelerator)
    model, optimizer, lr_scheduler = accelerator.prepare(
            model, optimizer, lr_scheduler
    )
    return train_dataloader, valid_dataloader, model, optimizer, lr_scheduler

def load_model_tok_data(accelerator, config_dict):
    accelerator.wait_for_everyone()
    if accelerator.is_main_process:
        # Tokenizer
        tokenizer = tokops.get_trained_tokenizer(config_dict, making_dirs=True)
        vocab_size = len(tokenizer)

        # Data
        train_kwargs = {
            "tokenizer": tokenizer,
            "json_data_dir": config_dict["data_dir"],
            "split": "train",
            "data_format": config_dict["data_format"]
        }
        train_data = IterableDataset.from_generator(
            tokops.t5_tokked_model_inputs,
            gen_kwargs=train_kwargs
        )
        valid_kwargs = train_kwargs.copy()
        valid_kwargs["split"] = "valid"
        valid_data = IterableDataset.from_generator(
            tokops.t5_tokked_model_inputs,
            gen_kwargs=valid_kwargs
        )

        # Model
        model = get_model(config_dict, vocab_size)
    else:
        tokenizer = None
        train_data, valid_data = None, None
        model = None

    accelerator.wait_for_everyone()
    tokenizer = broadcast_object_list([tokenizer])[0]
    train_data = broadcast_object_list([train_data])[0]
    valid_data = broadcast_object_list([valid_data])[0]
    model = broadcast_object_list([model])[0]    
    logging.info(f"{accelerator.process_index}: Successfully broadcasted data, the evidence is that the type of model is {type(model)}")
    return model, tokenizer, train_data, valid_data

def record(metrics, locals, accelerator, save_file="metrics.json"):
    accelerator.wait_for_everyone()
    local_stats = torch.tensor([locals["loss_sum"], locals["corrects_sum"], locals["valid_toks"], locals["train_step"]], device=accelerator.device)
    global_loss, global_corrects_sum, global_valid_toks, global_train_step = accelerator.reduce(local_stats, reduction="sum")
    if accelerator.is_main_process:
        avg_loss = global_loss.item() / global_train_step.item()
        metrics["loss"].append(avg_loss)
        metrics["accuracy"].append(global_corrects_sum.item() / global_valid_toks.item())
        metrics["steps"].append(global_train_step.item())
        logging.info(f"Current step's ({locals['train_step']}) average loss is {avg_loss:.4f}")
        dicts.save_as_json(metrics, save_file)
    accelerator.wait_for_everyone()
    return metrics

def validate(model, dataloader, epoch, accelerator):
    model.eval()
    process_idx = accelerator.process_index
    metrics = {"loss": [], "accuracy": [], "steps": []}
    locals = {"loss_sum": 0.0, "corrects_sum": 0, "valid_toks": 0, "train_step": 0}
    with torch.no_grad():
        for batch_idx, batch in enumerate(dataloader):
            outputs = model(
                input_ids=batch["input_ids"], 
                attention_mask=batch["attention_mask"], 
                labels=batch["labels"]
            )
            loss = outputs.loss

            # Metrics
            predictions = outputs.logits.argmax(dim=-1)
            valids_mask = batch["labels"] != -100 # tokenizer.pad_token_id
            corrects = (predictions[valids_mask] == batch["labels"][valids_mask]).sum().item()
            locals["corrects_sum"] += corrects
            locals["valid_toks"] += valids_mask.sum().item()
            locals["loss_sum"] += loss.item()
            locals["train_step"] = batch_idx + 1

            if batch_idx % 1000 == 0:
                logging.info(f"{process_idx}: valid step number {batch_idx}")
                metrics = record(metrics, locals, accelerator, save_file=f"valid_metrics{epoch}.json")
    metrics = record(metrics, locals, accelerator, save_file=f"valid_metrics{epoch}.json")

def train(model, dataloader, optimizer, lr_scheduler, epoch, config_dict, accelerator):
    model.train()
    process_idx = accelerator.process_index
    metrics = {"loss": [], "accuracy": [], "steps": []}
    locals = {"loss_sum": 0.0, "corrects_sum": 0, "valid_toks": 0, "train_step": 0}
    for batch_idx, batch in enumerate(dataloader):
        # model.forward() and loss calculation
        outputs = model(
            input_ids=batch["input_ids"], 
            attention_mask=batch["attention_mask"], 
            labels=batch["labels"]
        )
        loss = outputs.loss
        log_nan_loss(loss, batch_idx, accelerator)
        
        # Backpropagation
        optimizer.zero_grad()
        accelerator.backward(loss)
        optimizer.step()
        lr_scheduler.step()

        # Metrics
        predictions = outputs.logits.argmax(dim=-1)
        valids_mask = batch["labels"] != -100
        corrects = (predictions[valids_mask] == batch["labels"][valids_mask]).sum().item()
        locals["corrects_sum"] += corrects
        locals["valid_toks"] += valids_mask.sum().item()
        locals["loss_sum"] += loss.item()
        locals["train_step"] = batch_idx + 1

        # progress feedback
        if batch_idx % 1000 == 0:
            logging.info(f"{process_idx}: train step number {batch_idx}")
            metrics = record(metrics, locals, accelerator, save_file=f"train_metrics{epoch}.json")
        if batch_idx % 10000 == 0 and batch_idx > 0:
            if accelerator.is_main_process:
                _, models_dir = save_ops.get_dirs(config_dict)
                save_ops.save_in(accelerator.unwrap_model(model), models_dir)
        accelerator.wait_for_everyone()

    logging.info(f"{process_idx}: Total number of batches was {batch_idx + 1}")
    logging.info(f"{process_idx}: Final learning rate was: {lr_scheduler.get_last_lr()[0]}")
    _ = record(metrics, locals, accelerator, save_file=f"train_metrics{epoch}.json")
            
def do_epochs(train_dataloader, valid_dataloader, model, optimizer, lr_scheduler, accelerator, config_dict):
    num_epochs = config_dict["num_epochs"]
    for epoch in range(num_epochs):
        logging.info(f"Epoch {epoch + 1} of {num_epochs}")
        train(model, train_dataloader, optimizer, lr_scheduler, epoch, config_dict, accelerator)
        if accelerator.is_main_process:
            _, models_dir = save_ops.get_dirs(config_dict)
            save_ops.save_in(accelerator.unwrap_model(model), models_dir)
            logging.info(f"Finished training loop. Checkpoint saved for epoch {epoch}.")
        accelerator.wait_for_everyone()
        validate(model, valid_dataloader, epoch, accelerator)
    if accelerator.is_main_process:
        logging.info("Training complete.")

def main(accelerator, config_dict):
    model, tokenizer, train_data, valid_data = load_model_tok_data(accelerator, config_dict)

    train_dataloader, valid_dataloader, model, optimizer, lr_scheduler = prepare_for_multi_train(model, tokenizer, train_data, valid_data, accelerator, batch_size=config_dict["hf_training_arguments"]["per_device_train_batch_size"])

    do_epochs(train_dataloader, valid_dataloader, model, optimizer, lr_scheduler, accelerator, config_dict)

The Trainer version is:

def compute_t5_metrics(eval_preds: EvalPrediction):
    predictions, labels = eval_preds.predictions, eval_preds.label_ids

    valid_mask = (labels != -100)
    predicted_token_ids = np.argmax(predictions, axis=-1) # most likely token IDs
    valid_predictions = predicted_token_ids[valid_mask]
    valid_labels = labels[valid_mask]
    corrects_sum = (valid_predictions == valid_labels).sum()
    valid_toks = valid_mask.sum()
    accuracy = corrects_sum.item() / valid_toks.item() if valid_toks.item() > 0 else 0.0
    return {"accuracy": accuracy}

def get_training_args(config_dict):
    pre_args = config_dict["hf_training_arguments"]
    batches_per_epoch = config_dict["batches_per_epoch"]

    pre_args["max_steps"] = config_dict["num_epochs"] * batches_per_epoch
    pre_args["logging_dir"] = os.getcwd()
    pre_args["logging_steps"] = max(1, batches_per_epoch // 100)
    pre_args["eval_strategy"] = "steps"
    pre_args["eval_steps"] = batches_per_epoch
    pre_args["output_dir"] = config_dict["models_dir"]
    pre_args["overwrite_output_dir"] = True
    pre_args["save_strategy"] = "steps"
    pre_args["save_total_limit"] = 5
    pre_args["save_steps"] = batches_per_epoch
    
    train_args = TrainingArguments(**pre_args)
    return train_args

def main_alt(accelerator, config_dict):
    model, tokenizer, train_data, valid_data = load_model_tok_data(accelerator, config_dict)
    data_collator = DataCollatorForSeq2Seq(
        tokenizer=tokenizer,
        model=model,
        padding="max_length",
        max_length=model.config.n_positions
    )
    train_args = get_training_args(config_dict)
    trainer = Trainer(
        model=model,
        args=train_args,
        train_dataset=train_data,
        eval_dataset=valid_data,
        tokenizer=tokenizer,
        data_collator=data_collator,
        compute_metrics=compute_t5_metrics
    )
    
    train_results = trainer.train()

    logging.info("Training complete.")
    trainer.save_model()
    trainer.save_metrics("all", train_results.metrics)
    trainer.save_state()
    logging.info("Model saved.")

Here is extra system information:

- `transformers` version: 4.52.4
- Platform: Linux-5.4.0-169-generic-x86_64-with-glibc2.31
- Python version: 3.13.1
- Huggingface_hub version: 0.33.0
- Safetensors version: 0.4.5
- Accelerate version: 1.3.0
- Accelerate config: 	- compute_environment: LOCAL_MACHINE
	- distributed_type: MULTI_GPU
	- mixed_precision: fp16
	- use_cpu: False
	- debug: True
	- num_processes: 4
	- machine_rank: 0
	- num_machines: 1
	- gpu_ids: [0,1,2,3,4,5,6,7]
	- rdzv_backend: static
	- same_network: True
	- main_training_function: main
	- enable_cpu_affinity: True
	- downcast_bf16: no
	- tpu_use_cluster: False
	- tpu_use_sudo: False
	- tpu_env: []
- DeepSpeed version: not installed
- PyTorch version (GPU?): 2.6.0+cu124 (True)
- Tensorflow version (GPU?): not installed (NA)
- Flax version (CPU?/GPU?/TPU?): not installed (NA)
- Jax version: not installed
- JaxLib version: not installed
- Using distributed or parallel set-up in script?: Yes
- Using GPU in script?: Yes
- GPU type: Tesla V100-SXM2-32GB-LS

If you wish to rule out the training parameters as the issue: the customised loop's logs show that each process/GPU (4 in total) passed through 33878 batches per epoch, over the 3 epochs it ran. Accordingly, for the Trainer loop I used batches_per_epoch = 135512 (= 33878 * 4) and num_epochs = 3.
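
For reference, plugging those numbers into the get_training_args function above gives the following (this config_dict excerpt is only illustrative of the values passed in):

config_dict["batches_per_epoch"] = 33878 * 4   # 135512, the batches seen per epoch across the 4 GPUs
config_dict["num_epochs"] = 3
# inside get_training_args this becomes:
# pre_args["max_steps"] = 3 * 135512 = 406536 training steps for the Trainer run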


While the Trainer is user-friendly, it performs various tasks under the hood and may not always be fast. If you want to minimize the heavy processing, you can try the following approach. (Generated by ChatGPT, so some parts may be inaccurate.)

Since Transformers is a library that prioritizes versatility over speed, using third-party libraries may be worth considering for heavy fine-tuning tasks that require efficient resource utilization.


Summary

Here’s a concise, end-to-end example of a highly optimized Trainer setup that leverages compiler fusion, low-precision optimizers, activation checkpointing, ZeRO sharding, and data‐loading tweaks. Each setting is annotated with its source:

from transformers import Trainer, TrainingArguments, AutoModelForCausalLM, AutoTokenizer
from transformers import DataCollatorWithFlattening

# 1. Load model & tokenizer
model = AutoModelForCausalLM.from_pretrained("gpt2")  # base LM
tokenizer = AutoTokenizer.from_pretrained("gpt2")

# 2. Prepare a data collator that packs examples for FlashAttention 2
data_collator = DataCollatorWithFlattening()  # packs sequences without padding; benefits require a FlashAttention-2-capable model

# 3. Define extreme-performance TrainingArguments
training_args = TrainingArguments(
    output_dir="./results",
    # Mixed-precision FP16 for faster matrix ops and half the memory bandwidth
    fp16=True,
    # PyTorch 2.0 compile to fuse kernels and remove Python overhead
    torch_compile=True,
    torch_compile_backend="inductor",
    # 8-bit AdamW from bitsandbytes: ~4× faster and 75% less optimizer memory
    optim="paged_adamw_8bit",
    # Save ~40% memory via gradient checkpointing at the cost of extra forward passes
    gradient_checkpointing=True,
    # ZeRO Stage 2 via DeepSpeed for optimizer/gradient sharding
    deepspeed="ds_zero2_config.json",
    # Batch & accumulation to maximize effective batch size without OOMs
    per_device_train_batch_size=4,
    gradient_accumulation_steps=8,
    # Overlap data loading with GPU compute
    dataloader_num_workers=4,
    dataloader_pin_memory=True,  # faster host→GPU transfers
    # Minimize interruptions: no eval/logging during training
    eval_strategy="no",           # renamed from evaluation_strategy in recent transformers versions
    save_strategy="epoch",        # only checkpoint once per epoch
    logging_strategy="no",
    disable_tqdm=True,
)

# 4. Instantiate Trainer
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,   # your preprocessed dataset
    tokenizer=tokenizer,
    data_collator=data_collator,
)

# 5. Launch training
trainer.train()

Key Optimizations Explained

  1. Mixed Precision (FP16): Activates NVIDIA AMP (automatic mixed precision) under the hood to speed up GEMMs and reduce memory bandwidth, often yielding 1.5–2× throughput gains on modern GPUs (huggingface.co).
  2. torch.compile: Leverages TorchDynamo+Inductor to fuse Python calls into optimized CUDA kernels, with reported 1.5–3× speedups in training loops (huggingface.co).
  3. 8-Bit Optimizer: The paged_adamw_8bit optimizer from bitsandbytes slashes optimizer-state memory by 75% while maintaining numeric fidelity, enabling larger batch sizes and up to 4× faster updates (huggingface.co).
  4. Gradient Checkpointing: Trades extra forward-pass compute for a ~40% reduction in activation memory, crucial when stacking deep transformer layers (sbert.net).
  5. DeepSpeed ZeRO-2: Shards optimizer and gradient states across GPUs (or offloads them to CPU), cutting per-GPU memory and boosting effective scale without changing your training script (huggingface.co); a minimal sketch of the referenced ds_zero2_config.json is included at the end of this reply.
  6. FlashAttention Packing: Using DataCollatorWithFlattening packs variable-length sequences into contiguous tensors, minimizing padding waste and unlocking FlashAttention-2’s 2×–4× speedups in the attention layer (huggingface.co).
  7. DataLoader Tuning: Setting dataloader_num_workers>0 and dataloader_pin_memory=True overlaps data loading and host→GPU transfer, smoothing out I/O stalls (github.com).

With all these features enabled, the Trainer can run on par with, or even faster than, a bespoke PyTorch loop, while retaining the convenience of built-in checkpointing, distributed support, and logging control.
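
Since the ds_zero2_config.json above is only referenced by name, here is a minimal sketch of what a ZeRO Stage 2 configuration could contain. The values are illustrative; TrainingArguments(deepspeed=...) accepts either a path to a JSON file with this content or an equivalent Python dict, and "auto" entries are filled in from the TrainingArguments themselves.

# Minimal ZeRO Stage 2 config sketch (illustrative, not tuned for any particular model).
# Pass it as TrainingArguments(deepspeed=ds_zero2_config) or dump it to ds_zero2_config.json.
ds_zero2_config = {
    "fp16": {"enabled": "auto"},
    "zero_optimization": {
        "stage": 2,                    # shard optimizer state and gradients across GPUs
        "overlap_comm": True,          # overlap gradient reduction with the backward pass
        "contiguous_gradients": True,  # copy gradients into contiguous buffers to limit fragmentation
    },
    "train_micro_batch_size_per_gpu": "auto",
    "gradient_accumulation_steps": "auto",
}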

Thanks for engaging with my question. Sadly, I asked it because no (chat) model can give me a concrete and specific answer. I've tried Gemini, Claude, ChatGPT, and Google AI Studio. Their suggestions are often hallucinations (I know because they fail or do not match the Trainer's source code), and even when they are not, they do not apply to my particular use case. The primary reason I am using the Trainer is to gain a deeper understanding of it and transfer that knowledge to the Supervised Fine-Tuning Trainer (SFTTrainer). I would really like a reply from human experts about this specific case.


Hmm, if you want to know more about the technical details of fine-tuning, I think it would be quicker to ask on Hugging Face Discord or Unsloth’s Discord…

Regarding the speed difference between the Trainer and a custom PyTorch loop, the opposite case can also occur. If you want to make effective use of multiple GPUs with the Trainer, I think you will need FSDP or DeepSpeed, so there may be some overhead there.
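
For example, here is a minimal sketch of enabling PyTorch FSDP purely through TrainingArguments; the specific fsdp options and the T5Block wrapping class are illustrative choices rather than a tuned recipe, and the rest of the Trainer setup stays the same:

from transformers import TrainingArguments

# Minimal FSDP sketch: shard parameters, gradients and optimizer state across the GPUs.
training_args = TrainingArguments(
    output_dir="./results",
    fp16=True,
    per_device_train_batch_size=8,
    fsdp="full_shard auto_wrap",   # enable FSDP with automatic module wrapping
    fsdp_config={
        # assumption: for a T5-style model, wrap its encoder/decoder blocks
        "transformer_layer_cls_to_wrap": ["T5Block"],
    },
)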