Is the Trainer slower than customised loops?

Hello, I have written two scripts that attempt to implement the same training loop: one customised loop and one using the Trainer. The customised loop finished in approximately 6 hours, whereas the Trainer loop has been running for 16 hours and the progress bar only reports 21% progress. Both were executed via accelerate launch --gpu_ids 0,1,2,3 train_t5.py config_dict.json. Since I am self-taught and new to these libraries and concepts, I am likely missing some configuration in the Trainer that would make the loop run as fast as (if not faster than) the customised version. If not, what extra work does the Trainer do that causes it to be slower, and how can I turn it off? Thanks for any help or improvements you can suggest to either of the two scripts.

The customised version is:

def prepare_for_multi_train(model, tokenizer, train_data, valid_data, accelerator, batch_size=8):
    # Dataloaders
    data_collator = DataCollatorForSeq2Seq(tokenizer=tokenizer, model=model, padding="max_length", max_length=model.config.n_positions)
    train_dataloader = DataLoader(train_data, batch_size=batch_size, shuffle=False, collate_fn=data_collator)
    valid_dataloader = DataLoader(valid_data, batch_size=batch_size, shuffle=False, collate_fn=data_collator)

    # Optimizer and Scheduler
    optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5, weight_decay=0.01)
    lr_scheduler = get_scheduler("constant", optimizer=optimizer)

    # Accelerate them
    train_dataloader = accelerator.prepare(train_dataloader)
    valid_dataloader = accelerator.prepare(valid_dataloader)
    log_dataloading(train_dataloader, accelerator)
    model, optimizer, lr_scheduler = accelerator.prepare(
            model, optimizer, lr_scheduler
    )
    return train_dataloader, valid_dataloader, model, optimizer, lr_scheduler

def load_model_tok_data(accelerator, config_dict):
    accelerator.wait_for_everyone()
    if accelerator.is_main_process:
        # Tokenizer
        tokenizer = tokops.get_trained_tokenizer(config_dict, making_dirs=True)
        vocab_size = len(tokenizer)

        # Data
        train_kwargs = {
            "tokenizer": tokenizer,
            "json_data_dir": config_dict["data_dir"],
            "split": "train",
            "data_format": config_dict["data_format"]
        }
        train_data = IterableDataset.from_generator(
            tokops.t5_tokked_model_inputs,
            gen_kwargs=train_kwargs
        )
        valid_kwargs = train_kwargs.copy()
        valid_kwargs["split"] = "valid"
        valid_data = IterableDataset.from_generator(
            tokops.t5_tokked_model_inputs,
            gen_kwargs=valid_kwargs
        )

        # Model
        model = get_model(config_dict, vocab_size)
    else:
        tokenizer = None
        train_data, valid_data = None, None
        model = None

    accelerator.wait_for_everyone()
    tokenizer = broadcast_object_list([tokenizer])[0]
    train_data = broadcast_object_list([train_data])[0]
    valid_data = broadcast_object_list([valid_data])[0]
    model = broadcast_object_list([model])[0]    
    logging.info(f"{accelerator.process_index}: Successfully broadcasted data, the evidence is that the type of model is {type(model)}")
    return model, tokenizer, train_data, valid_data

def record(metrics, locals, accelerator, save_file="metrics.json"):
    accelerator.wait_for_everyone()
    local_stats = torch.tensor([locals["loss_sum"], locals["corrects_sum"], locals["valid_toks"], locals["train_step"]], device=accelerator.device)
    global_loss, global_corrects_sum, global_valid_toks, global_train_step = accelerator.reduce(local_stats, reduction="sum")
    if accelerator.is_main_process:
        avg_loss = global_loss.item() / global_train_step.item()
        metrics["loss"].append(avg_loss)
        metrics["accuracy"].append(global_corrects_sum.item() / global_valid_toks.item())
        metrics["steps"].append(global_train_step.item())
        logging.info(f"Current step's ({locals['train_step']}) average loss is {avg_loss:.4f}")
        dicts.save_as_json(metrics, save_file)
    accelerator.wait_for_everyone()
    return metrics

def validate(model, dataloader, epoch, accelerator):
    model.eval()
    process_idx = accelerator.process_index
    metrics = {"loss": [], "accuracy": [], "steps": []}
    locals = {"loss_sum": 0.0, "corrects_sum": 0, "valid_toks": 0, "train_step": 0}
    with torch.no_grad():
        for batch_idx, batch in enumerate(dataloader):
            outputs = model(
                input_ids=batch["input_ids"], 
                attention_mask=batch["attention_mask"], 
                labels=batch["labels"]
            )
            loss = outputs.loss

            # Metrics
            predictions = outputs.logits.argmax(dim=-1)
            valids_mask = batch["labels"] != -100 # tokenizer.pad_token_id
            corrects = (predictions[valids_mask] == batch["labels"][valids_mask]).sum().item()
            locals["corrects_sum"] += corrects
            locals["valid_toks"] += valids_mask.sum().item()
            locals["loss_sum"] += loss.item()
            locals["train_step"] = batch_idx + 1

            if batch_idx % 1000 == 0:
                logging.info(f"{process_idx}: valid step number {batch_idx}")
                metrics = record(metrics, locals, accelerator, save_file=f"valid_metrics{epoch}.json")
    metrics = record(metrics, locals, accelerator, save_file=f"valid_metrics{epoch}.json")

def train(model, dataloader, optimizer, lr_scheduler, epoch, config_dict, accelerator):
    model.train()
    process_idx = accelerator.process_index
    metrics = {"loss": [], "accuracy": [], "steps": []}
    locals = {"loss_sum": 0.0, "corrects_sum": 0, "valid_toks": 0, "train_step": 0}
    for batch_idx, batch in enumerate(dataloader):
        # model.forward() and loss calculation
        outputs = model(
            input_ids=batch["input_ids"], 
            attention_mask=batch["attention_mask"], 
            labels=batch["labels"]
        )
        loss = outputs.loss
        log_nan_loss(loss, batch_idx, accelerator)
        
        # Backpropagation
        optimizer.zero_grad()
        accelerator.backward(loss)
        optimizer.step()
        lr_scheduler.step()

        # Metrics
        predictions = outputs.logits.argmax(dim=-1)
        valids_mask = batch["labels"] != -100
        corrects = (predictions[valids_mask] == batch["labels"][valids_mask]).sum().item()
        locals["corrects_sum"] += corrects
        locals["valid_toks"] += valids_mask.sum().item()
        locals["loss_sum"] += loss.item()
        locals["train_step"] = batch_idx + 1

        # progress feedback
        if batch_idx % 1000 == 0:
            logging.info(f"{process_idx}: train step number {batch_idx}")
            metrics = record(metrics, locals, accelerator, save_file=f"train_metrics{epoch}.json")
        if batch_idx % 10000 == 0 and batch_idx > 0:
            if accelerator.is_main_process:
                _, models_dir = save_ops.get_dirs(config_dict)
                save_ops.save_in(accelerator.unwrap_model(model), models_dir)
        accelerator.wait_for_everyone()

    logging.info(f"{process_idx}: Total number of batches was {batch_idx + 1}")
    logging.info(f"{process_idx}: Final learning rate was: {lr_scheduler.get_last_lr()[0]}")
    _ = record(metrics, locals, accelerator, save_file=f"train_metrics{epoch}.json")
            
def do_epochs(train_dataloader, valid_dataloader, model, optimizer, lr_scheduler, accelerator, config_dict):
    num_epochs = config_dict["num_epochs"]
    for epoch in range(num_epochs):
        logging.info(f"Epoch {epoch + 1} of {num_epochs}")
        train(model, train_dataloader, optimizer, lr_scheduler, epoch, config_dict, accelerator)
        if accelerator.is_main_process:
            _, models_dir = save_ops.get_dirs(config_dict)
            save_ops.save_in(accelerator.unwrap_model(model), models_dir)
            logging.info(f"Finished training loop. Checkpoint saved for epoch {epoch}.")
        accelerator.wait_for_everyone()
        validate(model, valid_dataloader, epoch, accelerator)
    if accelerator.is_main_process:
        logging.info("Training complete.")

def main(accelerator, config_dict):
    model, tokenizer, train_data, valid_data = load_model_tok_data(accelerator, config_dict)

    train_dataloader, valid_dataloader, model, optimizer, lr_scheduler = prepare_for_multi_train(model, tokenizer, train_data, valid_data, accelerator, batch_size=config_dict["hf_training_arguments"]["per_device_train_batch_size"])

    do_epochs(train_dataloader, valid_dataloader, model, optimizer, lr_scheduler, accelerator, config_dict)

The Trainer version is:

def compute_t5_metrics(eval_preds: EvalPrediction):
    predictions, labels = eval_preds.predictions, eval_preds.label_ids

    valid_mask = (labels != -100)
    predicted_token_ids = np.argmax(predictions, axis=-1) # most likely token IDs
    valid_predictions = predicted_token_ids[valid_mask]
    valid_labels = labels[valid_mask]
    corrects_sum = (valid_predictions == valid_labels).sum()
    valid_toks = valid_mask.sum()
    accuracy = corrects_sum.item() / valid_toks.item() if valid_toks.item() > 0 else 0.0
    return {"accuracy": accuracy}

def get_training_args(config_dict):
    pre_args = config_dict["hf_training_arguments"]
    batches_per_epoch = config_dict["batches_per_epoch"]

    pre_args["max_steps"] = config_dict["num_epochs"] * batches_per_epoch
    pre_args["logging_dir"] = os.getcwd()
    pre_args["logging_steps"] = max(1, batches_per_epoch // 100)
    pre_args["eval_strategy"] = "steps"
    pre_args["eval_steps"] = batches_per_epoch
    pre_args["output_dir"] = config_dict["models_dir"]
    pre_args["overwrite_output_dir"] = True
    pre_args["save_strategy"] = "steps"
    pre_args["save_total_limit"] = 5
    pre_args["save_steps"] = batches_per_epoch
    
    train_args = TrainingArguments(**pre_args)
    return train_args

def main_alt(accelerator, config_dict):
    model, tokenizer, train_data, valid_data = load_model_tok_data(accelerator, config_dict)
    data_collator = DataCollatorForSeq2Seq(
        tokenizer=tokenizer,
        model=model,
        padding="max_length",
        max_length=model.config.n_positions
    )
    train_args = get_training_args(config_dict)
    trainer = Trainer(
        model=model,
        args=train_args,
        train_dataset=train_data,
        eval_dataset=valid_data,
        tokenizer=tokenizer,
        data_collator=data_collator,
        compute_metrics=compute_t5_metrics
    )
    
    train_results = trainer.train()

    logging.info("Training complete.")
    trainer.save_model()
    trainer.save_metrics("all", train_results.metrics)
    trainer.save_state()
    logging.info("Model saved.")

Here is extra system information:

- `transformers` version: 4.52.4
- Platform: Linux-5.4.0-169-generic-x86_64-with-glibc2.31
- Python version: 3.13.1
- Huggingface_hub version: 0.33.0
- Safetensors version: 0.4.5
- Accelerate version: 1.3.0
- Accelerate config: 	- compute_environment: LOCAL_MACHINE
	- distributed_type: MULTI_GPU
	- mixed_precision: fp16
	- use_cpu: False
	- debug: True
	- num_processes: 4
	- machine_rank: 0
	- num_machines: 1
	- gpu_ids: [0,1,2,3,4,5,6,7]
	- rdzv_backend: static
	- same_network: True
	- main_training_function: main
	- enable_cpu_affinity: True
	- downcast_bf16: no
	- tpu_use_cluster: False
	- tpu_use_sudo: False
	- tpu_env: []
- DeepSpeed version: not installed
- PyTorch version (GPU?): 2.6.0+cu124 (True)
- Tensorflow version (GPU?): not installed (NA)
- Flax version (CPU?/GPU?/TPU?): not installed (NA)
- Jax version: not installed
- JaxLib version: not installed
- Using distributed or parallel set-up in script?: Yes
- Using GPU in script?: Yes
- GPU type: Tesla V100-SXM2-32GB-LS

If you wish to rule out the training parameters as the issue: the customised loop's logs show that each process/GPU (4 in total) passed through 33878 batches per epoch, over the 3 epochs it ran. Accordingly, for the Trainer loop I used batches_per_epoch = 135512 (= 33878 * 4) and num_epochs = 3.
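
For reference, plugging those numbers into the get_training_args function above gives the following (this config_dict excerpt is only illustrative of the values passed in):

config_dict["batches_per_epoch"] = 33878 * 4   # 135512, the batches seen per epoch across the 4 GPUs
config_dict["num_epochs"] = 3
# inside get_training_args this becomes:
# pre_args["max_steps"] = 3 * 135512 = 406536 training steps for the Trainer run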


While the Trainer is user-friendly, it performs various tasks under the hood and may not always be fast. If you want to minimize the heavy processing, you can try the following approach. (Generated by ChatGPT, so some parts may be inaccurate.)

Since Transformers is a library that prioritizes versatility over speed, using third-party libraries may be worth considering for heavy fine-tuning tasks that require efficient resource utilization.


Summary

Here’s a concise, end-to-end example of a highly optimized Trainer setup that leverages compiler fusion, low-precision optimizers, activation checkpointing, ZeRO sharding, and data‐loading tweaks. Each setting is annotated with its source:

from transformers import Trainer, TrainingArguments, AutoModelForCausalLM, AutoTokenizer
from transformers import DataCollatorWithFlattening

# 1. Load model & tokenizer
model = AutoModelForCausalLM.from_pretrained("gpt2")  # base LM
tokenizer = AutoTokenizer.from_pretrained("gpt2")

# 2. Prepare a data collator that packs examples for FlashAttention 2
data_collator = DataCollatorWithFlattening()  # packs sequences without padding; benefits require a FlashAttention-2-capable model

# 3. Define extreme-performance TrainingArguments
training_args = TrainingArguments(
    output_dir="./results",
    # Mixed-precision FP16 for faster matrix ops and half the memory bandwidth
    fp16=True,
    # PyTorch 2.0 compile to fuse kernels and remove Python overhead
    torch_compile=True,
    torch_compile_backend="inductor",
    # 8-bit AdamW from bitsandbytes: ~4× faster and 75% less optimizer memory
    optim="paged_adamw_8bit",
    # Save ~40% memory via gradient checkpointing at the cost of extra forward passes
    gradient_checkpointing=True,
    # ZeRO Stage 2 via DeepSpeed for optimizer/gradient sharding
    deepspeed="ds_zero2_config.json",
    # Batch & accumulation to maximize effective batch size without OOMs
    per_device_train_batch_size=4,
    gradient_accumulation_steps=8,
    # Overlap data loading with GPU compute
    dataloader_num_workers=4,
    dataloader_pin_memory=True,  # faster host→GPU transfers
    # Minimize interruptions: no eval/logging during training
    eval_strategy="no",           # renamed from evaluation_strategy in recent transformers versions
    save_strategy="epoch",        # only checkpoint once per epoch
    logging_strategy="no",
    disable_tqdm=True,
)

# 4. Instantiate Trainer
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,   # your preprocessed dataset
    tokenizer=tokenizer,
    data_collator=data_collator,
)

# 5. Launch training
trainer.train()

Key Optimizations Explained

  1. Mixed Precision (FP16): Activates NVIDIA AMP (automatic mixed precision) under the hood to speed up GEMMs and reduce memory bandwidth, often yielding 1.5–2× throughput gains on modern GPUs (huggingface.co).
  2. torch.compile: Leverages TorchDynamo+Inductor to fuse Python calls into optimized CUDA kernels, with reported 1.5–3× speedups in training loops (huggingface.co).
  3. 8-Bit Optimizer: The paged_adamw_8bit optimizer from bitsandbytes slashes optimizer-state memory by 75% while maintaining numeric fidelity, enabling larger batch sizes and up to 4× faster updates (huggingface.co).
  4. Gradient Checkpointing: Trades extra forward-pass compute for a ~40% reduction in activation memory, crucial when stacking deep transformer layers (sbert.net).
  5. DeepSpeed ZeRO-2: Shards optimizer and gradient states across GPUs (or offloads them to CPU), cutting per-GPU memory and boosting effective scale without changing your training script (huggingface.co); a minimal sketch of the referenced ds_zero2_config.json is included at the end of this reply.
  6. FlashAttention Packing: Using DataCollatorWithFlattening packs variable-length sequences into contiguous tensors, minimizing padding waste and unlocking FlashAttention-2’s 2×–4× speedups in the attention layer (huggingface.co).
  7. DataLoader Tuning: Setting dataloader_num_workers>0 and dataloader_pin_memory=True overlaps data loading and host→GPU transfer, smoothing out I/O stalls (github.com).

With all these features enabled, the Trainer can run on par with, or even faster than, a bespoke PyTorch loop, while retaining the convenience of built-in checkpointing, distributed support, and logging control.
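
Since the ds_zero2_config.json above is only referenced by name, here is a minimal sketch of what a ZeRO Stage 2 configuration could contain. The values are illustrative; TrainingArguments(deepspeed=...) accepts either a path to a JSON file with this content or an equivalent Python dict, and "auto" entries are filled in from the TrainingArguments themselves.

# Minimal ZeRO Stage 2 config sketch (illustrative, not tuned for any particular model).
# Pass it as TrainingArguments(deepspeed=ds_zero2_config) or dump it to ds_zero2_config.json.
ds_zero2_config = {
    "fp16": {"enabled": "auto"},
    "zero_optimization": {
        "stage": 2,                    # shard optimizer state and gradients across GPUs
        "overlap_comm": True,          # overlap gradient reduction with the backward pass
        "contiguous_gradients": True,  # copy gradients into contiguous buffers to limit fragmentation
    },
    "train_micro_batch_size_per_gpu": "auto",
    "gradient_accumulation_steps": "auto",
}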

Thanks for engaging with my question. Sadly, I asked it because no (chat) model can give me a concrete and specific answer. I've tried Gemini, Claude, ChatGPT, and Google AI Studio. Their suggestions are often hallucinations (I know because they fail or do not match the Trainer's source code), and even when they are not, they do not apply to my particular use case. The primary reason I am using the Trainer is to gain a deeper understanding of it and transfer that knowledge to the Supervised Fine-Tuning Trainer (SFTTrainer). I would really like a reply from human experts about this specific case.


Hmm, if you want to know more about the technical details of fine-tuning, I think it would be quicker to ask on Hugging Face Discord or Unsloth’s Discord…

Regarding the speed difference between the Trainer and a custom PyTorch loop, the opposite case can also occur. If you want to make effective use of multiple GPUs with the Trainer, I think you will need FSDP or DeepSpeed, so there may be some overhead there.
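
For example, here is a minimal sketch of enabling PyTorch FSDP purely through TrainingArguments; the specific fsdp options and the T5Block wrapping class are illustrative choices rather than a tuned recipe, and the rest of the Trainer setup stays the same:

from transformers import TrainingArguments

# Minimal FSDP sketch: shard parameters, gradients and optimizer state across the GPUs.
training_args = TrainingArguments(
    output_dir="./results",
    fp16=True,
    per_device_train_batch_size=8,
    fsdp="full_shard auto_wrap",   # enable FSDP with automatic module wrapping
    fsdp_config={
        # assumption: for a T5-style model, wrap its encoder/decoder blocks
        "transformer_layer_cls_to_wrap": ["T5Block"],
    },
)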