Is a native PyTorch training loop much slower than Trainer?

Hi all,

I am working on a text classification task with the “distilbert-base-uncased” checkpoint and the “emotion” dataset. When I fine-tune the model, I average 0.34s/it with the HF Trainer, but with a native PyTorch training loop I get 29.16s/it. What am I doing wrong? Below are the two snippets; the bulk of the code is taken from the tutorial Fine-tune a pretrained model.

training_args = TrainingArguments(
    output_dir=model_name,
    num_train_epochs=2,
    learning_rate=2e-5,
    per_device_train_batch_size=64,
    per_device_eval_batch_size=64,
    weight_decay=0.01,
    evaluation_strategy="epoch",
    disable_tqdm=False,
    log_level="error",
)

trainer = Trainer(
    model=model,
    args=training_args,
    compute_metrics=compute_metrics,
    train_dataset=emotions_encoded["train"],
    eval_dataset=emotions_encoded["validation"],
    tokenizer=tokenizer,
)
trainer.train()

model.train()
for epoch in range(num_epochs):
    for batch in train_dataloader:
        batch = {k: v.to(device) for k, v in batch.items()}
        outputs = model(**batch)

        loss = outputs.loss
        loss.backward()

        optimizer.step()
        lr_scheduler.step()
        optimizer.zero_grad()
        progress_bar.update(1)
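
For anyone who wants to reproduce the timing, here is a minimal, self-contained sketch of the same loop with a per-iteration timer and a check of the device and batch size. The toy model, the random data, and the names `features`, `labels`, `loss_fn`, and `step_times` are hypothetical stand-ins, not the DistilBERT setup above. Note that s/it figures are only comparable when both runs use the same batch size and the same device.

```python
import time

import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset

# Use the GPU when one is available. A model left on the CPU is one possible
# cause of very slow iterations (an assumption, not a confirmed diagnosis).
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Toy stand-ins (hypothetical): 256 samples, 16 features, 6 classes.
features = torch.randn(256, 16)
labels = torch.randint(0, 6, (256,))
train_dataloader = DataLoader(
    TensorDataset(features, labels), batch_size=64, shuffle=True
)

model = nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Linear(32, 6)).to(device)
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5, weight_decay=0.01)
loss_fn = nn.CrossEntropyLoss()

# Confirm where the weights live and how large each batch is.
model_device = next(model.parameters()).device
batch_size = train_dataloader.batch_size
print(f"model on {model_device}, batch size {batch_size}")

model.train()
step_times = []
for inputs, targets in train_dataloader:
    start = time.perf_counter()
    inputs, targets = inputs.to(device), targets.to(device)
    loss = loss_fn(model(inputs), targets)
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    if device.type == "cuda":
        # Wait for queued GPU work so the timer measures the whole step.
        torch.cuda.synchronize()
    step_times.append(time.perf_counter() - start)

print(f"mean time per iteration: {sum(step_times) / len(step_times):.4f}s")
```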

Hi @giusmatera!

Commonly, the last layer of the model is randomly initialized and trained during fine-tuning, so there is no need to compute gradients for the earlier layers. However, in the PyTorch code, if you didn’t set requires_grad=False for the other layers, it is probably training the whole architecture instead of just the last layer. I believe that is the main reason.
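
A rough sketch of what freezing looks like in plain PyTorch, using a toy model rather than DistilBERT. The `encoder` and `classifier` names are hypothetical; on a transformers model the pretrained body can usually be reached through `model.base_model`, but check the attribute names of your own model first.

```python
import torch
from torch import nn

# Hypothetical stand-in: "encoder" plays the pretrained body,
# "classifier" plays the newly initialized head.
model = nn.ModuleDict({
    "encoder": nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Linear(32, 32)),
    "classifier": nn.Linear(32, 6),
})

# Freeze the body so autograd skips its parameters.
for param in model["encoder"].parameters():
    param.requires_grad = False

trainable = [name for name, p in model.named_parameters() if p.requires_grad]
frozen = [name for name, p in model.named_parameters() if not p.requires_grad]
print("trainable:", trainable)
print("frozen:", frozen)

# Hand only the trainable parameters to the optimizer.
optimizer = torch.optim.AdamW(
    (p for p in model.parameters() if p.requires_grad), lr=2e-5
)

# One training step on random data: only the head receives gradients.
x = torch.randn(8, 16)
y = torch.randint(0, 6, (8,))
logits = model["classifier"](model["encoder"](x))
loss = nn.functional.cross_entropy(logits, y)
loss.backward()
optimizer.step()
```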

Additionally, I couldn’t find any comparisons between HuggingFace Trainer and PyTorch Trainer. So, I am not sure of the answer to that question.

Hi @bariskurtkaya! Thanks so much for your reply!

I suspect (but I might be wrong) that Trainer() by default fine-tunes all the weights of the pre-trained model. This can be checked using the example from Fine-tune a pretrained model, by inspecting [param for param in model.parameters()] right before and right after the call of Trainer(). Could the author of the tutorial - @sgugger - confirm?
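
To make that check concrete, here is a small sketch on a toy model (a stand-in, not DistilBERT; the names `before` and `changed` are mine). One caveat: a plain list of `model.parameters()` holds references to tensors that the optimizer updates in place, so the "before" values need to be cloned for the comparison to mean anything.

```python
import torch
from torch import nn

torch.manual_seed(0)
model = nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Linear(32, 6))

# Clone the weights first: the optimizer updates parameters in place.
before = {name: p.detach().clone() for name, p in model.named_parameters()}

# One plain training step on random data, nothing frozen.
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
x = torch.randn(8, 16)
y = torch.randint(0, 6, (8,))
loss = nn.functional.cross_entropy(model(x), y)
loss.backward()
optimizer.step()

# Parameters whose values differ from the snapshot were trained.
changed = [
    name
    for name, p in model.named_parameters()
    if not torch.equal(before[name], p.detach())
]
n_trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
n_total = sum(p.numel() for p in model.parameters())
print(f"changed: {changed}")
print(f"trainable parameters: {n_trainable} of {n_total}")
```

The same snapshot-and-compare idea should work around trainer.train() or around a native loop, which would show whether the two are updating the same set of weights.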

My idea was that something in the implementation of the native PyTorch training loop was off, but I was not able to work out what. The only reference I found online is Huggingface Transformers (PyTorch) - Custom training loop doubles speed? - Stack Overflow, where they face the opposite problem, i.e. the native PyTorch training loop performs better than Trainer.

Could anyone help?

Best,
Giuseppe

Hi @giusmatera.

Did you solve the problem? I’m having a similar issue but with GPT-2.

You’re spot on! If requires_grad isn’t set to False for the earlier layers, PyTorch ends up training the whole model instead of just the last layer. Freezing the earlier layers by setting requires_grad=False helps focus training where it’s needed.