Hi all,
I am working on a text classification task with a “distilbert-base-uncased” checkpoint and the “emotion” dataset. When I fine-tune the model, I average 0.34 s/it using the HF Trainer class,
but when I use a native PyTorch training loop I get 29.16 s/it. What am I doing wrong? Below are the two snippets; the bulk of the code is taken from the Fine-tune a pretrained model tutorial.
training_args = TrainingArguments(
    output_dir=model_name,
    num_train_epochs=2,
    learning_rate=2e-5,
    per_device_train_batch_size=64,
    per_device_eval_batch_size=64,
    weight_decay=0.01,
    evaluation_strategy="epoch",
    disable_tqdm=False,
    log_level="error",
)
trainer = Trainer(
    model=model,
    args=training_args,
    compute_metrics=compute_metrics,
    train_dataset=emotions_encoded["train"],
    eval_dataset=emotions_encoded["validation"],
    tokenizer=tokenizer,
)
trainer.train()
model.train()
for epoch in range(num_epochs):
    for batch in train_dataloader:
        batch = {k: v.to(device) for k, v in batch.items()}
        outputs = model(**batch)
        loss = outputs.loss
        loss.backward()
        optimizer.step()
        lr_scheduler.step()
        optimizer.zero_grad()
        progress_bar.update(1)
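For completeness, here is roughly the setup the native loop assumes, adapted from the same tutorial. The batch size, learning rate, scheduler choice and preprocessing steps below are my assumptions rather than an exact copy of my script:

import torch
from torch.optim import AdamW
from torch.utils.data import DataLoader
from tqdm.auto import tqdm
from transformers import get_scheduler

# The DataLoader needs tensors only, so drop the raw text column,
# rename the label column, and switch the dataset to PyTorch format.
train_ds = emotions_encoded["train"]
train_ds = train_ds.remove_columns(["text"]).rename_column("label", "labels")
train_ds.set_format("torch")

num_epochs = 2
train_dataloader = DataLoader(train_ds, shuffle=True, batch_size=64)

optimizer = AdamW(model.parameters(), lr=2e-5)
num_training_steps = num_epochs * len(train_dataloader)
lr_scheduler = get_scheduler(
    "linear",
    optimizer=optimizer,
    num_warmup_steps=0,
    num_training_steps=num_training_steps,
)

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)
progress_bar = tqdm(range(num_training_steps))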
Hi @giusmatera!
Commonly, the last layer of the model is randomly initialized and trained during fine-tuning, so there is no need to compute gradients for the earlier layers. However, in your PyTorch code, if you didn’t set requires_grad=False for the other layers, it is probably training the whole architecture instead of just the last layer. I believe that is the main reason.
Additionally, I couldn’t find any comparisons between the HuggingFace Trainer and a native PyTorch training loop, so I am not sure of the answer to that question.
Hi @bariskurtkaya! Thanks so much for your reply!
I suspect (but I might be wrong) that Trainer() fine-tunes all the weights of the pre-trained model by default. This can be checked using the example from Fine-tune a pretrained model, by inspecting [param for param in model.parameters()] right before and right after the call to trainer.train(). Could the author of the tutorial - @sgugger - confirm?
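As a rough sketch of the check I have in mind (the embedding attribute path below is specific to DistilBertForSequenceClassification, so this is an assumption on my part):

import torch

# Count how many parameter tensors are trainable before training.
trainable = [n for n, p in model.named_parameters() if p.requires_grad]
print(f"{len(trainable)} of {len(list(model.named_parameters()))} parameter tensors are trainable")

# Snapshot a weight from deep inside the backbone, train, then compare:
# if it changed, the Trainer is updating more than just the head.
before = model.distilbert.embeddings.word_embeddings.weight.detach().clone()
trainer.train()
after = model.distilbert.embeddings.word_embeddings.weight.detach()
print("embedding weights changed:", not torch.equal(before, after))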
My idea was that something in the implementation of the native PyTorch training loop was off, but I was not able to figure out what. The only reference I found online is Huggingface Transformers (PyTorch) - Custom training loop doubles speed? - Stack Overflow, where they face the opposite problem, i.e. the native PyTorch training loop performs better than the Trainer.
Could anyone help?
Best,
Giuseppe
Hi @giusmatera.
Did you solve the problem? I’m having a similar issue but with GPT-2.
You’re spot on! If requires_grad isn’t set to False for the earlier layers, PyTorch ends up training the whole model instead of just the last layer. Freezing the earlier layers by setting requires_grad=False helps focus training where it’s needed.
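For example, with a DistilBertForSequenceClassification model, something along these lines should work. The attribute names are specific to DistilBERT; for GPT-2 you would freeze model.transformer instead, so treat this as a sketch to adapt:

# Freeze the whole DistilBERT backbone; only the pre-classifier and
# classifier layers remain trainable.
for param in model.distilbert.parameters():
    param.requires_grad = False

# Sanity check: how many parameters are still being trained?
trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
total = sum(p.numel() for p in model.parameters())
print(f"Trainable parameters: {trainable:,} / {total:,}")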