Hi,
I am trying to fine-tune DeiT for image classification on images of different sizes.
Everything works fine, but as soon as training begins, grad_norm is nan and the loss is 0.0.
Any suggestions as to why this happens?
I'm using Trainer with the following TrainingArguments. I tried gradient_accumulation_steps=4 as well.
I tried setting bf16=True, but my GPU doesn't support it. I also tried fp16=False, and grad_norm is still nan with the loss at 0.0.
args = TrainingArguments(
    f"{model_name}-finetuned",
    remove_unused_columns=False,
    evaluation_strategy="epoch",
    save_strategy="epoch",
    fp16=True,
    learning_rate=1e-6,
    gradient_accumulation_steps=1,
    per_device_train_batch_size=bs,
    per_device_eval_batch_size=bs,
    num_train_epochs=3,
    logging_steps=1,
    load_best_model_at_end=True,
    metric_for_best_model="accuracy",
    push_to_hub=False,
)
trainer = Trainer(
    model=model,
    args=args,
    data_collator=collate_fn,
    compute_metrics=compute_metrics,
    train_dataset=dataset["train"],
    eval_dataset=dataset["test"],
    tokenizer=image_processor,
)