When fine-tuning google/vit-base-patch16-384, the train loss is 0 and the eval loss is NaN

When fine-tuning google/vit-base-patch16-384, the train loss is 0 and the eval loss is NaN. However, google/vit-base-patch16-224 can be fine-tuned correctly, which confuses me.
I have checked my data and the processor. I'm sure the processor loaded from ViTImageProcessor is correct and that the resolution of my image data is 384x384.
I have tried bf16, fp16, and also fp32 for training, but none of them work.
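For reference, this is roughly how I checked the processor output (a minimal sketch; the image path is just a placeholder for one of my training images):

from PIL import Image
from transformers import ViTImageProcessor

processor = ViTImageProcessor.from_pretrained("google/vit-base-patch16-384")

# placeholder path: any one of my 384x384 training images
image = Image.open("data/example.jpg").convert("RGB")

inputs = processor(images=image, return_tensors="pt")
print(inputs["pixel_values"].shape)  # expected: torch.Size([1, 3, 384, 384])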
Here is my TrainingArguments:

from transformers import TrainingArguments

training_args = TrainingArguments(
  run_name='vit-base-patch16-384-ToGenus',
  output_dir="./check_point",
  remove_unused_columns=False,  # keep the dataset columns my collator needs
  per_device_train_batch_size=16,
  per_device_eval_batch_size=16,
  evaluation_strategy="steps",
  save_strategy="steps",
  logging_strategy="steps",
  logging_steps=10,
  eval_steps=10,
  num_train_epochs=10,
  bf16=True,
  learning_rate=5e-8,
  save_total_limit=2,
  load_best_model_at_end=True,
  metric_for_best_model='accuracy',
)
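For completeness, here is roughly how I build the Trainer around these arguments (a sketch; normalize_ds, collate_fn, compute_metrics, and the label count are placeholders standing for my actual dataset splits, image collator, accuracy metric function, and number of genus classes):

from transformers import Trainer, ViTForImageClassification

num_labels = 42  # placeholder: my actual number of genus classes

model = ViTForImageClassification.from_pretrained(
  "google/vit-base-patch16-384",
  num_labels=num_labels,
  ignore_mismatched_sizes=True,  # the pretrained classification head is replaced by a fresh one
)

trainer = Trainer(
  model=model,
  args=training_args,
  train_dataset=normalize_ds["train"],  # preprocessed dataset splits
  eval_dataset=normalize_ds["test"],
  data_collator=collate_fn,             # stacks pixel_values and labels into tensors
  compute_metrics=compute_metrics,      # returns {"accuracy": ...}
)
trainer.train()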

How can I solve this? :smiling_face_with_tear:

Hmm interesting, thanks for reporting.

I'd recommend checking whether the model is able to overfit (i.e. achieve 100% accuracy on) as few as 2 examples. If it can, that shows the issue is not on the modeling side.

As explained in this guide to debugging neural networks: A Recipe for Training Neural Networks
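A minimal sketch of that sanity check, reusing the placeholders from the post above (model, normalize_ds, collate_fn, compute_metrics): restrict training to 2 examples, train long enough to memorize them, and evaluate on those same 2 examples.

from transformers import Trainer, TrainingArguments

tiny_ds = normalize_ds["train"].select(range(2))  # just 2 training examples

overfit_args = TrainingArguments(
  output_dir="./overfit_check",
  remove_unused_columns=False,
  per_device_train_batch_size=2,
  num_train_epochs=200,   # plenty of updates to memorize 2 samples
  learning_rate=5e-5,
  logging_steps=10,
  report_to="none",
)

overfit_trainer = Trainer(
  model=model,
  args=overfit_args,
  train_dataset=tiny_ds,
  eval_dataset=tiny_ds,   # evaluate on the same 2 examples
  data_collator=collate_fn,
  compute_metrics=compute_metrics,
)
overfit_trainer.train()
print(overfit_trainer.evaluate())  # expect accuracy close to 1.0 if the modeling side is fine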

I have also experienced this issue, regardless of the model. I think it is more of an issue related to the Trainer in transformers.

Update: I see you’re using a learning rate of 5e-8. That’s too low. I’d try higher learning rates like 2e-4, 3e-4, 5e-5.

Well, it is true that the learning rate is too low. But in my case, even when the learning rate is 2e-4, it sometimes appears as follows:

{'eval_loss': nan, 'eval_bbox_AP50:95': 0.0, 'eval_bbox_AP50': 0.0, 'eval_runtime': 9.4981, 'eval_samples_per_second': 4.106, 'eval_steps_per_second': 1.053, 'epoch': 0.97}
8%|███████▍ | 19/228 [01:23<07:47, 2.24s/it]
Checkpoint destination directory ./output/checkpoint-19 already exists and is non-empty. Saving will proceed but saved results may be invalid.
Checkpoint destination directory ./output/checkpoint-19 already exists and is non-empty. Saving will proceed but saved results may be invalid.
Checkpoint destination directory ./output/checkpoint-19 already exists and is non-empty. Saving will proceed but saved results may be invalid.
Checkpoint destination directory ./output/checkpoint-19 already exists and is non-empty. Saving will proceed but saved results may be invalid.
Removed shared tensor {'class_embed.4.bias', 'class_embed.6.bias', 'class_embed.3.weight', 'class_embed.1.weight', 'class_embed.2.bias', 'class_embed.2.weight', 'class_embed.4.weight', 'class_embed.5.weight', 'class_embed.0.weight', 'class_embed.6.weight', 'class_embed.5.bias', 'class_embed.3.bias', 'class_embed.0.bias', 'class_embed.1.bias'} while saving. This should be OK, but check by verifying that you don't receive any warning while reloading
9%|███████▊ | 20/228 [01:36<29:33, 8.53s/it]
INFO:model.api:patch_model_train_progress: {'training_stage': 'training', 'progress': 8.68}
INFO:model.api:patch_model_train_progress: {'training_stage': 'training', 'progress': 8.68}
INFO:model.api:patch_model_train_progress: {'training_stage': 'training', 'progress': 8.68}
INFO:model.api:patch_model_train_progress: {'training_stage': 'training', 'progress': 8.68}
{'loss': 0.0, 'learning_rate': 0.0002, 'epoch': 1.03}
11%|█████████▊ | 25/228 [01:45<10:03, 2.97s/it]
INFO:model.api:patch_model_train_progress: {'training_stage': 'training', 'progress': 10.86}
INFO:model.api:patch_model_train_progress: {'training_stage': 'training', 'progress': 10.86}
INFO:model.api:patch_model_train_progress: {'training_stage': 'training', 'progress': 10.86}
INFO:model.api:patch_model_train_progress: {'training_stage': 'training', 'progress': 10.86}
{'loss': 0.0, 'learning_rate': 0.0002, 'epoch': 1.28}

Thank you for your advice. To check whether the model can overfit my dataset, I used trainer.evaluate(eval_dataset=normalize_ds["train"]) to obtain the train accuracy, and it's very low (0.018), similar to the eval accuracy.
Regarding the learning rates you mentioned, I have tried 2e-3, 2e-4, 5e-5, and 5e-6; the train loss is still 0 and the eval loss is NaN. :sob:
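For context, a compute_metrics that reports accuracy in this setup would look roughly like the standard one below (a sketch using the Hugging Face evaluate library):

import numpy as np
import evaluate

accuracy = evaluate.load("accuracy")

def compute_metrics(eval_pred):
  logits, labels = eval_pred
  predictions = np.argmax(logits, axis=-1)
  return accuracy.compute(predictions=predictions, references=labels)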

What’s your Transformers version? Could you provide a reproducer?

I recommend you try plain float32! In my case, accelerate with DDP and --mixed_precision fp16 only rarely gives train loss 0 and eval loss NaN. The key point is that it happens rarely… :frowning:
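Concretely, that means disabling mixed precision in the TrainingArguments (a partial sketch; the other arguments stay as before):

from transformers import TrainingArguments

training_args = TrainingArguments(
  output_dir="./check_point",
  # ... other arguments unchanged ...
  bf16=False,  # no bfloat16 autocast
  fp16=False,  # no float16 autocast; everything runs in plain float32
)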

  • transformers version: 4.37.0.dev0
  • Platform: Linux-5.4.0-153-generic-x86_64-with-glibc2.31
  • Python version: 3.10.13
  • Huggingface_hub version: 0.20.2
  • Safetensors version: 0.4.1
  • Accelerate version: 0.26.1
  • Accelerate config: not found
  • PyTorch version (GPU?): 2.1.2 (True)
  • Tensorflow version (GPU?): not installed (NA)
  • Flax version (CPU?/GPU?/TPU?): not installed (NA)
  • Jax version: not installed
  • JaxLib version: not installed
  • Using GPU in script?:
  • Using distributed or parallel set-up in script?:

Here is my sys info:

- `transformers` version: 4.36.2
- Platform: Linux-5.15.0-87-generic-x86_64-with-glibc2.31
- Python version: 3.10.13
- Huggingface_hub version: 0.20.2
- Safetensors version: 0.4.1
- Accelerate version: 0.26.1
- PyTorch version: 2.1.2+cu121 (True)

I observed the same when fine-tuning convnextv2. The code was working before updating transformers. I thought setting fp16 to False would fix it, but it still happened intermittently. Reverting to transformers version 4.33.0 has fixed it for me.