When fine-tuning google/vit-base-patch16-384, the train loss is 0 and the eval loss is NaN

When fine-tuning google/vit-base-patch16-384, the train loss is 0 and the eval loss is NaN. However, google/vit-base-patch16-224 can be fine-tuned correctly, which confuses me.
I have checked my data and the processor. I'm sure the processor loaded from ViTImageProcessor is correct and that the resolution of my image data is 384x384.
I have tried bf16, fp16, and also fp32 for training, but none of them work.
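For reference, this is roughly how I checked the processor output (a minimal sketch; the image path is just a placeholder for one of my training images):

from PIL import Image
from transformers import ViTImageProcessor

processor = ViTImageProcessor.from_pretrained("google/vit-base-patch16-384")

# placeholder path: any one of my 384x384 training images
image = Image.open("data/example.jpg").convert("RGB")

inputs = processor(images=image, return_tensors="pt")
print(inputs["pixel_values"].shape)  # expected: torch.Size([1, 3, 384, 384])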
Here is my TrainingArguments:

from transformers import TrainingArguments

training_args = TrainingArguments(
  run_name='vit-base-patch16-384-ToGenus',
  output_dir="./check_point",
  remove_unused_columns=False,  # keep the dataset columns my collator needs
  per_device_train_batch_size=16,
  per_device_eval_batch_size=16,
  evaluation_strategy="steps",
  save_strategy="steps",
  logging_strategy="steps",
  logging_steps=10,
  eval_steps=10,
  num_train_epochs=10,
  bf16=True,
  learning_rate=5e-8,
  save_total_limit=2,
  load_best_model_at_end=True,
  metric_for_best_model='accuracy',
)
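For completeness, here is roughly how I build the Trainer around these arguments (a sketch; normalize_ds, collate_fn, compute_metrics, and the label count are placeholders standing for my actual dataset splits, image collator, accuracy metric function, and number of genus classes):

from transformers import Trainer, ViTForImageClassification

num_labels = 42  # placeholder: my actual number of genus classes

model = ViTForImageClassification.from_pretrained(
  "google/vit-base-patch16-384",
  num_labels=num_labels,
  ignore_mismatched_sizes=True,  # the pretrained classification head is replaced by a fresh one
)

trainer = Trainer(
  model=model,
  args=training_args,
  train_dataset=normalize_ds["train"],  # preprocessed dataset splits
  eval_dataset=normalize_ds["test"],
  data_collator=collate_fn,             # stacks pixel_values and labels into tensors
  compute_metrics=compute_metrics,      # returns {"accuracy": ...}
)
trainer.train()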

How can I solve this? :smiling_face_with_tear:

Hmm interesting, thanks for reporting.

I'd recommend checking whether the model is able to overfit (i.e. achieve 100% accuracy on) as few as 2 examples. If it can, that shows the issue is not on the modeling side.

As explained in this guide to debugging neural networks: A Recipe for Training Neural Networks
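A minimal sketch of that sanity check, reusing the placeholders from the post above (model, normalize_ds, collate_fn, compute_metrics): restrict training to 2 examples, train long enough to memorize them, and evaluate on those same 2 examples.

from transformers import Trainer, TrainingArguments

tiny_ds = normalize_ds["train"].select(range(2))  # just 2 training examples

overfit_args = TrainingArguments(
  output_dir="./overfit_check",
  remove_unused_columns=False,
  per_device_train_batch_size=2,
  num_train_epochs=200,   # plenty of updates to memorize 2 samples
  learning_rate=5e-5,
  logging_steps=10,
  report_to="none",
)

overfit_trainer = Trainer(
  model=model,
  args=overfit_args,
  train_dataset=tiny_ds,
  eval_dataset=tiny_ds,   # evaluate on the same 2 examples
  data_collator=collate_fn,
  compute_metrics=compute_metrics,
)
overfit_trainer.train()
print(overfit_trainer.evaluate())  # expect accuracy close to 1.0 if the modeling side is fine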

I have also experienced this issue, regardless of the model. I think it is more of an issue related to the Trainer in transformers.

Update: I see you’re using a learning rate of 5e-8. That’s too low. I’d try higher learning rates like 2e-4, 3e-4, 5e-5.

Well, it is true that the learning rate is too low. But in my case, even when the learning rate is 2e-4, it sometimes appears as follows:

{'eval_loss': nan, 'eval_bbox_AP50:95': 0.0, 'eval_bbox_AP50': 0.0, 'eval_runtime': 9.4981, 'eval_samples_per_second': 4.106, 'eval_steps_per_second': 1.053, 'epoch': 0.97}
8%|███████▍ | 19/228 [01:23<07:47, 2.24s/it]
Checkpoint destination directory ./output/checkpoint-19 already exists and is non-empty. Saving will proceed but saved results may be invalid.
Checkpoint destination directory ./output/checkpoint-19 already exists and is non-empty. Saving will proceed but saved results may be invalid.
Checkpoint destination directory ./output/checkpoint-19 already exists and is non-empty. Saving will proceed but saved results may be invalid.
Checkpoint destination directory ./output/checkpoint-19 already exists and is non-empty. Saving will proceed but saved results may be invalid.
Removed shared tensor {'class_embed.4.bias', 'class_embed.6.bias', 'class_embed.3.weight', 'class_embed.1.weight', 'class_embed.2.bias', 'class_embed.2.weight', 'class_embed.4.weight', 'class_embed.5.weight', 'class_embed.0.weight', 'class_embed.6.weight', 'class_embed.5.bias', 'class_embed.3.bias', 'class_embed.0.bias', 'class_embed.1.bias'} while saving. This should be OK, but check by verifying that you don't receive any warning while reloading
9%|███████▊ | 20/228 [01:36<29:33, 8.53s/it]
INFO:model.api:patch_model_train_progress: {'training_stage': 'training', 'progress': 8.68}
INFO:model.api:patch_model_train_progress: {'training_stage': 'training', 'progress': 8.68}
INFO:model.api:patch_model_train_progress: {'training_stage': 'training', 'progress': 8.68}
INFO:model.api:patch_model_train_progress: {'training_stage': 'training', 'progress': 8.68}
{'loss': 0.0, 'learning_rate': 0.0002, 'epoch': 1.03}
11%|█████████▊ | 25/228 [01:45<10:03, 2.97s/it]
INFO:model.api:patch_model_train_progress: {'training_stage': 'training', 'progress': 10.86}
INFO:model.api:patch_model_train_progress: {'training_stage': 'training', 'progress': 10.86}
INFO:model.api:patch_model_train_progress: {'training_stage': 'training', 'progress': 10.86}
INFO:model.api:patch_model_train_progress: {'training_stage': 'training', 'progress': 10.86}
{'loss': 0.0, 'learning_rate': 0.0002, 'epoch': 1.28}

Thank you for your advice. To check whether the model can overfit my dataset, I used trainer.evaluate(eval_dataset=normalize_ds["train"]) to obtain the train accuracy, and it's very low (0.018), similar to the eval accuracy.
Regarding the learning rates you mentioned, I have tried 2e-3, 2e-4, 5e-5, and 5e-6; the train loss is still 0 and the eval loss is NaN. :sob:
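For context, a compute_metrics that reports accuracy in this setup would look roughly like the standard one below (a sketch using the Hugging Face evaluate library):

import numpy as np
import evaluate

accuracy = evaluate.load("accuracy")

def compute_metrics(eval_pred):
  logits, labels = eval_pred
  predictions = np.argmax(logits, axis=-1)
  return accuracy.compute(predictions=predictions, references=labels)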

What’s your Transformers version? Could you provide a reproducer?

I recommend you try plain float32! In my case, accelerate with DDP and --mixed_precision fp16 only rarely gives train loss 0 and eval loss NaN. The key point is that it happens rarely… :frowning:
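Concretely, that means disabling mixed precision in the TrainingArguments (a partial sketch; the other arguments stay as before):

from transformers import TrainingArguments

training_args = TrainingArguments(
  output_dir="./check_point",
  # ... other arguments unchanged ...
  bf16=False,  # no bfloat16 autocast
  fp16=False,  # no float16 autocast; everything runs in plain float32
)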

  • transformers version: 4.37.0.dev0
  • Platform: Linux-5.4.0-153-generic-x86_64-with-glibc2.31
  • Python version: 3.10.13
  • Huggingface_hub version: 0.20.2
  • Safetensors version: 0.4.1
  • Accelerate version: 0.26.1
  • Accelerate config: not found
  • PyTorch version (GPU?): 2.1.2 (True)
  • Tensorflow version (GPU?): not installed (NA)
  • Flax version (CPU?/GPU?/TPU?): not installed (NA)
  • Jax version: not installed
  • JaxLib version: not installed
  • Using GPU in script?:
  • Using distributed or parallel set-up in script?:

Here is my sys info:

- `transformers` version: 4.36.2
- Platform: Linux-5.15.0-87-generic-x86_64-with-glibc2.31
- Python version: 3.10.13
- Huggingface_hub version: 0.20.2
- Safetensors version: 0.4.1
- Accelerate version: 0.26.1
- PyTorch version: 2.1.2+cu121 (True)

I observed the same when fine-tuning convnextv2. The code was working before updating transformers. I thought setting fp16 to False would fix it, but it still happened intermittently. Reverting to transformers version 4.33.0 has fixed it for me.