Hua-Jiu
January 17, 2024, 12:47pm
1
When fine-tuning google/vit-base-patch16-384, the train loss is 0 and the eval loss is NaN. However, google/vit-base-patch16-224 can be fine-tuned correctly. That confuses me.
I have checked my data and the processor. I'm sure the processor loaded from ViTImageProcessor is correct, and the resolution of my image data is 384x384.
I have tried bf16, fp16, and also fp32 for training, but it's not working.
Here is my TrainingArguments:
training_args = TrainingArguments(
    run_name='vit-base-patch16-384-ToGenus',
    output_dir="./check_point",
    remove_unused_columns=False,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    evaluation_strategy="steps",
    save_strategy="steps",
    logging_strategy="steps",
    logging_steps=10,
    eval_steps=10,
    num_train_epochs=10,
    bf16=True,
    learning_rate=5e-8,
    save_total_limit=2,
    load_best_model_at_end=True,
    metric_for_best_model='accuracy',
)
How can I solve it?
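For reference, the model and processor are loaded roughly like this (a minimal sketch, assuming the standard ViTForImageClassification head; num_labels is a placeholder for my actual number of genus classes):

from transformers import ViTImageProcessor, ViTForImageClassification

checkpoint = "google/vit-base-patch16-384"
num_labels = 10  # placeholder: number of genus classes in my dataset

# The processor resizes/normalizes images to the 384x384 resolution this checkpoint expects
processor = ViTImageProcessor.from_pretrained(checkpoint)

# A fresh classification head is attached on top of the pretrained backbone
model = ViTForImageClassification.from_pretrained(
    checkpoint,
    num_labels=num_labels,
    ignore_mismatched_sizes=True,
)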
nielsr
January 17, 2024, 6:19pm
2
Hmm interesting, thanks for reporting.
I'd recommend trying to see if the model is able to overfit (i.e. achieve 100% accuracy) on as few as 2 examples. If it can, then it shows the issue is not on the modeling side.
As explained in this guide to debugging neural networks: A Recipe for Training Neural Networks.
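A minimal sketch of that sanity check with the Trainer could look like this (assuming the model, a data collator collate_fn, and the training dataset train_ds from your setup already exist; those names are placeholders):

from transformers import Trainer, TrainingArguments

# Keep only 2 examples and try to drive the training loss on them to ~0
tiny_ds = train_ds.select([0, 1])

overfit_args = TrainingArguments(
    output_dir="./overfit_check",
    per_device_train_batch_size=2,
    num_train_epochs=100,
    learning_rate=2e-4,
    logging_steps=10,
    remove_unused_columns=False,
    report_to="none",
)

trainer = Trainer(
    model=model,
    args=overfit_args,
    train_dataset=tiny_ds,
    data_collator=collate_fn,
)
trainer.train()

# If the loss on these 2 examples does not reach ~0, the problem is likely
# in the data/label pipeline or the training setup rather than in the model.
print(trainer.evaluate(eval_dataset=tiny_ds))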
sbchoi
January 18, 2024, 7:19am
3
I have also experienced this issue regardless of the model. I think it is more of an issue related to the Trainer in transformers.
nielsr
January 18, 2024, 11:03am
4
Update: I see you're using a learning rate of 5e-8. That's too low. I'd try higher learning rates like 2e-4, 3e-4, 5e-5.
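Concretely, that is just a change to the learning_rate field in the TrainingArguments from the first post (a sketch, other arguments unchanged):

from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="./check_point",
    # ... all other arguments as in the original post ...
    learning_rate=2e-4,  # instead of 5e-8
)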
sbchoi
January 18, 2024, 12:32pm
5
Well, it is true that the learning rate is too low. But in my case, even with a learning rate of 2e-4, it sometimes appears as follows:
{'eval_loss': nan, 'eval_bbox_AP50:95': 0.0, 'eval_bbox_AP50': 0.0, 'eval_runtime': 9.4981, 'eval_samples_per_second': 4.106, 'eval_steps_per_second': 1.053, 'epoch': 0.97}
8%|▊         | 19/228 [01:23<07:47, 2.24s/it]
Checkpoint destination directory ./output/checkpoint-19 already exists and is non-empty. Saving will proceed but saved results may be invalid.
Removed shared tensor {'class_embed.4.bias', 'class_embed.6.bias', 'class_embed.3.weight', 'class_embed.1.weight', 'class_embed.2.bias', 'class_embed.2.weight', 'class_embed.4.weight', 'class_embed.5.weight', 'class_embed.0.weight', 'class_embed.6.weight', 'class_embed.5.bias', 'class_embed.3.bias', 'class_embed.0.bias', 'class_embed.1.bias'} while saving. This should be OK, but check by verifying that you don't receive any warning while reloading
9%|▉         | 20/228 [01:36<29:33, 8.53s/it]
INFO:model.api:patch_model_train_progress: {'training_stage': 'training', 'progress': 8.68}
{'loss': 0.0, 'learning_rate': 0.0002, 'epoch': 1.03}
11%|█         | 25/228 [01:45<10:03, 2.97s/it]
INFO:model.api:patch_model_train_progress: {'training_stage': 'training', 'progress': 10.86}
{'loss': 0.0, 'learning_rate': 0.0002, 'epoch': 1.28}
Thank you for your advice. To confirm whether the model overfits my dataset, I used trainer.evaluate(eval_dataset=normalize_ds['train']) to obtain the train accuracy, and it's very low (0.018), similar to the eval accuracy.
Regarding the learning rates you mentioned, I have tried 2e-3, 2e-4, 5e-5, and 5e-6; the train loss is still 0 and the eval loss is NaN.
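For reference, the train accuracy above comes from re-using the Trainer's evaluation loop on the train split, with an accuracy metric roughly like this (a sketch; normalize_ds is my preprocessed dataset and trainer is the Trainer built with the arguments above):

import numpy as np
import evaluate

accuracy = evaluate.load("accuracy")

# Passed to Trainer(..., compute_metrics=compute_metrics) when the trainer is built
def compute_metrics(eval_pred):
    logits, labels = eval_pred
    preds = np.argmax(logits, axis=-1)
    return accuracy.compute(predictions=preds, references=labels)

# Evaluate on the *train* split to check whether the model is overfitting;
# here the reported accuracy is ~0.018, about the same as on the eval split.
train_metrics = trainer.evaluate(eval_dataset=normalize_ds["train"])
print(train_metrics)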
nielsr
January 18, 2024, 4:08pm
7
What's your Transformers version? Could you provide a reproducer?
sbchoi
January 18, 2024, 11:36pm
8
I recommend trying just float32! In my case, accelerate + DDP with --mixed_precision fp16 only rarely gives train loss 0 and eval loss NaN. The key point is "rarely"…
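Concretely, a pure-float32 run just means leaving both mixed-precision flags off in the TrainingArguments (a sketch, other arguments as in the first post) and dropping --mixed_precision fp16 from the accelerate launch:

from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="./check_point",
    # ... other arguments as in the original post ...
    fp16=False,  # no float16 mixed precision
    bf16=False,  # no bfloat16 either, i.e. plain float32 training
)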
- `transformers` version: 4.37.0.dev0
- Platform: Linux-5.4.0-153-generic-x86_64-with-glibc2.31
- Python version: 3.10.13
- Huggingface_hub version: 0.20.2
- Safetensors version: 0.4.1
- Accelerate version: 0.26.1
- Accelerate config: not found
- PyTorch version (GPU?): 2.1.2 (True)
- Tensorflow version (GPU?): not installed (NA)
- Flax version (CPU?/GPU?/TPU?): not installed (NA)
- Jax version: not installed
- JaxLib version: not installed
- Using GPU in script?:
- Using distributed or parallel set-up in script?:
Here is my sys info:
- `transformers` version: 4.36.2
- Platform: Linux-5.15.0-87-generic-x86_64-with-glibc2.31
- Python version: 3.10.13
- Huggingface_hub version: 0.20.2
- Safetensors version: 0.4.1
- Accelerate version: 0.26.1
- PyTorch version: 2.1.2+cu121 (True)
sitwala
January 19, 2024, 5:22pm
10
I observed the same when fine-tuning convnextv2. The code was working before updating transformers. I thought setting fp16 to False would fix it, but it still happened intermittently. Reverting to transformers version 4.33.0 has fixed it for me.