I'm trying to fine-tune Llama 2, but I get the same error with different models. I am using the default settings from the HF Docker image. Fine-tuning starts and then stops after the third entry. Here is an example of the dataset: peshkatari/autotrain-data-test-data (Datasets at Hugging Face). The log output follows:
/app/env/lib/python3.10/site-packages/torch/utils/checkpoint.py:429: UserWarning: torch.utils.checkpoint: please pass in use_reentrant=True or use_reentrant=False explicitly. The default value of use_reentrant will be updated to be False in the future. To maintain current behavior, pass use_reentrant=True. It is recommended that you use use_reentrant=False. Refer to docs for more details on the differences between the two variants.
warnings.warn(
2%|▏ | 1/66 [00:03<03:44, 3.45s/it]
3%|▎ | 2/66 [00:05<02:35, 2.43s/it]/app/env/lib/python3.10/site-packages/torch/utils/checkpoint.py:429: UserWarning: torch.utils.checkpoint: please pass in use_reentrant=True or use_reentrant=False explicitly. The default value of use_reentrant will be updated to be False in the future. To maintain current behavior, pass use_reentrant=True. It is recommended that you use use_reentrant=False. Refer to docs for more details on the differences between the two variants.
warnings.warn(
{'loss': 1.6262, 'learning_rate': 1.7142857142857142e-05, 'epoch': 1.09}
5%|▍ | 3/66 [00:08<02:50, 2.71s/it]
6%|▌ | 4/66 [00:09<02:23, 2.31s/it]
6%|▌ | 4/66 [00:09<02:23, 2.31s/it]/app/env/lib/python3.10/site-packages/torch/utils/checkpoint.py:429: UserWarning: torch.utils.checkpoint: please pass in use_reentrant=True or use_reentrant=False explicitly. The default value of use_reentrant will be updated to be False in the future. To maintain current behavior, pass use_reentrant=True. It is recommended that you use use_reentrant=False. Refer to docs for more details on the differences between the two variants.
warnings.warn(
{'train_runtime': 14.9248, 'train_samples_per_second': 8.643, 'train_steps_per_second': 4.422, 'train_loss': 1.612555980682373, 'epoch': 2.09}
8%|▊ | 5/66 [00:12<02:37, 2.59s/it]
9%|▉ | 6/66 [00:14<02:17, 2.29s/it]
9%|▉ | 6/66 [00:14<02:17, 2.29s/it]
9%|▉ | 6/66 [00:14<02:29, 2.49s/it]
🚀 INFO | 2024-01-02 15:12:53 | __main__:train:477 - Finished training, saving model...
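
To make the log above easier to read, here is a minimal sketch that pulls out the metric dictionaries and the last progress counter. It assumes the log lines are available as plain strings. `LOG_LINES`, `parse_metrics`, and `last_progress` are hypothetical names used only for illustration and are not part of AutoTrain or any library.

```python
import ast
import re

# Three lines copied from the log above (hypothetical variable name).
LOG_LINES = [
    "{'loss': 1.6262, 'learning_rate': 1.7142857142857142e-05, 'epoch': 1.09}",
    "{'train_runtime': 14.9248, 'train_samples_per_second': 8.643, "
    "'train_steps_per_second': 4.422, 'train_loss': 1.612555980682373, 'epoch': 2.09}",
    " 9%|▉ | 6/66 [00:14<02:29, 2.49s/it]",
]


def parse_metrics(lines):
    """Return every line that is a Python dict literal, parsed into a dict."""
    metrics = []
    for line in lines:
        line = line.strip()
        if line.startswith("{") and line.endswith("}"):
            # literal_eval only evaluates literals, so it is safe on log text.
            metrics.append(ast.literal_eval(line))
    return metrics


def last_progress(lines):
    """Return (done, total) from the last tqdm-style 'N/M [' counter, or None."""
    result = None
    for line in lines:
        match = re.search(r"(\d+)/(\d+) \[", line)
        if match:
            result = (int(match.group(1)), int(match.group(2)))
    return result


if __name__ == "__main__":
    metrics = parse_metrics(LOG_LINES)
    done, total = last_progress(LOG_LINES)
    print(f"steps completed: {done} of {total}")
    print(f"last logged epoch: {metrics[-1]['epoch']}")
```

On the lines above, this reports 6 of 66 steps completed and a last logged epoch of 2.09, which are the two numbers that show where the run ended.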