Llama 2 (meta-llama/Llama-2-7b-hf) fine-tuning

Below are the logs from fine-tuning the model.
After some time I got an error. Please look at the log and suggest an appropriate solution.
Thanks

Loading checkpoint shards: 0%| | 0/2 [00:00<?, ?it/s]
Loading checkpoint shards: 50%|█████ | 1/2 [00:03<00:03, 3.15s/it]
Loading checkpoint shards: 100%|██████████| 2/2 [00:03<00:00, 1.62s/it]
Loading checkpoint shards: 100%|██████████| 2/2 [00:03<00:00, 1.85s/it]
/app/env/lib/python3.9/site-packages/transformers/utils/hub.py:374: FutureWarning: The use_auth_token argument is deprecated and will be removed in v5 of Transformers.
warnings.warn(
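If I understand that FutureWarning correctly, recent transformers releases accept a token argument in place of use_auth_token. A rough sketch of what the updated call would look like (the token value is a placeholder, not my real one):

from transformers import AutoModelForCausalLM

# Sketch only: newer transformers versions take `token` instead of the
# deprecated `use_auth_token`. "hf_xxx" is a placeholder for an access token.
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",
    token="hf_xxx",  # previously: use_auth_token="hf_xxx"
)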

Downloading (…)neration_config.json: 0%| | 0.00/188 [00:00<?, ?B/s]
Downloading (…)neration_config.json: 100%|██████████| 188/188 [00:00<00:00, 87.2kB/s]

INFO Using block size 1024
INFO creating trainer

0%| | 0/228 [00:00<?, ?it/s]You’re using a LlamaTokenizerFast tokenizer. Please note that with a fast tokenizer, using the __call__ method is faster than using a method to encode the text followed by a call to the pad method to get a padded encoding.
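The tokenizer notice above seems to be just a performance hint: with LlamaTokenizerFast, a single call to the tokenizer encodes and pads in one pass. A small illustrative sketch (texts are made up; 1024 matches the block size reported in the log):

from transformers import AutoTokenizer

# Illustrative only: one tokenizer call encodes and pads together, as the notice suggests.
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")  # gated repo, may need token="hf_xxx"
tokenizer.pad_token = tokenizer.eos_token  # Llama ships without a pad token
batch = tokenizer(
    ["first training example", "second training example"],  # placeholder texts
    padding=True,
    truncation=True,
    max_length=1024,  # block size from the log above
    return_tensors="pt",
)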
/app/env/lib/python3.9/site-packages/torch/utils/checkpoint.py:429: UserWarning: torch.utils.checkpoint: please pass in use_reentrant=True or use_reentrant=False explicitly. The default value of use_reentrant will be updated to be False in the future. To maintain current behavior, pass use_reentrant=True. It is recommended that you use use_reentrant=False. Refer to docs for more details on the differences between the two variants.
warnings.warn(
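The use_reentrant warning, as far as I can tell, can be silenced by passing the flag through the training arguments, provided the installed transformers version exposes gradient_checkpointing_kwargs. A sketch, not the exact AutoTrain configuration:

from transformers import TrainingArguments

# Sketch only: assumes a transformers release with gradient_checkpointing_kwargs.
# Passing use_reentrant explicitly stops torch from warning about the future default change.
args = TrainingArguments(
    output_dir="out",
    gradient_checkpointing=True,
    gradient_checkpointing_kwargs={"use_reentrant": False},
)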

0%| | 1/228 [00:04<16:11, 4.28s/it]
1%| | 2/228 [00:07<13:34, 3.60s/it]
1%|▏ | 3/228 [00:10<12:42, 3.39s/it]
2%|▏ | 4/228 [00:13<12:16, 3.29s/it]
2%|▏ | 5/228 [00:16<12:00, 3.23s/it]/app/env/lib/python3.9/site-packages/torch/utils/checkpoint.py:429: UserWarning: torch.utils.checkpoint: please pass in use_reentrant=True or use_reentrant=False explicitly. The default value of use_reentrant will be updated to be False in the future. To maintain current behavior, pass use_reentrant=True. It is recommended that you use use_reentrant=False. Refer to docs for more details on the differences between the two variants.
warnings.warn(

3%|▎ | 6/228 [00:20<12:04, 3.26s/it]
3%|▎ | 7/228 [00:23<11:51, 3.22s/it]
4%|▎ | 8/228 [00:26<11:41, 3.19s/it]
4%|▍ | 9/228 [00:29<11:34, 3.17s/it]
4%|▍ | 10/228 [00:32<11:28, 3.16s/it]/app/env/lib/python3.9/site-packages/torch/utils/checkpoint.py:429: UserWarning: torch.utils.checkpoint: please pass in use_reentrant=True or use_reentrant=False explicitly. The default value of use_reentrant will be updated to be False in the future. To maintain current behavior, pass use_reentrant=True. It is recommended that you use use_reentrant=False. Refer to docs for more details on the differences between the two variants.
warnings.warn(
{'loss': 1.4006, 'learning_rate': 0.00013043478260869567, 'epoch': 2.07}
{'train_runtime': 48.6969, 'train_samples_per_second': 9.364, 'train_steps_per_second': 4.682, 'train_loss': 1.4006022135416667, 'epoch': 2.07}

5%|▍ | 11/228 [00:35<11:36, 3.21s/it]
5%|▌ | 12/228 [00:39<11:28, 3.19s/it]
6%|▌ | 13/228 [00:42<11:21, 3.17s/it]
6%|▌ | 14/228 [00:45<11:16, 3.16s/it]
7%|▋ | 15/228 [00:48<11:11, 3.15s/it]

7%|▋ | 15/228 [00:48<11:11, 3.15s/it]

7%|▋ | 15/228 [00:48<11:11, 3.15s/it]
7%|▋ | 15/228 [00:48<11:31, 3.25s/it]

INFO Finished training, saving model…
INFO Pushing model to hub…

adapter_model.bin: 0%| | 0.00/33.6M [00:00<?, ?B/s]

adapter_model.bin: 0%| | 0.00/33.6M [00:00<?, ?B/s]

rng_state.pth: 0%| | 0.00/14.2k [00:00<?, ?B/s]

Upload 10 LFS files: 0%| | 0/10 [00:00<?, ?it/s]

optimizer.pt: 0%| | 0.00/67.2M [00:00<?, ?B/s]

scheduler.pt: 0%| | 0.00/1.06k [00:00<?, ?B/s]
adapter_model.bin: 0%| | 8.19k/33.6M [00:00<08:43, 64.2kB/s]

rng_state.pth: 58%|█████▊ | 8.19k/14.2k [00:00<00:00, 63.7kB/s]

optimizer.pt: 0%| | 8.19k/67.2M [00:00<17:24, 64.3kB/s]

adapter_model.bin: 0%| | 8.19k/33.6M [00:00<09:04, 61.7kB/s]

scheduler.pt: 100%|██████████| 1.06k/1.06k [00:00<00:00, 8.23kB/s]
scheduler.pt: 100%|██████████| 1.06k/1.06k [00:00<00:00, 5.47kB/s]

adapter_model.bin: 15%|█▍ | 4.90M/33.6M [00:00<00:01, 25.8MB/s]

optimizer.pt: 8%|▊ | 5.17M/67.2M [00:00<00:02, 27.3MB/s]

adapter_model.bin: 15%|█▌ | 5.14M/33.6M [00:00<00:01, 26.6MB/s]
rng_state.pth: 100%|██████████| 14.2k/14.2k [00:00<00:00, 56.3kB/s]

optimizer.pt: 16%|█▌ | 10.6M/67.2M [00:00<00:01, 39.0MB/s]
adapter_model.bin: 26%|██▌ | 8.71M/33.6M [00:00<00:00, 29.3MB/s]

adapter_model.bin: 26%|██▌ | 8.63M/33.6M [00:00<00:00, 29.0MB/s]

tokenizer.model: 0%| | 0.00/500k [00:00<?, ?B/s]

training_args.bin: 0%| | 0.00/4.54k [00:00<?, ?B/s]
tokenizer.model: 100%|██████████| 500k/500k [00:00<00:00, 7.76MB/s]

training_args.bin: 100%|██████████| 4.54k/4.54k [00:00<00:00, 111kB/s]

adapter_model.bin: 48%|████▊ | 16.0M/33.6M [00:00<00:00, 35.4MB/s]
adapter_model.bin: 48%|████▊ | 16.0M/33.6M [00:00<00:00, 33.2MB/s]

events.out.tfevents.1697452891.s-ravivishwakarmauzio-autotrain-o3pc-15wq-atys-0-c07f1-797z75jm.113.0: 0%| | 0.00/5.05k [00:00<?, ?B/s]

optimizer.pt: 24%|██▍ | 16.0M/67.2M [00:00<00:01, 30.0MB/s]
events.out.tfevents.1697452891.s-ravivishwakarmauzio-autotrain-o3pc-15wq-atys-0-c07f1-797z75jm.113.0: 100%|██████████| 5.05k/5.05k [00:00<00:00, 223kB/s]

tokenizer.model: 0%| | 0.00/500k [00:00<?, ?B/s]

adapter_model.bin: 81%|████████ | 27.2M/33.6M [00:00<00:00, 57.5MB/s]
adapter_model.bin: 68%|██████▊ | 22.7M/33.6M [00:00<00:00, 42.5MB/s]

optimizer.pt: 34%|███▎ | 22.7M/67.2M [00:00<00:01, 39.8MB/s]

training_args.bin: 0%| | 0.00/4.54k [00:00<?, ?B/s]
tokenizer.model: 100%|██████████| 500k/500k [00:00<00:00, 2.50MB/s]

training_args.bin: 100%|██████████| 4.54k/4.54k [00:00<00:00, 149kB/s]

adapter_model.bin: 99%|█████████▉| 33.4M/33.6M [00:00<00:00, 49.4MB/s]
adapter_model.bin: 100%|██████████| 33.6M/33.6M [00:00<00:00, 38.6MB/s]

optimizer.pt: 48%|████▊ | 32.0M/67.2M [00:00<00:00, 37.7MB/s]
adapter_model.bin: 95%|█████████▌| 32.0M/33.6M [00:01<00:00, 32.0MB/s]
adapter_model.bin: 100%|██████████| 33.6M/33.6M [00:01<00:00, 29.3MB/s]

Upload 10 LFS files: 10%|█ | 1/10 [00:01<00:12, 1.35s/it]

optimizer.pt: 71%|███████▏ | 48.0M/67.2M [00:01<00:00, 32.6MB/s]

optimizer.pt: 95%|█████████▌| 64.0M/67.2M [00:01<00:00, 43.8MB/s]
optimizer.pt: 100%|██████████| 67.2M/67.2M [00:01<00:00, 36.6MB/s]

Upload 10 LFS files: 30%|███ | 3/10 [00:01<00:04, 1.72it/s]
Upload 10 LFS files: 100%|██████████| 10/10 [00:01<00:00, 5.06it/s]

INFO Pausing space…
error: code = NotFound desc = an error occurred when try to find container "79ee03e91c012511e778c348a0fedd2d164fc1d394378d8d52fe2956d80219c0": not found

You can ignore that error. The training was successful. The progress bar is not a reliable indicator with SFT training.
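If you want to double-check the result, you can load the pushed adapter on top of the base model and run a quick generation, roughly like this (the adapter repo id is a placeholder for the one AutoTrain created for you):

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

# Load the base model, then apply the LoRA adapter that was pushed to the Hub.
base = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf", torch_dtype=torch.float16, device_map="auto"
)
model = PeftModel.from_pretrained(base, "your-username/your-autotrain-adapter")  # placeholder repo id
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")

inputs = tokenizer("Hello, how are you?", return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=50)
print(tokenizer.decode(output[0], skip_special_tokens=True))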

This took only 10 minutes. Has the Llama 2 model really been fine-tuned within 10 minutes?