Llama 2 (meta-llama/Llama-2-7b-hf) fine-tuning

Below are the logs from fine-tuning the model.
After some time I got an error. Please look at the log and suggest an appropriate solution.
Thanks

Loading checkpoint shards: 0%| | 0/2 [00:00<?, ?it/s]
Loading checkpoint shards: 50%|█████ | 1/2 [00:03<00:03, 3.15s/it]
Loading checkpoint shards: 100%|██████████| 2/2 [00:03<00:00, 1.62s/it]
Loading checkpoint shards: 100%|██████████| 2/2 [00:03<00:00, 1.85s/it]
/app/env/lib/python3.9/site-packages/transformers/utils/hub.py:374: FutureWarning: The use_auth_token argument is deprecated and will be removed in v5 of Transformers.
warnings.warn(
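If I understand that FutureWarning correctly, recent transformers releases accept a token argument in place of use_auth_token. A rough sketch of what the updated call would look like (the token value is a placeholder, not my real one):

from transformers import AutoModelForCausalLM

# Sketch only: newer transformers versions take `token` instead of the
# deprecated `use_auth_token`. "hf_xxx" is a placeholder for an access token.
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",
    token="hf_xxx",  # previously: use_auth_token="hf_xxx"
)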

Downloading (…)neration_config.json: 0%| | 0.00/188 [00:00<?, ?B/s]
Downloading (…)neration_config.json: 100%|██████████| 188/188 [00:00<00:00, 87.2kB/s]

INFO Using block size 1024
INFO creating trainer

0%| | 0/228 [00:00<?, ?it/s]You’re using a LlamaTokenizerFast tokenizer. Please note that with a fast tokenizer, using the __call__ method is faster than using a method to encode the text followed by a call to the pad method to get a padded encoding.
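The tokenizer notice above seems to be just a performance hint: with LlamaTokenizerFast, a single call to the tokenizer encodes and pads in one pass. A small illustrative sketch (texts are made up; 1024 matches the block size reported in the log):

from transformers import AutoTokenizer

# Illustrative only: one tokenizer call encodes and pads together, as the notice suggests.
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")  # gated repo, may need token="hf_xxx"
tokenizer.pad_token = tokenizer.eos_token  # Llama ships without a pad token
batch = tokenizer(
    ["first training example", "second training example"],  # placeholder texts
    padding=True,
    truncation=True,
    max_length=1024,  # block size from the log above
    return_tensors="pt",
)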
/app/env/lib/python3.9/site-packages/torch/utils/checkpoint.py:429: UserWarning: torch.utils.checkpoint: please pass in use_reentrant=True or use_reentrant=False explicitly. The default value of use_reentrant will be updated to be False in the future. To maintain current behavior, pass use_reentrant=True. It is recommended that you use use_reentrant=False. Refer to docs for more details on the differences between the two variants.
warnings.warn(
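The use_reentrant warning, as far as I can tell, can be silenced by passing the flag through the training arguments, provided the installed transformers version exposes gradient_checkpointing_kwargs. A sketch, not the exact AutoTrain configuration:

from transformers import TrainingArguments

# Sketch only: assumes a transformers release with gradient_checkpointing_kwargs.
# Passing use_reentrant explicitly stops torch from warning about the future default change.
args = TrainingArguments(
    output_dir="out",
    gradient_checkpointing=True,
    gradient_checkpointing_kwargs={"use_reentrant": False},
)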

0%| | 1/228 [00:04<16:11, 4.28s/it]
1%| | 2/228 [00:07<13:34, 3.60s/it]
1%|▏ | 3/228 [00:10<12:42, 3.39s/it]
2%|▏ | 4/228 [00:13<12:16, 3.29s/it]
2%|▏ | 5/228 [00:16<12:00, 3.23s/it]/app/env/lib/python3.9/site-packages/torch/utils/checkpoint.py:429: UserWarning: torch.utils.checkpoint: please pass in use_reentrant=True or use_reentrant=False explicitly. The default value of use_reentrant will be updated to be False in the future. To maintain current behavior, pass use_reentrant=True. It is recommended that you use use_reentrant=False. Refer to docs for more details on the differences between the two variants.
warnings.warn(

3%|▎ | 6/228 [00:20<12:04, 3.26s/it]
3%|▎ | 7/228 [00:23<11:51, 3.22s/it]
4%|▎ | 8/228 [00:26<11:41, 3.19s/it]
4%|▍ | 9/228 [00:29<11:34, 3.17s/it]
4%|▍ | 10/228 [00:32<11:28, 3.16s/it]/app/env/lib/python3.9/site-packages/torch/utils/checkpoint.py:429: UserWarning: torch.utils.checkpoint: please pass in use_reentrant=True or use_reentrant=False explicitly. The default value of use_reentrant will be updated to be False in the future. To maintain current behavior, pass use_reentrant=True. It is recommended that you use use_reentrant=False. Refer to docs for more details on the differences between the two variants.
warnings.warn(
{'loss': 1.4006, 'learning_rate': 0.00013043478260869567, 'epoch': 2.07}
{'train_runtime': 48.6969, 'train_samples_per_second': 9.364, 'train_steps_per_second': 4.682, 'train_loss': 1.4006022135416667, 'epoch': 2.07}

5%|▍ | 11/228 [00:35<11:36, 3.21s/it]
5%|▌ | 12/228 [00:39<11:28, 3.19s/it]
6%|▌ | 13/228 [00:42<11:21, 3.17s/it]
6%|▌ | 14/228 [00:45<11:16, 3.16s/it]
7%|▋ | 15/228 [00:48<11:11, 3.15s/it]

7%|▋ | 15/228 [00:48<11:11, 3.15s/it]

7%|▋ | 15/228 [00:48<11:11, 3.15s/it]
7%|▋ | 15/228 [00:48<11:31, 3.25s/it]

INFO Finished training, saving model…
INFO Pushing model to hub…

adapter_model.bin: 0%| | 0.00/33.6M [00:00<?, ?B/s]

adapter_model.bin: 0%| | 0.00/33.6M [00:00<?, ?B/s]

rng_state.pth: 0%| | 0.00/14.2k [00:00<?, ?B/s]

Upload 10 LFS files: 0%| | 0/10 [00:00<?, ?it/s]

optimizer.pt: 0%| | 0.00/67.2M [00:00<?, ?B/s]

scheduler.pt: 0%| | 0.00/1.06k [00:00<?, ?B/s]
adapter_model.bin: 0%| | 8.19k/33.6M [00:00<08:43, 64.2kB/s]

rng_state.pth: 58%|█████▊ | 8.19k/14.2k [00:00<00:00, 63.7kB/s]

optimizer.pt: 0%| | 8.19k/67.2M [00:00<17:24, 64.3kB/s]

adapter_model.bin: 0%| | 8.19k/33.6M [00:00<09:04, 61.7kB/s]

scheduler.pt: 100%|██████████| 1.06k/1.06k [00:00<00:00, 8.23kB/s]
scheduler.pt: 100%|██████████| 1.06k/1.06k [00:00<00:00, 5.47kB/s]

adapter_model.bin: 15%|█▍ | 4.90M/33.6M [00:00<00:01, 25.8MB/s]

optimizer.pt: 8%|▊ | 5.17M/67.2M [00:00<00:02, 27.3MB/s]

adapter_model.bin: 15%|█▌ | 5.14M/33.6M [00:00<00:01, 26.6MB/s]
rng_state.pth: 100%|██████████| 14.2k/14.2k [00:00<00:00, 56.3kB/s]

optimizer.pt: 16%|█▌ | 10.6M/67.2M [00:00<00:01, 39.0MB/s]
adapter_model.bin: 26%|██▌ | 8.71M/33.6M [00:00<00:00, 29.3MB/s]

adapter_model.bin: 26%|██▌ | 8.63M/33.6M [00:00<00:00, 29.0MB/s]

tokenizer.model: 0%| | 0.00/500k [00:00<?, ?B/s]

training_args.bin: 0%| | 0.00/4.54k [00:00<?, ?B/s]
tokenizer.model: 100%|██████████| 500k/500k [00:00<00:00, 7.76MB/s]

training_args.bin: 100%|██████████| 4.54k/4.54k [00:00<00:00, 111kB/s]

adapter_model.bin: 48%|████▊ | 16.0M/33.6M [00:00<00:00, 35.4MB/s]
adapter_model.bin: 48%|████▊ | 16.0M/33.6M [00:00<00:00, 33.2MB/s]

events.out.tfevents.1697452891.s-ravivishwakarmauzio-autotrain-o3pc-15wq-atys-0-c07f1-797z75jm.113.0: 0%| | 0.00/5.05k [00:00<?, ?B/s]

optimizer.pt: 24%|██▍ | 16.0M/67.2M [00:00<00:01, 30.0MB/s]
events.out.tfevents.1697452891.s-ravivishwakarmauzio-autotrain-o3pc-15wq-atys-0-c07f1-797z75jm.113.0: 100%|██████████| 5.05k/5.05k [00:00<00:00, 223kB/s]

tokenizer.model: 0%| | 0.00/500k [00:00<?, ?B/s]

adapter_model.bin: 81%|████████ | 27.2M/33.6M [00:00<00:00, 57.5MB/s]
adapter_model.bin: 68%|██████▊ | 22.7M/33.6M [00:00<00:00, 42.5MB/s]

optimizer.pt: 34%|███▎ | 22.7M/67.2M [00:00<00:01, 39.8MB/s]

training_args.bin: 0%| | 0.00/4.54k [00:00<?, ?B/s]
tokenizer.model: 100%|██████████| 500k/500k [00:00<00:00, 2.50MB/s]

training_args.bin: 100%|██████████| 4.54k/4.54k [00:00<00:00, 149kB/s]

adapter_model.bin: 99%|█████████▉| 33.4M/33.6M [00:00<00:00, 49.4MB/s]
adapter_model.bin: 100%|██████████| 33.6M/33.6M [00:00<00:00, 38.6MB/s]

optimizer.pt: 48%|████▊ | 32.0M/67.2M [00:00<00:00, 37.7MB/s]
adapter_model.bin: 95%|█████████▌| 32.0M/33.6M [00:01<00:00, 32.0MB/s]
adapter_model.bin: 100%|██████████| 33.6M/33.6M [00:01<00:00, 29.3MB/s]

Upload 10 LFS files: 10%|█ | 1/10 [00:01<00:12, 1.35s/it]

optimizer.pt: 71%|███████▏ | 48.0M/67.2M [00:01<00:00, 32.6MB/s]

optimizer.pt: 95%|█████████▌| 64.0M/67.2M [00:01<00:00, 43.8MB/s]
optimizer.pt: 100%|██████████| 67.2M/67.2M [00:01<00:00, 36.6MB/s]

Upload 10 LFS files: 30%|███ | 3/10 [00:01<00:04, 1.72it/s]
Upload 10 LFS files: 100%|██████████| 10/10 [00:01<00:00, 5.06it/s]

INFO Pausing space…
error: code = NotFound desc = an error occurred when try to find container "79ee03e91c012511e778c348a0fedd2d164fc1d394378d8d52fe2956d80219c0": not found

You can ignore that error. The training was successful. The progress bar is not a reliable indicator with SFT training.
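If you want to double-check the result, you can load the pushed adapter on top of the base model and run a quick generation, roughly like this (the adapter repo id is a placeholder for the one AutoTrain created for you):

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

# Load the base model, then apply the LoRA adapter that was pushed to the Hub.
base = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf", torch_dtype=torch.float16, device_map="auto"
)
model = PeftModel.from_pretrained(base, "your-username/your-autotrain-adapter")  # placeholder repo id
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")

inputs = tokenizer("Hello, how are you?", return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=50)
print(tokenizer.decode(output[0], skip_special_tokens=True))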

This took only 10 minutes. Has the Llama 2 model really been fine-tuned within 10 minutes?