Hi All,
I have been following the tutorial "Non-engineers guide: Train a LLaMA 2 chatbot"
to fine-tune an LLM (meta/opt125m) with my own data.
The training starts well, but always fails somewhere between 3% and 10%. I cannot see any error message that I could use to find the issue and correct it. The Space running the training then goes into “Paused” mode.
Could you help me figure out what the issue is?
Any pointers would be greatly appreciated, many thanks!
I have pasted the logs from two training runs below. Each excerpt is the end of the log file, just before the instance was paused by the platform.
1st one:
300/11676 [01:01<38:19, 4.95it/s]
3%|▎ | 301/11676 [01:02<38:17, 4.95it/s]
3%|▎ | 302/11676 [01:02<38:17, 4.95it/s]
3%|▎ | 303/11676 [01:02<38:17, 4.95it/s]
3%|▎ | 304/11676 [01:02<38:18, 4.95it/s]
3%|▎ | 305/11676 [01:02<38:17, 4.95it/s]
3%|▎ | 306/11676 [01:03<38:15, 4.95it/s]
3%|▎ | 307/11676 [01:03<38:14, 4.95it/s]
3%|▎ | 308/11676 [01:03<38:14, 4.95it/s]
3%|▎ | 309/11676 [01:03<38:15, 4.95it/s]
3%|▎ | 310/11676 [01:03<38:16, 4.95it/s]
3%|▎ | 311/11676 [0
4.95it/s]
3%|▎ | 312/11676 [01:04<38:19, 4.94it/s]
3%|▎ | 313/11676 [01:04<38:21, 4.94it/s]
3%|▎ | 314/11676 [01:04<38:19, 4.94it/s]
3%|▎ | 315/11676 [01:04<38:17, 4.94it/s]
3%|▎ | 316/11676 [01:05<38:15, 4.95it/s]
3%|▎ | 317/11676 [01:05<38:17, 4.94it/s]
3%|▎ | 318/11676 [01:05<38:17, 4.94it/s]
3%|▎ | 319/11676 [01:05<38:16, 4.95it/s]
3%|▎ | 320/11676 [01:05<38:14, 4.95it/s]
3%|▎ | 321/11676 [01:06<38:13, 4.95it/s]
3%|▎ | 322/11676 [01:06<38:14, 4.95it/s]
3%|▎ | 323/11676 [01:06<38:16, 4.94it/s]
3%|▎ | 324/11676 [01:06<38:15, 4.95it/s]
3%|▎ | 325/11676 [01:06<38:13, 4.95it/s]
3%|▎ | 326/11676 [01:07<38:11, 4.95it/s]
3%|▎ | 327/11676 [01:07<38:12, 4.95it/s]
3%|▎ | 328/11676 [01:07<38:14, 4.95it/s]
3%|▎ | 329/11676 [01:07<38:45, 4.88it/s]
3%|▎ | 330/11676 [01:07<33:28, 5.65it/s]/app/env/lib/python3.9/site-packages/torch/utils/checkpoint.py:429: UserWarning: torch.utils.checkpoint: please pass in use_reentrant=True or use_reentrant=False explicitly. The default value of use_reentrant will be updated to be False in the future. To maintain current behavior, pass use_reentrant=True. It is recommended that you use use_reentrant=False. Refer to docs for more details on the differences between the two variants.
warnings.warn(
INFO: 10.16.38.82:50124 - "GET /?logs=container&__sign=eyJhbGciOiJFZERTQSJ9.eyJpYXQiOjE2OTY2ODQxODIsInN1YiI6IlJlZDVSZWQ1L2F1dG90cmFpbi1kYXRhcGFydDA5LW9wdDEyNS0wIiwiZXhwIjoxNjk2NzcwNTgyLCJpc3MiOiJodHRwczovL2h1Z2dpbmdmYWNlLmNvIn0.2acH2ViyMIqWcwNHjg2frZ5xntfUEbPyEma7rcD1iUwAnO4oacx1o022SVrek3bTbZWkW0qyMJgrjSGOQJjuBw HTTP/1.1" 200 OK
2nd one:
[02:06<37:07, 4.96it/s]
5%|▌ | 620/11676 [02:07<37:06, 4.97it/s]
5%|▌ | 621/11676 [02:07<37:06, 4.97it/s]
5%|▌ | 622/11676 [02:07<37:06, 4.96it/s]
5%|▌ | 623/11676 [02:07<37:07, 4.96it/s]
5%|▌ | 624/11676 [02:07<37:06, 4.96it/s]
5%|▌ | 625/11676 [02:08<37:06, 4.96it/s]
5%|▌ | 626/11676 [02:08<37:07, 4.96it/s]
5%|▌ | 627/11676 [02:08<37:07, 4.96it/s]
5%|▌ | 628/11676 [02:08<37:07, 4.96it/s]
5%|▌ | 629/11676 [02:08<37:06, 4.96it/s]
5%|▌ | 630/11676 [02:09<37:06, 4.96it/s]
5%|▌ | 631/11676 [02:09<37:06, 4.96it/s]
5%|▌ | 632/11676 [02:09<37:05, 4.96it/s]
5%|▌ | 633/11676 [02:09<37:05, 4.96it/s]
5%|▌ | 634/11676 [02:09<37:04, 4.96it/s]
5%|▌ | 635/11676 [02:10<37:06, 4.96it/s]
5%|▌ | 636/11676 [02:10<37:05, 4.96it/s]
5%|▌ | 637/11676 [02:10<37:05, 4.96it/s]
5%|▌ | 638/11676 [02:10<37:04, 4.96it/s]
5%|▌ | 639/11676 [02:10<37:03, 4.96it/s]
5%|▌ | 640/11676 [02:11<37:06, 4.96it/s]
5%|▌ | 641/11676 [02:11<37:06, 4.96it/s]
5%|▌ | 642/11676 [02:11<37:06, 4.96it/s]
6%|▌ | 643/11676 [02:11<37:05, 4.96it/s]
6%|▌ | 644/11676 [02:11<37:04, 4.96it/s]
6%|▌ | 645/11676 [02:12<37:04, 4.96it/s]
6%|▌ | 646/11676 [02:12<37:04, 4.96it/s]
6%|▌ | 647/11676 [02:12<37:03, 4.96it/s]
6%|▌ | 648/11676 [02:12<37:02, 4.96it/s]
6%|▌ | 649/11676 [02:12<37:01, 4.96it/s]
6%|▌ | 650/11676 [02:13<37:01, 4.96it/s]
6%|▌ | 651/11676 [02:13<37:01, 4.96it/s]
6%|▌ | 652/11676 [02:13<37:00, 4.96it/s]
6%|▌ | 653/11676 [02:13<37:00, 4.96it/s]
6%|▌ | 654/11676 [02:13<37:00, 4.96it/s]
6%|▌ | 655/11676 [02:14<37:00, 4.96it/s]
6%|▌ | 656/11676 [02:14<37:00, 4.96it/s]
6%|▌ | 657/11676 [02:14<37:00, 4.96it/s]
6%|▌ | 658/11676 [02:14<37:00, 4.96it/s]
6%|▌ | 659/11676 [02:14<37:00, 4.96it/s]
6%|▌ | 660/11676 [02:15<37:00, 4.96it/s]
6%|▌ | 661/11676 [02:15<37:01, 4.96it/s]
6%|▌ | 662/11676 [02:15<37:01, 4.96it/s]
6%|▌ | 663/11676 [02:15<36:59, 4.96it/s]
6%|▌ | 664/11676 [02:15<36:59, 4.96it/s]
6%|▌ | 665/11676 [02:16<36:58, 4.96it/s]
6%|▌ | 666/11676 [02:16<36:59, 4.96it/s]
6%|▌ | 667/11676 [02:16<36:59, 4.96it/s]
6%|▌ | 668/11676 [02:16<36:58, 4.96it/s]
6%|▌ | 669/11676 [02:16<36:57, 4.96it/s]
6%|▌ | 670/11676 [02:17<36:57, 4.96it/s]
6%|▌ | 671/11676 [02:17<36:56, 4.96it/s]
6%|▌ | 672/11676 [02:17<36:56, 4.97it/s]
6%|▌ | 673/11676 [02:17<36:55, 4.97it/s]
6%|▌ | 674/11676 [02:17<36:55, 4.97it/s]
6%|▌ | 675/11676 [02:18<36:55, 4.97it/s]
6%|▌ | 676/11676 [02:18<36:54, 4.97it/s]
6%|▌ | 677/11676 [02:18<36:54, 4.97it/s]
6%|▌ | 678/11676 [02:18<36:54, 4.97it/s]
6%|▌ | 679/11676 [02:18<36:53, 4.97it/s]
6%|▌ | 680/11676 [02:19<37:23, 4.90it/s]/app/env/lib/python3.9/site-packages/torch/utils/checkpoint.py:429: UserWarning: torch.utils.checkpoint: please pass in use_reentrant=True or use_reentrant=False explicitly. The default value of use_reentrant will be updated to be False in the future. To maintain current behavior, pass use_reentrant=True. It is recommended that you use use_reentrant=False. Refer to docs for more details on the differences between the two variants.
warnings.warn(
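
In case it helps to compare the two runs, here is a small sketch that pulls the last tqdm progress entry out of a log tail. The function name `last_progress` and the two sample strings are my own. The strings are shortened copies of the final lines of each log above, and the regular expression assumes the default tqdm format shown there.

```python
import re

# Matches tqdm progress fragments such as "330/11676 [01:07<33:28, 5.65it/s]".
# Captures the current step, the total number of steps, and the elapsed time.
PROGRESS = re.compile(r"(\d+)/(\d+) \[([\d:]+)<")


def last_progress(log_text):
    """Return (step, total, elapsed) for the last progress fragment, or None."""
    matches = PROGRESS.findall(log_text)
    if not matches:
        return None
    step, total, elapsed = matches[-1]
    return int(step), int(total), elapsed


# Shortened copies of the last lines of each pasted log (assumed representative).
run1_tail = (
    "3%|▎ | 329/11676 [01:07<38:45, 4.88it/s]\n"
    "3%|▎ | 330/11676 [01:07<33:28, 5.65it/s]"
    "/app/env/lib/python3.9/site-packages/torch/utils/checkpoint.py:429: UserWarning"
)
run2_tail = (
    "6%|▌ | 679/11676 [02:18<36:53, 4.97it/s]\n"
    "6%|▌ | 680/11676 [02:19<37:23, 4.90it/s]"
    "/app/env/lib/python3.9/site-packages/torch/utils/checkpoint.py:429: UserWarning"
)

for name, tail in (("run 1", run1_tail), ("run 2", run2_tail)):
    step, total, elapsed = last_progress(tail)
    percent = 100 * step / total
    print(f"{name}: last step {step} of {total} ({percent:.1f}%) after {elapsed}")
```

With the excerpts above, the last logged steps are 330 and 680, and in both runs the `torch.utils.checkpoint` warning is the final thing printed before the Space is paused. I do not know whether that warning is related to the failure.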