Training fails but no error message

Hi All,
I have been following this tutorial, Non-engineers guide: Train a LLaMA 2 chatbot, to fine-tune an LLM (meta/opt125m) with my own data.
The training starts well, but it always fails somewhere between 3-10%. I cannot see any error message that I could use to find the issue and correct it. The Space with the model training goes into “Paused” mode.

Could you help me figure out what the issue is?
Any pointers would be greatly appreciated, many thanks!

I have pasted the logs from two training runs here. In each case this is the end of the log file; the instance was then paused by the platform.

1st one:

300/11676 [01:01<38:19,  4.95it/s]
  3%|▎         | 301/11676 [01:02<38:17,  4.95it/s]
  3%|▎         | 302/11676 [01:02<38:17,  4.95it/s]
  3%|▎         | 303/11676 [01:02<38:17,  4.95it/s]
  3%|▎         | 304/11676 [01:02<38:18,  4.95it/s]
  3%|▎         | 305/11676 [01:02<38:17,  4.95it/s]
  3%|▎         | 306/11676 [01:03<38:15,  4.95it/s]
  3%|▎         | 307/11676 [01:03<38:14,  4.95it/s]
  3%|▎         | 308/11676 [01:03<38:14,  4.95it/s]
  3%|▎         | 309/11676 [01:03<38:15,  4.95it/s]
  3%|▎         | 310/11676 [01:03<38:16,  4.95it/s]
  3%|▎         | 311/11676 [01:04<38:18,  4.95it/s]
  3%|▎         | 312/11676 [01:04<38:19,  4.94it/s]
  3%|▎         | 313/11676 [01:04<38:21,  4.94it/s]
  3%|▎         | 314/11676 [01:04<38:19,  4.94it/s]
  3%|▎         | 315/11676 [01:04<38:17,  4.94it/s]
  3%|▎         | 316/11676 [01:05<38:15,  4.95it/s]
  3%|▎         | 317/11676 [01:05<38:17,  4.94it/s]
  3%|▎         | 318/11676 [01:05<38:17,  4.94it/s]
  3%|▎         | 319/11676 [01:05<38:16,  4.95it/s]
  3%|▎         | 320/11676 [01:05<38:14,  4.95it/s]
  3%|▎         | 321/11676 [01:06<38:13,  4.95it/s]
  3%|▎         | 322/11676 [01:06<38:14,  4.95it/s]
  3%|▎         | 323/11676 [01:06<38:16,  4.94it/s]
  3%|▎         | 324/11676 [01:06<38:15,  4.95it/s]
  3%|▎         | 325/11676 [01:06<38:13,  4.95it/s]
  3%|▎         | 326/11676 [01:07<38:11,  4.95it/s]
  3%|▎         | 327/11676 [01:07<38:12,  4.95it/s]
  3%|▎         | 328/11676 [01:07<38:14,  4.95it/s]
  3%|▎         | 329/11676 [01:07<38:45,  4.88it/s]
  3%|▎         | 330/11676 [01:07<33:28,  5.65it/s]/app/env/lib/python3.9/site-packages/torch/utils/checkpoint.py:429: UserWarning: torch.utils.checkpoint: please pass in use_reentrant=True or use_reentrant=False explicitly. The default value of use_reentrant will be updated to be False in the future. To maintain current behavior, pass use_reentrant=True. It is recommended that you use use_reentrant=False. Refer to docs for more details on the differences between the two variants.
  warnings.warn(
INFO:     10.16.38.82:50124 - "GET /?logs=container&__sign=eyJhbGciOiJFZERTQSJ9.eyJpYXQiOjE2OTY2ODQxODIsInN1YiI6IlJlZDVSZWQ1L2F1dG90cmFpbi1kYXRhcGFydDA5LW9wdDEyNS0wIiwiZXhwIjoxNjk2NzcwNTgyLCJpc3MiOiJodHRwczovL2h1Z2dpbmdmYWNlLmNvIn0.2acH2ViyMIqWcwNHjg2frZ5xntfUEbPyEma7rcD1iUwAnO4oacx1o022SVrek3bTbZWkW0qyMJgrjSGOQJjuBw HTTP/1.1" 200 OK

2nd one:

[02:06<37:07,  4.96it/s]
  5%|▌         | 620/11676 [02:07<37:06,  4.97it/s]
  5%|▌         | 621/11676 [02:07<37:06,  4.97it/s]
  5%|▌         | 622/11676 [02:07<37:06,  4.96it/s]
  5%|▌         | 623/11676 [02:07<37:07,  4.96it/s]
  5%|▌         | 624/11676 [02:07<37:06,  4.96it/s]
  5%|▌         | 625/11676 [02:08<37:06,  4.96it/s]
  5%|▌         | 626/11676 [02:08<37:07,  4.96it/s]
  5%|▌         | 627/11676 [02:08<37:07,  4.96it/s]
  5%|▌         | 628/11676 [02:08<37:07,  4.96it/s]
  5%|▌         | 629/11676 [02:08<37:06,  4.96it/s]
  5%|▌         | 630/11676 [02:09<37:06,  4.96it/s]
  5%|▌         | 631/11676 [02:09<37:06,  4.96it/s]
  5%|▌         | 632/11676 [02:09<37:05,  4.96it/s]
  5%|▌         | 633/11676 [02:09<37:05,  4.96it/s]
  5%|▌         | 634/11676 [02:09<37:04,  4.96it/s]
  5%|▌         | 635/11676 [02:10<37:06,  4.96it/s]
  5%|▌         | 636/11676 [02:10<37:05,  4.96it/s]
  5%|▌         | 637/11676 [02:10<37:05,  4.96it/s]
  5%|▌         | 638/11676 [02:10<37:04,  4.96it/s]
  5%|▌         | 639/11676 [02:10<37:03,  4.96it/s]
  5%|▌         | 640/11676 [02:11<37:06,  4.96it/s]
  5%|▌         | 641/11676 [02:11<37:06,  4.96it/s]
  5%|▌         | 642/11676 [02:11<37:06,  4.96it/s]
  6%|▌         | 643/11676 [02:11<37:05,  4.96it/s]
  6%|▌         | 644/11676 [02:11<37:04,  4.96it/s]
  6%|▌         | 645/11676 [02:12<37:04,  4.96it/s]
  6%|▌         | 646/11676 [02:12<37:04,  4.96it/s]
  6%|▌         | 647/11676 [02:12<37:03,  4.96it/s]
  6%|▌         | 648/11676 [02:12<37:02,  4.96it/s]
  6%|▌         | 649/11676 [02:12<37:01,  4.96it/s]
  6%|▌         | 650/11676 [02:13<37:01,  4.96it/s]
  6%|▌         | 651/11676 [02:13<37:01,  4.96it/s]
  6%|▌         | 652/11676 [02:13<37:00,  4.96it/s]
  6%|▌         | 653/11676 [02:13<37:00,  4.96it/s]
  6%|▌         | 654/11676 [02:13<37:00,  4.96it/s]
  6%|▌         | 655/11676 [02:14<37:00,  4.96it/s]
  6%|▌         | 656/11676 [02:14<37:00,  4.96it/s]
  6%|▌         | 657/11676 [02:14<37:00,  4.96it/s]
  6%|▌         | 658/11676 [02:14<37:00,  4.96it/s]
  6%|▌         | 659/11676 [02:14<37:00,  4.96it/s]
  6%|▌         | 660/11676 [02:15<37:00,  4.96it/s]
  6%|▌         | 661/11676 [02:15<37:01,  4.96it/s]
  6%|▌         | 662/11676 [02:15<37:01,  4.96it/s]
  6%|▌         | 663/11676 [02:15<36:59,  4.96it/s]
  6%|▌         | 664/11676 [02:15<36:59,  4.96it/s]
  6%|▌         | 665/11676 [02:16<36:58,  4.96it/s]
  6%|▌         | 666/11676 [02:16<36:59,  4.96it/s]
  6%|▌         | 667/11676 [02:16<36:59,  4.96it/s]
  6%|▌         | 668/11676 [02:16<36:58,  4.96it/s]
  6%|▌         | 669/11676 [02:16<36:57,  4.96it/s]
  6%|▌         | 670/11676 [02:17<36:57,  4.96it/s]
  6%|▌         | 671/11676 [02:17<36:56,  4.96it/s]
  6%|▌         | 672/11676 [02:17<36:56,  4.97it/s]
  6%|▌         | 673/11676 [02:17<36:55,  4.97it/s]
  6%|▌         | 674/11676 [02:17<36:55,  4.97it/s]
  6%|▌         | 675/11676 [02:18<36:55,  4.97it/s]
  6%|▌         | 676/11676 [02:18<36:54,  4.97it/s]
  6%|▌         | 677/11676 [02:18<36:54,  4.97it/s]
  6%|▌         | 678/11676 [02:18<36:54,  4.97it/s]
  6%|▌         | 679/11676 [02:18<36:53,  4.97it/s]
  6%|▌         | 680/11676 [02:19<37:23,  4.90it/s]/app/env/lib/python3.9/site-packages/torch/utils/checkpoint.py:429: UserWarning: torch.utils.checkpoint: please pass in use_reentrant=True or use_reentrant=False explicitly. The default value of use_reentrant will be updated to be False in the future. To maintain current behavior, pass use_reentrant=True. It is recommended that you use use_reentrant=False. Refer to docs for more details on the differences between the two variants.
  warnings.warn(
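
(Side note on the UserWarning that appears at the end of both logs: it is the gradient-checkpointing deprecation notice from torch, not the cause of the failure. If you were writing the training code yourself rather than using the AutoTrain UI, a minimal sketch of how it can be silenced, assuming a transformers version recent enough to accept gradient_checkpointing_kwargs:)

# Sketch only: pass use_reentrant explicitly when enabling gradient checkpointing.
# Requires a transformers release that accepts gradient_checkpointing_kwargs;
# on older versions the warning is merely cosmetic and can be ignored.
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("facebook/opt-125m")
model.gradient_checkpointing_enable(gradient_checkpointing_kwargs={"use_reentrant": False})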

Seems to be going out of memory. Which machine did you use?


A10G Large in all cases

I tried to use the A100, but I always get an error that my account is not eligible.

Is it possible to get the A100 enabled? Or would you recommend using a smaller model?
The meta/opt125m model is one of the smaller options in the selection anyway.

My own dataset is about 3-4 MB, so not much at all.

That's a small model. Are you sure the trained model is not in your profile? The training Space pauses on its own when the training is done.


I am positive; the model did not complete the training.

I have kept the browser open with the logs scrolling as the training went on, and it always stops suddenly, somewhere between 3-10% completion. I tried multiple models and multiple training datasets, always leaving all options on default, as suggested by the tutorial above.

If you are using SFT, the progress bar is not representative. The only way to confirm whether it finished is to check your HF account for new model repos.
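
(If you prefer to check from code rather than the website, a rough sketch with huggingface_hub, where "your-username" is a placeholder for your HF username:)

from huggingface_hub import HfApi

api = HfApi()
# List the model repos under your account; a finished AutoTrain run pushes one here.
for model in api.list_models(author="your-username"):
    print(model.modelId)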


I have the same issue. I've wasted $30 at this point and gotten nowhere. I get a model repo with a folder called “checkpoint - xxx”, which doesn't seem complete and will not work with a Chat UI as outlined in the tutorial. I've followed the tutorial provided by Hugging Face word for word and it doesn't work.

Thanks so much! That's amazingly helpful. The models were actually there under my HF account, but I didn't understand that they were ready to use.

Now I want to do the Chat UI part. I launched the Chat UI Docker instance, based on a model I successfully trained.

But now I get this error:

OSError: Red5Red5/arc-ai-v01-0 does not appear to have a file named config.json. Checkout ‘https://huggingface.co/Red5Red5/arc-ai-v01-0/main’ for available files.

There is indeed no config.json file. The model is based on “meta-llama/Llama-2-7b-chat-hf”.
How can I generate a config.json file?

Thank you again for the help!

Try clicking your HF profile in the top-right corner, then click on your own username.
Then scroll below the “Spaces” section.
Under the “Models” section, do you see anything? That's where I found my precious models that I trained for good $$$.
Kind of wish there was a little notification that the AutoTrain run was successful and that you can find your model in the Models section of your account.


Hi Red, yes, but I don't think the model completed, which is why we're getting the config.json error. I'm assuming the checkpoint folder in the model repo is where the model training “paused” before completion.

@abhishek (or anyone) can you confirm if an A10G large is actually powerful enough to autotrain Llama 7b on a 52k row dataset as described in your tutorial? Non-engineers guide: Train a LLaMA 2 chatbot

@mortsn Not finding config.json is not an issue. The model was trained successfully, but since it's an adapter model, we need to merge the adapter. It seems this step is missing from the blog post; I'll update it.
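
(For reference, a minimal sketch of what that merge step does with peft, assuming the repo is a LoRA/PEFT adapter on meta-llama/Llama-2-7b-chat-hf as described above and that you have enough memory to load the base model; the Space linked later in this thread does the equivalent for you:)

from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

base_id = "meta-llama/Llama-2-7b-chat-hf"   # base model the adapter was trained on
adapter_id = "Red5Red5/arc-ai-v01-0"        # adapter repo mentioned in this thread

base = AutoModelForCausalLM.from_pretrained(base_id)
model = PeftModel.from_pretrained(base, adapter_id)
merged = model.merge_and_unload()           # bake the adapter into the base weights

merged.save_pretrained("arc-ai-v01-0-merged")   # writes config.json alongside the weights
AutoTokenizer.from_pretrained(base_id).save_pretrained("arc-ai-v01-0-merged")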

Thank you!

@abhishek are you still planning to update the blog post? I’d love to continue working with autotrain, but am unsure how to merge the adapter. Thanks.

If you use the --merge-adapter param during training, the adapter will be merged when training finishes. I'll update the blog post tomorrow.


Hi @abhishek, I don't believe there is an option to set the --merge-adapter parameter in the AutoTrain tool. Are you still planning to update the blog post?

Sorry for my late response. You can use this Space to merge the adapter for models trained using the AutoTrain UI: Llm Merge Adapter - a Hugging Face Space by autotrain-projects. Please let me know if you have any questions.


Thank you @abhishek! It seems like I was able to merge the model successfully, but now the Chat UI step seems broken. It gets stuck in a loop of “Failed to connect to 127.0.0.1 port 8080: Connection refused” when building.

There are multiple threads reporting this issue, and it looks like your own Space is having it as well. Is there an issue with Chat UI, or does this mean something is wrong with the model?

thread 1
thread 2
space

Thanks

Are you attaching a GPU of appropriate size for the model?

@abhishek Yes, I get this error whether I use the A10G small recommended for Chat UI or try to go higher to the A10G large.

It does give this error before the “Failed to connect” loop: ValueError: Expected target PEFT class: PeftModelForCausalLM, but you have asked for: PeftModelForSeq2SeqLM make sure that you are loading the correct model for your task type.

Could that be the issue? Is there a way to specify to use “PeftModelForCausalLM”?
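
(A guess, not confirmed in this thread: that ValueError usually means the adapter_config.json in the adapter repo has task_type set to SEQ_2_SEQ_LM while a causal-LM adapter is expected. A minimal sketch for inspecting it, with "your-username/your-adapter-repo" as a placeholder:)

from peft import PeftConfig

# For a Llama 2 chat adapter this should print CAUSAL_LM; if it prints
# SEQ_2_SEQ_LM, edit task_type in the repo's adapter_config.json before
# loading or merging the adapter.
cfg = PeftConfig.from_pretrained("your-username/your-adapter-repo")
print(cfg.task_type)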