Training fails but no error message

Hi All,
I have been following this tutorial, Non-engineers guide: Train a LLaMA 2 chatbot, to fine-tune an LLM (meta/opt125m) with my own data.
The training starts well, but it always fails somewhere between 3-10%. I cannot see any error message that I could use to find the issue and correct it. The Space with the model training goes into “Paused” mode.

Could you help me figure out what the issue is?
Any pointers would be greatly appreciated, many thanks!

I have pasted the logs from two training runs here. In each case this is the end of the log file; the instance was then paused by the platform.

1st one:

300/11676 [01:01<38:19,  4.95it/s]
  3%|▎         | 301/11676 [01:02<38:17,  4.95it/s]
  3%|▎         | 302/11676 [01:02<38:17,  4.95it/s]
  3%|▎         | 303/11676 [01:02<38:17,  4.95it/s]
  3%|▎         | 304/11676 [01:02<38:18,  4.95it/s]
  3%|▎         | 305/11676 [01:02<38:17,  4.95it/s]
  3%|▎         | 306/11676 [01:03<38:15,  4.95it/s]
  3%|▎         | 307/11676 [01:03<38:14,  4.95it/s]
  3%|▎         | 308/11676 [01:03<38:14,  4.95it/s]
  3%|▎         | 309/11676 [01:03<38:15,  4.95it/s]
  3%|▎         | 310/11676 [01:03<38:16,  4.95it/s]
  3%|▎         | 311/11676 [01:04<38:18,  4.95it/s]
  3%|▎         | 312/11676 [01:04<38:19,  4.94it/s]
  3%|▎         | 313/11676 [01:04<38:21,  4.94it/s]
  3%|▎         | 314/11676 [01:04<38:19,  4.94it/s]
  3%|▎         | 315/11676 [01:04<38:17,  4.94it/s]
  3%|▎         | 316/11676 [01:05<38:15,  4.95it/s]
  3%|▎         | 317/11676 [01:05<38:17,  4.94it/s]
  3%|▎         | 318/11676 [01:05<38:17,  4.94it/s]
  3%|▎         | 319/11676 [01:05<38:16,  4.95it/s]
  3%|▎         | 320/11676 [01:05<38:14,  4.95it/s]
  3%|▎         | 321/11676 [01:06<38:13,  4.95it/s]
  3%|▎         | 322/11676 [01:06<38:14,  4.95it/s]
  3%|▎         | 323/11676 [01:06<38:16,  4.94it/s]
  3%|▎         | 324/11676 [01:06<38:15,  4.95it/s]
  3%|▎         | 325/11676 [01:06<38:13,  4.95it/s]
  3%|▎         | 326/11676 [01:07<38:11,  4.95it/s]
  3%|▎         | 327/11676 [01:07<38:12,  4.95it/s]
  3%|▎         | 328/11676 [01:07<38:14,  4.95it/s]
  3%|▎         | 329/11676 [01:07<38:45,  4.88it/s]
  3%|▎         | 330/11676 [01:07<33:28,  5.65it/s]/app/env/lib/python3.9/site-packages/torch/utils/checkpoint.py:429: UserWarning: torch.utils.checkpoint: please pass in use_reentrant=True or use_reentrant=False explicitly. The default value of use_reentrant will be updated to be False in the future. To maintain current behavior, pass use_reentrant=True. It is recommended that you use use_reentrant=False. Refer to docs for more details on the differences between the two variants.
  warnings.warn(
INFO:     10.16.38.82:50124 - "GET /?logs=container&__sign=eyJhbGciOiJFZERTQSJ9.eyJpYXQiOjE2OTY2ODQxODIsInN1YiI6IlJlZDVSZWQ1L2F1dG90cmFpbi1kYXRhcGFydDA5LW9wdDEyNS0wIiwiZXhwIjoxNjk2NzcwNTgyLCJpc3MiOiJodHRwczovL2h1Z2dpbmdmYWNlLmNvIn0.2acH2ViyMIqWcwNHjg2frZ5xntfUEbPyEma7rcD1iUwAnO4oacx1o022SVrek3bTbZWkW0qyMJgrjSGOQJjuBw HTTP/1.1" 200 OK

2nd one:

[02:06<37:07,  4.96it/s]
  5%|▌         | 620/11676 [02:07<37:06,  4.97it/s]
  5%|▌         | 621/11676 [02:07<37:06,  4.97it/s]
  5%|▌         | 622/11676 [02:07<37:06,  4.96it/s]
  5%|▌         | 623/11676 [02:07<37:07,  4.96it/s]
  5%|▌         | 624/11676 [02:07<37:06,  4.96it/s]
  5%|▌         | 625/11676 [02:08<37:06,  4.96it/s]
  5%|▌         | 626/11676 [02:08<37:07,  4.96it/s]
  5%|▌         | 627/11676 [02:08<37:07,  4.96it/s]
  5%|▌         | 628/11676 [02:08<37:07,  4.96it/s]
  5%|▌         | 629/11676 [02:08<37:06,  4.96it/s]
  5%|▌         | 630/11676 [02:09<37:06,  4.96it/s]
  5%|▌         | 631/11676 [02:09<37:06,  4.96it/s]
  5%|▌         | 632/11676 [02:09<37:05,  4.96it/s]
  5%|▌         | 633/11676 [02:09<37:05,  4.96it/s]
  5%|▌         | 634/11676 [02:09<37:04,  4.96it/s]
  5%|▌         | 635/11676 [02:10<37:06,  4.96it/s]
  5%|▌         | 636/11676 [02:10<37:05,  4.96it/s]
  5%|▌         | 637/11676 [02:10<37:05,  4.96it/s]
  5%|▌         | 638/11676 [02:10<37:04,  4.96it/s]
  5%|▌         | 639/11676 [02:10<37:03,  4.96it/s]
  5%|▌         | 640/11676 [02:11<37:06,  4.96it/s]
  5%|▌         | 641/11676 [02:11<37:06,  4.96it/s]
  5%|▌         | 642/11676 [02:11<37:06,  4.96it/s]
  6%|▌         | 643/11676 [02:11<37:05,  4.96it/s]
  6%|▌         | 644/11676 [02:11<37:04,  4.96it/s]
  6%|▌         | 645/11676 [02:12<37:04,  4.96it/s]
  6%|▌         | 646/11676 [02:12<37:04,  4.96it/s]
  6%|▌         | 647/11676 [02:12<37:03,  4.96it/s]
  6%|▌         | 648/11676 [02:12<37:02,  4.96it/s]
  6%|▌         | 649/11676 [02:12<37:01,  4.96it/s]
  6%|▌         | 650/11676 [02:13<37:01,  4.96it/s]
  6%|▌         | 651/11676 [02:13<37:01,  4.96it/s]
  6%|▌         | 652/11676 [02:13<37:00,  4.96it/s]
  6%|▌         | 653/11676 [02:13<37:00,  4.96it/s]
  6%|▌         | 654/11676 [02:13<37:00,  4.96it/s]
  6%|▌         | 655/11676 [02:14<37:00,  4.96it/s]
  6%|▌         | 656/11676 [02:14<37:00,  4.96it/s]
  6%|▌         | 657/11676 [02:14<37:00,  4.96it/s]
  6%|▌         | 658/11676 [02:14<37:00,  4.96it/s]
  6%|▌         | 659/11676 [02:14<37:00,  4.96it/s]
  6%|▌         | 660/11676 [02:15<37:00,  4.96it/s]
  6%|▌         | 661/11676 [02:15<37:01,  4.96it/s]
  6%|▌         | 662/11676 [02:15<37:01,  4.96it/s]
  6%|▌         | 663/11676 [02:15<36:59,  4.96it/s]
  6%|▌         | 664/11676 [02:15<36:59,  4.96it/s]
  6%|▌         | 665/11676 [02:16<36:58,  4.96it/s]
  6%|▌         | 666/11676 [02:16<36:59,  4.96it/s]
  6%|▌         | 667/11676 [02:16<36:59,  4.96it/s]
  6%|▌         | 668/11676 [02:16<36:58,  4.96it/s]
  6%|▌         | 669/11676 [02:16<36:57,  4.96it/s]
  6%|▌         | 670/11676 [02:17<36:57,  4.96it/s]
  6%|▌         | 671/11676 [02:17<36:56,  4.96it/s]
  6%|▌         | 672/11676 [02:17<36:56,  4.97it/s]
  6%|▌         | 673/11676 [02:17<36:55,  4.97it/s]
  6%|▌         | 674/11676 [02:17<36:55,  4.97it/s]
  6%|▌         | 675/11676 [02:18<36:55,  4.97it/s]
  6%|▌         | 676/11676 [02:18<36:54,  4.97it/s]
  6%|▌         | 677/11676 [02:18<36:54,  4.97it/s]
  6%|▌         | 678/11676 [02:18<36:54,  4.97it/s]
  6%|▌         | 679/11676 [02:18<36:53,  4.97it/s]
  6%|▌         | 680/11676 [02:19<37:23,  4.90it/s]/app/env/lib/python3.9/site-packages/torch/utils/checkpoint.py:429: UserWarning: torch.utils.checkpoint: please pass in use_reentrant=True or use_reentrant=False explicitly. The default value of use_reentrant will be updated to be False in the future. To maintain current behavior, pass use_reentrant=True. It is recommended that you use use_reentrant=False. Refer to docs for more details on the differences between the two variants.
  warnings.warn(
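
(Side note on the UserWarning that appears at the end of both logs: it is the gradient-checkpointing deprecation notice from torch, not the cause of the failure. If you were writing the training code yourself rather than using the AutoTrain UI, a minimal sketch of how it can be silenced, assuming a transformers version recent enough to accept gradient_checkpointing_kwargs:)

# Sketch only: pass use_reentrant explicitly when enabling gradient checkpointing.
# Requires a transformers release that accepts gradient_checkpointing_kwargs;
# on older versions the warning is merely cosmetic and can be ignored.
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("facebook/opt-125m")
model.gradient_checkpointing_enable(gradient_checkpointing_kwargs={"use_reentrant": False})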

Seems to be going out of memory. Which machine did you use?


A10G Large in all cases

I tried to use the A100, but I always get an error that my account is not eligible.

Is it possible to get the A100 enabled? Or would you recommend using a smaller model?
The meta/opt125m model is one of the smaller options in the selection anyway.

My own dataset is about 3-4 MB, so not much at all.

That's a small model. Are you sure the trained model is not in your profile? The training Space pauses on its own when the training is done.


I am positive; the model did not complete the training.

I have kept the browser open with the logs scrolling as the training went on, and it always stops suddenly, somewhere between 3-10% completion. I tried multiple models and multiple training datasets, always leaving all options on default, as suggested by the tutorial above.

If you are using SFT, the progress bar is not representative. The only way to confirm whether it finished is to check your HF account for new model repos.
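
(If you prefer to check from code rather than the website, a rough sketch with huggingface_hub, where "your-username" is a placeholder for your HF username:)

from huggingface_hub import HfApi

api = HfApi()
# List the model repos under your account; a finished AutoTrain run pushes one here.
for model in api.list_models(author="your-username"):
    print(model.modelId)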


I have the same issue. I've wasted $30 at this point and gotten nowhere. I get a model repo with a folder called “checkpoint - xxx”, which doesn't seem complete and will not work with a Chat UI as outlined in the tutorial. I've followed the tutorial provided by Hugging Face word for word and it doesn't work.

Thanks so much! That's amazingly helpful. The models were actually there under my HF account, but I didn't understand that they were ready to use.

Now I want to do the Chat UI part. I launched the Chat UI Docker instance, based on a model I successfully trained.

But now I get this error:

OSError: Red5Red5/arc-ai-v01-0 does not appear to have a file named config.json. Checkout ‘https://huggingface.co/Red5Red5/arc-ai-v01-0/main’ for available files.

There is indeed no config.json file. The model is based on “meta-llama/Llama-2-7b-chat-hf”.
How can I generate a config.json file?

Thank you again for the help!

Try clicking your HF profile in the top-right corner, then click on your own username.
Then scroll below the “Spaces” section.
Under the “Models” section, do you see anything? That's where I found my precious models that I trained for good $$$.
Kind of wish there was a little notification that the AutoTrain run was successful and that you can find your model in the Models section of your account.


Hi Red, yes, but I don't think the model completed, which is why we're getting the config.json error. I'm assuming the checkpoint folder in the model repo is where the model training “paused” before completion.

@abhishek (or anyone) can you confirm if an A10G large is actually powerful enough to autotrain Llama 7b on a 52k row dataset as described in your tutorial? Non-engineers guide: Train a LLaMA 2 chatbot

@mortsn Not finding config.json is not an issue. The model was trained successfully, but since it's an adapter model, we need to merge the adapter. It seems this step is missing from the blog post; I'll update it.
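
(For reference, a minimal sketch of what that merge step does with peft, assuming the repo is a LoRA/PEFT adapter on meta-llama/Llama-2-7b-chat-hf as described above and that you have enough memory to load the base model; the Space linked later in this thread does the equivalent for you:)

from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

base_id = "meta-llama/Llama-2-7b-chat-hf"   # base model the adapter was trained on
adapter_id = "Red5Red5/arc-ai-v01-0"        # adapter repo mentioned in this thread

base = AutoModelForCausalLM.from_pretrained(base_id)
model = PeftModel.from_pretrained(base, adapter_id)
merged = model.merge_and_unload()           # bake the adapter into the base weights

merged.save_pretrained("arc-ai-v01-0-merged")   # writes config.json alongside the weights
AutoTokenizer.from_pretrained(base_id).save_pretrained("arc-ai-v01-0-merged")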

Thank you!

@abhishek are you still planning to update the blog post? I’d love to continue working with autotrain, but am unsure how to merge the adapter. Thanks.

If you use the --merge-adapter param during training, the adapter will be merged when training finishes. I'll update the blog post tomorrow.


Hi @abhishek, I don't believe there is an option to set the --merge-adapter parameter in the AutoTrain tool. Are you still planning to update the blog post?

Sorry for my late response. You can use this Space to merge the adapter for models trained using the AutoTrain UI: Llm Merge Adapter - a Hugging Face Space by autotrain-projects. Please let me know if you have any questions.


Thank you @abhishek! It seems like I was able to merge the model successfully, but now the Chat UI step seems broken. It gets stuck in a loop of “Failed to connect to 127.0.0.1 port 8080: Connection refused” when building.

There are multiple threads reporting this issue, and it looks like your own Space is having it as well. Is there an issue with Chat UI, or does this mean something is wrong with the model?

thread 1
thread 2
space

Thanks

Are you attaching a GPU of appropriate size for the model?

@abhishek Yes, I get this error whether I use the A10G small recommended for Chat UI or try to go higher to the A10G large.

It does give this error before the “Failed to connect” loop: ValueError: Expected target PEFT class: PeftModelForCausalLM, but you have asked for: PeftModelForSeq2SeqLM make sure that you are loading the correct model for your task type.

Could that be the issue? Is there a way to specify to use “PeftModelForCausalLM”?
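
(A guess, not confirmed in this thread: that ValueError usually means the adapter_config.json in the adapter repo has task_type set to SEQ_2_SEQ_LM while a causal-LM adapter is expected. A minimal sketch for inspecting it, with "your-username/your-adapter-repo" as a placeholder:)

from peft import PeftConfig

# For a Llama 2 chat adapter this should print CAUSAL_LM; if it prints
# SEQ_2_SEQ_LM, edit task_type in the repo's adapter_config.json before
# loading or merging the adapter.
cfg = PeftConfig.from_pretrained("your-username/your-adapter-repo")
print(cfg.task_type)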