All the training jobs end up getting stopped

I was trying out the AutoTrain platform by fine-tuning a model on a dataset (related to, but different from, the one used to fine-tune the previous checkpoints). I don't understand why all the models were stopped after hours of training, regardless of the metrics they achieved (some models were showing significantly better results than others).

Do you have any idea what might be going on? I have also tried to reach you at autonlp@huggingface.co with more details, including the project id, but figured it was worth asking here as well in case anyone else is experiencing similar issues.

Thank you in advance for your time!

I’m experiencing the same thing. It would be great if the interface surfaced at least an error reason.

+1. It seems that after 62,000 steps my training stops for no reason.

Have you found a solution? My training also stops without any errors or logs.

Same problem here. No errors or stack traces. Is there any way to access the logs?

Hi! Thanks for reporting, and sorry for the wait. A Space can be expected to restart from time to time if it doesn't have enough RAM for the training. In cases like this, we recommend using a larger instance. Please let us know if there are any other questions! Thanks again.
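If it helps, the instance size can also be changed programmatically with `huggingface_hub`. Here is a minimal sketch, assuming the training runs in a Space you own; the repo id below is a placeholder, and the right hardware tier depends on your model and dataset size:

```python
# Minimal sketch: upgrade a Space to a larger instance so the training
# job has more RAM available. "username/my-autotrain-space" is a
# placeholder for your own Space's repo id.
from huggingface_hub import HfApi, SpaceHardware

api = HfApi()  # uses the token from `huggingface-cli login` by default

# Request larger hardware; the Space restarts on the new instance.
api.request_space_hardware(
    repo_id="username/my-autotrain-space",  # placeholder repo id
    hardware=SpaceHardware.A10G_SMALL,      # pick a tier with enough RAM
)
```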

@michellehbn Can this happen occasionally, and would the solution be to restart the training myself?