All the training jobs end up getting stopped

I was trying out the AutoTrain platform by fine-tuning a model on a dataset (related to, but different from, the one used to fine-tune the previous checkpoints). I don't understand why all the models were stopped after hours of training, regardless of the metrics they achieved (some models were showing significantly better results than others).

Do you have any idea what might be going on? I have also tried to reach you at autonlp@huggingface.co with more details, including the project id, but figured it was worth asking here as well in case anyone else is experiencing similar issues.

Thank you in advance for your time!

I’m experiencing the same thing. It would be great if the interface surfaced at least an error reason.

+1. It seems that after 62,000 steps my training stops for no reason.

Have you found a solution? My training also stops without any errors or logs.

Same problem here. No errors or stack traces. Is there any way to access the logs?

Hi! Thanks for reporting, and sorry for the wait. A Space can be expected to restart from time to time if it doesn't have enough RAM for the training. In cases like this, we recommend using a larger instance. Please let us know if there are any other questions! Thanks again.
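If it helps, the instance size can also be changed programmatically with `huggingface_hub`. Here is a minimal sketch, assuming the training runs in a Space you own; the repo id below is a placeholder, and the right hardware tier depends on your model and dataset size:

```python
# Minimal sketch: upgrade a Space to a larger instance so the training
# job has more RAM available. "username/my-autotrain-space" is a
# placeholder for your own Space's repo id.
from huggingface_hub import HfApi, SpaceHardware

api = HfApi()  # uses the token from `huggingface-cli login` by default

# Request larger hardware; the Space restarts on the new instance.
api.request_space_hardware(
    repo_id="username/my-autotrain-space",  # placeholder repo id
    hardware=SpaceHardware.A10G_SMALL,      # pick a tier with enough RAM
)
```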

@michellehbn Can this happen occasionally, and would the solution be to restart the training myself?