All the training jobs end up getting stopped

I was trying out the AutoTrain platform by fine-tuning a model on a dataset (related to, but different from, the one used to fine-tune the previous checkpoints), and I don’t understand why all the models were stopped after hours of training, regardless of the metrics they had achieved (some models showed significantly better results than others).

Do you have any idea what might be going on? I have also tried to reach you guys directly with some more details, including the project ID, but figured it was worth asking here as well in case anyone else is experiencing similar issues.

Thank you in advance for your time!
I’m experiencing the same thing. It would be great if the interface at least surfaced an error reason.

+1. It seems that my training stops after 62000 steps for no apparent reason.