Spaces Restart mid LLM Fine Tuning

uncucan · July 18, 2024, 2:35pm

On my first attempt I started a fine-tuning job for an 8B LLM and checked the logs regularly. My space was using 4xL4s with no persistent storage. The last time I checked, the training was 96% done. Two minutes later I opened my space to check again and saw that it had restarted without any error logs or anything. I did a little research and realized it might’ve happened because of a storage issue, so I upgraded my storage to 150GB and restarted the training.
After ~16 hours the same thing happened again. The space just restarts at random, without any error message or log indicating why.

How am I supposed to debug?

chris-rannou · July 19, 2024, 10:59am

Hello uncucan,

It looks like this could be an out-of-memory issue leading to a restart of the space.

We are currently working on exposing events related to a space so that these kinds of errors are clearer.

In any case, we cannot guarantee that a space will run without interruption for any arbitrary duration.

A good practice for such training jobs is to regularly save checkpoints (in your persistent storage) so that training can resume from the latest one after a restart.
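As a rough illustration of the resume step, here is a minimal sketch that looks for the most recent checkpoint at startup. It assumes persistent storage is mounted at `/data` and that checkpoints are written as `checkpoint-<step>` subdirectories; the `/data/checkpoints` path and the `find_latest_checkpoint` helper are hypothetical names chosen for this example, not something taken from the thread.

```python
import re
from pathlib import Path
from typing import Optional

# Assumption: persistent storage is mounted at /data and the training loop
# writes checkpoints as "checkpoint-<step>" subdirectories (hypothetical layout).
CHECKPOINT_ROOT = Path("/data/checkpoints")
_STEP_PATTERN = re.compile(r"^checkpoint-(\d+)$")


def find_latest_checkpoint(root: Path) -> Optional[Path]:
    """Return the checkpoint directory with the highest step number, or None."""
    if not root.is_dir():
        return None
    candidates = []
    for child in root.iterdir():
        match = _STEP_PATTERN.match(child.name)
        # Ignore stray files and directories that do not follow the naming scheme.
        if match and child.is_dir():
            candidates.append((int(match.group(1)), child))
    if not candidates:
        return None
    # Compare by numeric step so that checkpoint-1000 beats checkpoint-500.
    return max(candidates, key=lambda item: item[0])[1]


if __name__ == "__main__":
    latest = find_latest_checkpoint(CHECKPOINT_ROOT)
    if latest is None:
        print("No checkpoint found, starting from scratch")
    else:
        print(f"Resuming from {latest}")
```

If the job uses the transformers `Trainer`, setting `output_dir` to a folder on persistent storage together with `save_steps` in `TrainingArguments`, and then calling `trainer.train(resume_from_checkpoint=...)` with the path found above, should give the same effect. Whether that matches the setup in this thread is an assumption.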
