Spaces Restart mid LLM Fine Tuning

On my first attempt I started a fine-tuning job for an 8B LLM and checked the logs regularly. My space was using 4xL4s with no persistent storage. The last time I checked, the training was 96% done. Two minutes later I opened my space to check again and saw that it had restarted without any error logs. I did a bit of research and realized a storage issue might have caused it, so I upgraded my storage to 150GB and restarted the training.
After ~16 hours the same thing happened again. The space just restarts at random, with no error message or log indicating why.

How am I supposed to debug?

Hello uncucan,

It looks like this could be an out-of-memory issue leading to a restart of the space.

We are currently working on exposing events related to a space so that errors of this kind are clearer.

In any case we cannot guarantee that a space will run without interruption for any arbitrary duration.

A good practice for such training cases is to regularly save checkpoints (in your persistent storage) so that training can resume from the latest one after a restart.
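
To illustrate that save-and-resume pattern, here is a minimal, framework-agnostic sketch. It makes several assumptions: persistent storage is taken to be mounted at /data (override it with the hypothetical CKPT_DIR environment variable), the state is a small JSON-serializable dict standing in for real model and optimizer state, and the helper names save_checkpoint, load_latest_checkpoint, and train are made up for this example.

```python
import json
import os
import tempfile
from pathlib import Path

# Assumption: persistent storage is mounted at /data; override with CKPT_DIR.
CKPT_DIR = Path(os.environ.get("CKPT_DIR", "/data/checkpoints"))


def save_checkpoint(state: dict, step: int, ckpt_dir: Path = CKPT_DIR, keep: int = 2) -> Path:
    """Atomically write `state` for `step`, then prune older checkpoints (keep >= 1)."""
    ckpt_dir.mkdir(parents=True, exist_ok=True)
    final_path = ckpt_dir / f"step_{step:08d}.json"
    # Write to a temp file first so a restart mid-write never leaves a
    # half-written file under the final checkpoint name.
    fd, tmp_name = tempfile.mkstemp(dir=ckpt_dir, suffix=".tmp")
    with os.fdopen(fd, "w") as f:
        json.dump({"step": step, "state": state}, f)
        f.flush()
        os.fsync(f.fileno())
    os.replace(tmp_name, final_path)
    # Keep only the newest `keep` checkpoints to bound disk usage.
    for old in sorted(ckpt_dir.glob("step_*.json"))[:-keep]:
        old.unlink()
    return final_path


def load_latest_checkpoint(ckpt_dir: Path = CKPT_DIR):
    """Return (step, state) from the newest checkpoint, or (0, None) if none exists."""
    candidates = sorted(ckpt_dir.glob("step_*.json")) if ckpt_dir.is_dir() else []
    if not candidates:
        return 0, None
    with open(candidates[-1]) as f:
        payload = json.load(f)
    return payload["step"], payload["state"]


def train(total_steps: int, save_every: int, ckpt_dir: Path = CKPT_DIR) -> dict:
    """Toy loop: resumes from the latest checkpoint and saves every `save_every` steps."""
    start_step, state = load_latest_checkpoint(ckpt_dir)
    state = state or {"loss": 1.0}
    for step in range(start_step + 1, total_steps + 1):
        state["loss"] *= 0.99  # placeholder for a real optimizer step
        if step % save_every == 0 or step == total_steps:
            save_checkpoint(state, step, ckpt_dir)
    return state
```

Calling train(...) again after a restart picks up from the last saved step instead of step 1. In a real fine-tuning job the JSON dump would be replaced by whatever your framework provides. For example, if you happen to be using the transformers Trainer, setting save_steps and save_total_limit in TrainingArguments and calling trainer.train(resume_from_checkpoint=True) covers the same ground.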