At first attempt I started a fine-tuning job for an 8B LLM and I was checking the logs regularly. My space was using 4xL4s with no persistent storage. Last time I checked the training was 96% done. 2 minutes later I opened my space to check again and saw that the space has restarted without any error logs or anything. I did a little bit of research and realized it might’ve happened because of some storage issue so I upgraded my storage to 150GB and restarted the training.
After ~16 hours the same thing happened again. The space randomly just restarts without any error message or a log indicating why this has happened.
How am I supposed to debug?