Recently, I tried to fine-tune the Llama 2 model by following the steps from this website.
I followed all the required steps and changed only the batch size and the save steps. However, after the first save, my PC crashed every time with no apparent cause.
My PC has an i9-11900KF, 128 GB of RAM, and a 3090.
You will need to monitor your PC’s diagnostics. One thing that comes to mind is that your CPU, GPU, and RAM are all working hard for an extended period. If the operating temperature exceeds the safe level, your PC will shut down abruptly to protect its components.
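If it helps, here is a rough sketch of the kind of monitoring I mean. It assumes an NVIDIA card with `nvidia-smi` on the PATH, and it only reads the first GPU. The function names and the `gpu_log.csv` file name are made up for illustration, and logging power draw and memory alongside temperature is my own guess at what might be useful. Each reading is flushed and fsynced so the last line should still be on disk after a hard crash.

```python
import os
import subprocess
import time

# Standard nvidia-smi --query-gpu properties; which ones to log is my assumption.
GPU_QUERY = "temperature.gpu,power.draw,memory.used"


def _to_float(field):
    """Convert one CSV field to float, or None if it is not numeric."""
    try:
        return float(field)
    except ValueError:
        return None  # nvidia-smi prints non-numeric text for unsupported fields


def parse_gpu_line(line):
    """Turn one CSV line such as '71, 345.20, 23950' into a dict."""
    temp, power, mem = (_to_float(f) for f in line.strip().split(","))
    return {"temp_c": temp, "power_w": power, "mem_mib": mem}


def read_gpu():
    """Query nvidia-smi once and return the reading for the first GPU."""
    out = subprocess.run(
        ["nvidia-smi", f"--query-gpu={GPU_QUERY}", "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    ).stdout
    return parse_gpu_line(out.splitlines()[0])


def append_durably(path, text):
    """Append one line and force it to disk so it survives a sudden power-off."""
    with open(path, "a") as f:
        f.write(text + "\n")
        f.flush()
        os.fsync(f.fileno())


def log_forever(path="gpu_log.csv", interval_s=1.0):
    """Log a timestamped reading every interval_s seconds until interrupted."""
    while True:
        r = read_gpu()
        stamp = time.strftime("%H:%M:%S")
        append_durably(path, f"{stamp},{r['temp_c']},{r['power_w']},{r['mem_mib']}")
        time.sleep(interval_s)

# To use it, call log_forever() in a second terminal before starting training.
```

After a crash, the last few lines of the log show what the GPU was doing right before the machine went down.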
Thank you for the response. I understand your advice about monitoring the PC diagnostics. What makes the crash weird is that it only occurs when saving the model.
For example, the default value for the save steps is 25, so the model is saved after 25 steps, and that is when the PC crashes. However, when I set the parameter to 10, the same thing happens after 10 steps. If the main problem were in the hardware, I would expect the PC to crash whenever the temperature passes the limit or the PC can’t process any more tasks, not only at the moment of saving.
Monitoring might still give you a clue. It’s probably no accident that it occurs when saving. Though to rule out a coincidence, you could try running it for an extended period with saving turned off.
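I don't know exactly how the tutorial sets up training, so treat this as a sketch. If it goes through Hugging Face `transformers.TrainingArguments` (an assumption on my part), periodic checkpointing can be switched off with `save_strategy="no"`. The helper name and the baseline values below are hypothetical, apart from the 25 save steps you mentioned.

```python
def without_checkpoints(training_kwargs):
    """Return a copy of the keyword arguments with periodic saving disabled."""
    overrides = dict(training_kwargs)
    overrides["save_strategy"] = "no"  # a valid TrainingArguments value
    overrides.pop("save_steps", None)  # has no effect once saving is off
    return overrides


# Hypothetical baseline; only save_steps=25 comes from this thread.
baseline = {
    "output_dir": "./results",
    "per_device_train_batch_size": 4,
    "save_steps": 25,
}
test_run = without_checkpoints(baseline)

# Assuming the tutorial uses the Hugging Face Trainer:
# from transformers import TrainingArguments
# args = TrainingArguments(**test_run)
```

If the run then gets well past step 25 without crashing, that points at the save itself rather than at sustained load.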