Hello all, I recently discovered the wonderful world of Hugging Face and am doing some experiments with fine-tuning large language models. I am having an issue where my rig suddenly powers off in the middle of training under heavy dual-GPU load.
I am able to train gpt-neo-125M for 1 epoch on a 6,600-line text file; however, more than 1 epoch results in the rig suddenly powering off/halting in the middle of training.
I'm not sure if this is just a general gaming-rig-type issue, or whether there could be a bug in one of the drivers involved, maybe the NVIDIA driver or CUDA or something like that. I have checked all the usual factors and I don't see evidence of anything being overloaded or overheating.
The setup is:
Preexisting, in use for 4 years:
- large Cooler Master case
- Asus Zenith Extreme Alpha X399 motherboard
- Threadripper 2920
New for ML:
- two NVIDIA RTX 3090 GPUs
- Corsair h1500i 1500W power supply
- 64 GB RAM in a 4 x 16 GB setup (all new and matched)
- Samsung EVO NVMe drive for an ML dev environment
- fresh install of Ubuntu 22.04 as a dedicated ML boot environment on the Samsung NVMe
- conda for Python environment management
When I trained the gpt-neo-125m model for one epoch on the 6,000-line input file using the transformers .train() method, it used both GPUs (to my surprise, without any special config), and the training run seemed to have a mildly positive effect on the test outputs.
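For reference, the training script is roughly along these lines (a minimal sketch; the file name, batch size, and sequence length here are placeholders, not my exact values):

```python
from datasets import load_dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling,
                          Trainer, TrainingArguments)

model_name = "EleutherAI/gpt-neo-125M"
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token   # GPT-Neo has no pad token by default
model = AutoModelForCausalLM.from_pretrained(model_name)

# Load the line-based text file and tokenize each line.
raw = load_dataset("text", data_files={"train": "train.txt"})
tokenized = raw["train"].map(
    lambda batch: tokenizer(batch["text"], truncation=True, max_length=512),
    batched=True, remove_columns=["text"])

collator = DataCollatorForLanguageModeling(tokenizer, mlm=False)  # causal LM

args = TrainingArguments(
    output_dir="gpt-neo-finetune",
    num_train_epochs=1,              # 1 epoch completes fine; more triggers the power-off
    per_device_train_batch_size=4,
)

# With two visible GPUs and no distributed launcher, Trainer wraps the model in
# DataParallel on its own, which is presumably why both 3090s were used without
# any special config.
trainer = Trainer(model=model, args=args,
                  train_dataset=tokenized, data_collator=collator)
trainer.train()
```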
However, when I take the next step and try to train for multiple epochs, the computer suddenly powers off in the middle of training. This has happened three times in a row; I cannot even get through 5 epochs. Mid-training, most of the rig suddenly loses power, though the motherboard info display screen stays lit and the motherboard RGB stays on. (So if it was a PSU issue, it didn't cut every rail.)
The obvious answer is that some resource is overheating or overloaded, but I don't see evidence of that. I reproduced the problem while monitoring all temperatures, and nothing is out of spec. The Threadripper goes to 74°C Tctl and Tdie is 57°C. (I read that Tctl is calculated as Tdie + 17°C, so 57 + 17 = 74 is consistent with that.) The GPUs don't report being hot at all; I don't remember the exact numbers, but the temperatures were well within spec.
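If it helps, a simple way to capture GPU temperature and power draw right up to the cut-off is a polling loop like this, writing to a file each second so the last readings survive the power-off (a rough sketch; the nvidia-smi query fields are standard, everything else is arbitrary):

```python
import datetime
import subprocess
import time

# GPU fields to poll via nvidia-smi; one CSV row per GPU per sample.
QUERY = "temperature.gpu,power.draw,utilization.gpu"

with open("gpu_monitor.log", "a", buffering=1) as log:   # line-buffered
    while True:
        out = subprocess.run(
            ["nvidia-smi", f"--query-gpu={QUERY}", "--format=csv,noheader"],
            capture_output=True, text=True).stdout.strip()
        log.write(f"{datetime.datetime.now().isoformat()} {out}\n")
        time.sleep(1)
```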
I looked through the Ubuntu logs and don't see anything interesting; the logging just stops suddenly. Looking in /var/log/syslog, I have identified the power-off and reboot points and I don't see anything that looks interesting. I ran journalctl -g 'temperature|critical' and found nothing about elevated temperatures. It is just business as usual in the logs, with different things happening at each power-off point, and a bunch of NULs suddenly appearing in the log stream as the last thing before the reboot process is logged. Apparently this happens when the journaling file system prepares to write new data to the area that shows up as NULs but is unexpectedly interrupted before it can do so.
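If anyone wants to reproduce the log inspection, a quick way to find the NUL run and dump the last intact lines before it is something like this (a rough sketch; the path and context window are arbitrary):

```python
# Locate a run of NUL bytes in the syslog and print the last intact lines before it.
with open("/var/log/syslog", "rb") as f:
    data = f.read()

pos = data.find(b"\x00" * 16)            # first run of 16 NUL bytes
if pos != -1:
    context = data[max(0, pos - 2000):pos].decode(errors="replace")
    for line in context.splitlines()[-10:]:
        print(line)
else:
    print("no NUL run found")
```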
I wanted to get info from the PSU about what was happening, but unfortunately Corsair does not make a monitoring tool for Ubuntu, and the old OpenCorsairLink package doesn't work any more (init 99 error, as others have reported online).
I tested the RAM with Windows Memory Diagnostics, which didn't find any problems.
Anyway, I'm wondering if anyone has experience with this and what suggestions there might be for troubleshooting further.
Thanks for any ideas!