I am trying to finetune Flan-T5 on my desktop computer, which has two Titan X GPUs.
TensorFlow seems to work fine up to flan-t5-base; it runs out of memory with the large model.
However, I am getting some very weird results with the PyTorch version of the model. The computer simply shuts down when the network does not fit into memory.
I used the command nvidia-smi -pl 150 to reduce the power limit. That seems to work the first time I run the script in a freshly set up environment. If I try to run the same script again, the computer shuts down once more!
However, if I set up a new environment from scratch it’s all good.
I don’t understand why this is happening. TensorFlow simply throws an OOM error if there’s a problem. The shutdowns only happen with PyTorch.
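
For reference, here is a rough sketch of how the power limit from above can be applied to both cards from Python. The GPU indices 0 and 1 and the helper names power_limit_commands and apply_power_limit are my own assumptions for illustration; only the nvidia-smi flags -i and -pl come from the tool itself. As far as I understand, a limit set with -pl does not survive a reboot, so it has to be reapplied after every shutdown.

```python
import shutil
import subprocess


def power_limit_commands(watts, gpu_indices):
    """Build one `nvidia-smi -i <gpu> -pl <watts>` command per GPU index."""
    if watts <= 0:
        raise ValueError("watts must be positive")
    return [["nvidia-smi", "-i", str(i), "-pl", str(watts)] for i in gpu_indices]


def apply_power_limit(watts=150, gpu_indices=(0, 1), dry_run=True):
    """Print the commands (dry run) or execute them.

    Executing usually needs root privileges. The indices (0, 1) are an
    assumption for a two-GPU desktop, not something nvidia-smi guarantees.
    """
    commands = power_limit_commands(watts, gpu_indices)
    for cmd in commands:
        if dry_run or shutil.which("nvidia-smi") is None:
            # Only show what would be run.
            print(" ".join(cmd))
        else:
            subprocess.run(cmd, check=True)
    return commands


if __name__ == "__main__":
    apply_power_limit(dry_run=True)
```

With dry_run=True this only prints the two commands, so it is safe to try before running the real thing with dry_run=False.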