I have a strange problem running transformer models on Google Cloud VM instances.
As soon as training is supposed to start, the program hangs without any error message at
***** Running training *****
Num examples = 6838
Num Epochs = 5
Instantaneous batch size per device = 8
Total train batch size (w. parallel, distributed & accumulation) = 8
Gradient Accumulation steps = 1
Total optimization steps = 4275
Number of trainable parameters = 109490699
0%| | 0/4275 [00:00<?, ?it/s]
nvidia-smi shows that the GPU is idle, and top shows the process is idle as well.
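Not a fix, but a way to see *where* the process is stuck: arming the standard-library `faulthandler` at the top of the training script makes the interpreter dump every thread's Python stack on demand (via a signal) or after a timeout, which usually shows whether the hang is in CUDA initialization, the data loader, or somewhere else. A minimal sketch (the signal choice and timeout are just examples):

```python
import faulthandler
import signal

# Register this before Trainer.train(). Then, from another shell:
#     kill -USR1 <pid>
# makes the stuck process dump all thread stacks to stderr.
faulthandler.register(signal.SIGUSR1)

# Alternatively, dump automatically if the process is still alive
# after 10 minutes, without killing it:
faulthandler.dump_traceback_later(600, exit=False)

print("faulthandler armed")
```

The traceback printed at the hang point is much more useful in a bug report than "stuck at 0/4275".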
I thought I might have an error in my own scripts, but the same hang occurs even with the official example scripts (e.g., the ones used in the Google Colab tutorials). In Colab itself, those scripts run without a problem. The hang happens both with the "Deep Learning VM" image and with a barebones Linux VM; the system and Python packages are all up to date.
I have seen some posts around the web describing similar problems, but no real solution. Has anyone figured this out?