Transformers on GCP: training stuck at start

Hi everyone,

I have a strange problem with running transformer models on Google Cloud VM instances.

When training should start, the program hangs without an error message at:
***** Running training *****
Num examples = 6838
Num Epochs = 5
Instantaneous batch size per device = 8
Total train batch size (w. parallel, distributed & accumulation) = 8
Gradient Accumulation steps = 1
Total optimization steps = 4275
Number of trainable parameters = 109490699
0%| | 0/4275 [00:00<?, ?it/s]

nvidia-smi shows that the GPU is idle, and top shows that the process is idle as well.

I thought there might be an error in my scripts, but the same problem occurs even when I run one of the examples (i.e., the Google Colab ones).

In Colab, the scripts run without a problem. The hang occurs both with the “Deep Learning VM” image and with a bare-bones Linux VM. The system and the Python packages are all up to date.

I have seen some postings around the web with similar problems but no real solution. I wonder if someone figured it out?

I am running into a similar issue (also on GCP), even though I am using a different script. Training gets stuck immediately.

Did you find any solution?

This was the solution for me.

As agademic wrote, uninstalling torch_xla did the trick for me.
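
In case it helps anyone else, here is a small sketch for checking whether torch_xla is present in the environment your training script actually uses. The helper names (is_installed, uninstall_command) are made up for illustration. It only prints the pip command rather than running it, so nothing is removed unless you run that command yourself.

```python
import importlib.util
import sys


def is_installed(module_name):
    """Return True if a top-level module can be found in this environment."""
    # find_spec looks the module up without importing it.
    return importlib.util.find_spec(module_name) is not None


def uninstall_command(package_name):
    """Build the pip command that would remove the package for this interpreter."""
    # Using sys.executable targets the same Python that runs this script.
    return [sys.executable, "-m", "pip", "uninstall", "-y", package_name]


if __name__ == "__main__":
    if is_installed("torch_xla"):
        print("torch_xla is installed; to remove it, run:")
        print(" ".join(uninstall_command("torch_xla")))
    else:
        print("torch_xla is not installed in this environment.")
```

Run it with the same Python interpreter (or virtual environment) as the training job, otherwise it may report on a different set of packages.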
