Transformers on GCP Training stuck on start

PhilW · March 4, 2023, 4:36pm

Hi everyone,

I have a strange problem with running transformer models on Google Cloud VM instances.

When training should start, the program hangs without an error message at:
***** Running training *****
Num examples = 6838
Num Epochs = 5
Instantaneous batch size per device = 8
Total train batch size (w. parallel, distributed & accumulation) = 8
Gradient Accumulation steps = 1
Total optimization steps = 4275
Number of trainable parameters = 109490699
0%| | 0/4275 [00:00<?, ?it/s]

nvidia-smi shows that the GPU is idle, and top shows that the process is idle.

I thought the error might be in my scripts, but even when I try one of the examples (e.g., Google Colab), the same hang occurs.

In Colab, the scripts run without a problem. The hang occurs with both the “deep learning VM” image and a bare-bones Linux VM. The system and Python packages are all up to date.

I have seen some postings around the web with similar problems but no real solution. I wonder if someone figured it out?

agademic · March 17, 2023, 4:09pm

I am running into a similar issue (also on GCP), even though I am using a different script. Training gets stuck immediately.

Did you find any solution?

agademic · March 17, 2023, 4:35pm

This was the solution for me.

PhilW · March 22, 2023, 6:41pm

As agademic wrote, uninstalling torch_xla did the trick for me.
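
For anyone landing here later, a minimal sketch for checking whether torch_xla is present in the environment before removing it. The helper name `is_installed` is made up for illustration; it only looks the module up and does not import it.

```python
import importlib.util


def is_installed(module_name: str) -> bool:
    """Return True if a top-level module can be found in the current environment."""
    # find_spec locates the module without importing it
    return importlib.util.find_spec(module_name) is not None


if is_installed("torch_xla"):
    # Fix reported in this thread: remove the package, then rerun training
    print("torch_xla is installed; try: pip uninstall torch_xla")
else:
    print("torch_xla not found; the hang likely has another cause")
```

Run this with the same Python interpreter that runs the training script, since the VM may have more than one environment.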
