Trainer.train() stalls when running the tutorial example on a GCP instance

Hi,
I have used HF for quite some time and tried the Trainer interface following the example in the Fine-tune a pretrained model tutorial.
It runs fine on Google Colab; however, when I run it on a Google Cloud instance, it stalls at the very beginning, right after printing:

***** Running training *****
  Num examples = 1000
  Num Epochs = 3
  Instantaneous batch size per device = 8
  Total train batch size (w. parallel, distributed & accumulation) = 8
  Gradient Accumulation steps = 1
  Total optimization steps = 375
  Number of trainable parameters = 108314117

No error, nothing happens.

Steps to reproduce:

  • Create a compute instance on Google Cloud Platform (e.g. with a V100 GPU) using the PyTorch 1.13 image
  • pip install transformers datasets evaluate
  • Run this script: example.py - JustPaste.it (example code from the tutorial); a rough sketch of it is included below
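
For reference, here is a minimal sketch of what the script roughly does (an approximation of the Fine-tune a pretrained model example, not a copy of the linked example.py; the model name, dataset, and 1000-example subsets are assumptions based on the printed log):

from datasets import load_dataset
from transformers import (AutoTokenizer, AutoModelForSequenceClassification,
                          TrainingArguments, Trainer)

# Load and tokenize the Yelp reviews dataset, as in the tutorial
dataset = load_dataset("yelp_review_full")
tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")

def tokenize(batch):
    return tokenizer(batch["text"], padding="max_length", truncation=True)

tokenized = dataset.map(tokenize, batched=True)
small_train = tokenized["train"].shuffle(seed=42).select(range(1000))
small_eval = tokenized["test"].shuffle(seed=42).select(range(1000))

# 5-class classification head on top of BERT (~108M trainable parameters)
model = AutoModelForSequenceClassification.from_pretrained("bert-base-cased", num_labels=5)

training_args = TrainingArguments(output_dir="test_trainer", num_train_epochs=3)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=small_train,
    eval_dataset=small_eval,
)
trainer.train()  # on the GCP instance this hangs right after the "Running training" banner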

Does anyone have an idea what is going wrong?


I also encountered the same problem.
Did you resolve this issue?

@stormrye
I solved this issue: please uninstall torch_xla.

pip uninstall torch_xla
pip3 uninstall torch_xla
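
In case it helps others, here is a quick way to check whether torch_xla is the likely culprit before uninstalling (a diagnostic sketch; the idea that an installed torch_xla makes Trainer silently target the XLA device on a GPU-only machine is my assumption about the cause):

import importlib.util
import torch
from transformers import TrainingArguments

# If torch_xla is importable, Trainer may route training through XLA even on a
# plain GPU instance, which can look like a silent hang at the start of training.
print("torch_xla installed:", importlib.util.find_spec("torch_xla") is not None)
print("CUDA available:", torch.cuda.is_available())

args = TrainingArguments(output_dir="tmp")
print("Trainer device:", args.device)  # expect cuda:0 on a V100 instance once torch_xla is gone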