Trainer.train() stalls when running the tutorial example on a GCP instance

Hi,
I have used HF for quite some time and tried the Trainer interface following the example in the Fine-tune a pretrained model tutorial.
It runs fine on Google Colab; however, when I run it on a Google Cloud instance, it stalls at the very beginning, right after printing:

***** Running training *****
  Num examples = 1000
  Num Epochs = 3
  Instantaneous batch size per device = 8
  Total train batch size (w. parallel, distributed & accumulation) = 8
  Gradient Accumulation steps = 1
  Total optimization steps = 375
  Number of trainable parameters = 108314117

No error, nothing happens.

Steps to reproduce:

  • Create a compute instance on Google Cloud Platform (e.g. with a V100 GPU) using the PyTorch 1.13 image
  • pip install transformers datasets evaluate
  • Run this script: example.py - JustPaste.it (example code from the tutorial); a rough sketch of it is included below
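
For reference, here is a minimal sketch of what the script roughly does (an approximation of the Fine-tune a pretrained model example, not a copy of the linked example.py; the model name, dataset, and 1000-example subsets are assumptions based on the printed log):

from datasets import load_dataset
from transformers import (AutoTokenizer, AutoModelForSequenceClassification,
                          TrainingArguments, Trainer)

# Load and tokenize the Yelp reviews dataset, as in the tutorial
dataset = load_dataset("yelp_review_full")
tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")

def tokenize(batch):
    return tokenizer(batch["text"], padding="max_length", truncation=True)

tokenized = dataset.map(tokenize, batched=True)
small_train = tokenized["train"].shuffle(seed=42).select(range(1000))
small_eval = tokenized["test"].shuffle(seed=42).select(range(1000))

# 5-class classification head on top of BERT (~108M trainable parameters)
model = AutoModelForSequenceClassification.from_pretrained("bert-base-cased", num_labels=5)

training_args = TrainingArguments(output_dir="test_trainer", num_train_epochs=3)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=small_train,
    eval_dataset=small_eval,
)
trainer.train()  # on the GCP instance this hangs right after the "Running training" banner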

Does anyone have an idea what is going wrong?


I also encountered the same problem.
Did you resolve this issue?

@stormrye
I solved this issue: please uninstall torch_xla.

pip uninstall torch_xla
pip3 uninstall torch_xla
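
In case it helps others, here is a quick way to check whether torch_xla is the likely culprit before uninstalling (a diagnostic sketch; the idea that an installed torch_xla makes Trainer silently target the XLA device on a GPU-only machine is my assumption about the cause):

import importlib.util
import torch
from transformers import TrainingArguments

# If torch_xla is importable, Trainer may route training through XLA even on a
# plain GPU instance, which can look like a silent hang at the start of training.
print("torch_xla installed:", importlib.util.find_spec("torch_xla") is not None)
print("CUDA available:", torch.cuda.is_available())

args = TrainingArguments(output_dir="tmp")
print("Trainer device:", args.device)  # expect cuda:0 on a V100 instance once torch_xla is gone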