TPU Trainer with multiple cores

Hello, I'd like to train a SQuAD model using the 8-core TPU offered by Google Colab.
I followed the tutorial for fine-tuning with the Trainer API by @sgugger.
The tutorial states that the Trainer supports TPUs out of the box, so the only things I added to my code are the XLA library install/imports; I then wrapped the .train() call inside _mp_fn() and passed it to xmp.spawn(), specifying 8 as the number of cores.
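Roughly, my setup looks like this (a simplified sketch; the FLAGS dict and the fork start method follow the PyTorch/XLA Colab examples):

# Sketch of what I added on top of the tutorial code: the Trainer is still
# built in the notebook cells as before, only .train() is wrapped.
import torch_xla.distributed.xla_multiprocessing as xmp

def _mp_fn(rank, flags):
    trainer.train()  # `trainer` is the Trainer created earlier in the notebook

FLAGS = {}
xmp.spawn(_mp_fn, args=(FLAGS,), nprocs=8, start_method="fork")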

But I get the following error:

Exception in device=TPU:0: Cannot replicate if number of devices (1) is different from 8

After some research, I found out that this error is raised when the XLA device is initialized outside the spawned processes. I don't make any such call myself, so maybe it happens inside one of the Hugging Face functions, but how do I avoid it?
The code works just fine with 1 core, but I'd like to use all 8.
This is my notebook:

You need to define the TrainingArguments inside your multiprocessing function, and you should also define your model in that function:

def _mp_fn(rank, flags):
    # model_checkpoint, batch_size, tokenized_datasets and tokenizer come from
    # the earlier notebook cells, exactly as in the tutorial.
    data_collator = default_data_collator
    # Both the model and the TrainingArguments are built inside the spawned
    # process, as explained above.
    model = AutoModelForQuestionAnswering.from_pretrained(model_checkpoint)
    model_name = model_checkpoint.split("/")[-1]
    args = TrainingArguments(
        f"{model_name}-finetuned-squad",
        evaluation_strategy="epoch",
        learning_rate=2e-5,
        per_device_train_batch_size=batch_size,
        per_device_eval_batch_size=batch_size,
        num_train_epochs=3,
        weight_decay=0.01,
        push_to_hub=False,
        tpu_num_cores=8,
    )
    # Newly created tensors default to CPU float32 tensors.
    torch.set_default_tensor_type('torch.FloatTensor')
    trainer = Trainer(
        model,
        args,
        train_dataset=tokenized_datasets["train"],
        eval_dataset=tokenized_datasets["validation"],
        data_collator=data_collator,
        tokenizer=tokenizer,
    )

    trainer.train()

This should work properly.
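For reference, here is roughly how that function is launched from the notebook (a minimal sketch following the PyTorch/XLA Colab examples; FLAGS is just a placeholder dict):

import torch_xla.distributed.xla_multiprocessing as xmp

FLAGS = {}  # anything you want to forward to the workers
# 'fork' lets the spawned workers inherit the globals defined in the notebook
xmp.spawn(_mp_fn, args=(FLAGS,), nprocs=8, start_method="fork")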

Hi. I tried passing tpu_num_cores=8 to the TrainingArguments class. I am also running my code on Google Colab. Then I encountered the following error message.

2022-04-19 15:51:14.312185: E tensorflow/compiler/xla/xla_client/xla_util.cc:88] *** Begin stack trace ***
2022-04-19 15:51:14.312193: E tensorflow/compiler/xla/xla_client/xla_util.cc:88] 	tensorflow::CurrentStackTrace()
2022-04-19 15:51:14.312202: E tensorflow/compiler/xla/xla_client/xla_util.cc:88] 	xla::util::ReportComputationError(tensorflow::Status const&, absl::lts_20211102::Span<xla::XlaComputation const* const>, absl::lts_20211102::Span<xla::Shape const* const>)
2022-04-19 15:51:14.312211: E tensorflow/compiler/xla/xla_client/xla_util.cc:88] 	xla::util::ShapeHash(xla::Shape const&)
2022-04-19 15:51:14.312219: E tensorflow/compiler/xla/xla_client/xla_util.cc:88] 	xla::XrtComputationClient::ExecuteComputation(xla::ComputationClient::Computation const&, absl::lts_20211102::Span<std::shared_ptr<xla::ComputationClient::Data> const>, std::string const&, xla::ComputationClient::ExecuteComputationOptions const&)
2022-04-19 15:51:14.312227: E tensorflow/compiler/xla/xla_client/xla_util.cc:88] 	
2022-04-19 15:51:14.312234: E tensorflow/compiler/xla/xla_client/xla_util.cc:88] 	xla::util::MultiWait::Complete(std::function<void ()> const&)
2022-04-19 15:51:14.312240: E tensorflow/compiler/xla/xla_client/xla_util.cc:88] 	
2022-04-19 15:51:14.312246: E tensorflow/compiler/xla/xla_client/xla_util.cc:88] 	
2022-04-19 15:51:14.312253: E tensorflow/compiler/xla/xla_client/xla_util.cc:88] 	
2022-04-19 15:51:14.312259: E tensorflow/compiler/xla/xla_client/xla_util.cc:88] 	clone
2022-04-19 15:51:14.312265: E tensorflow/compiler/xla/xla_client/xla_util.cc:88] *** End stack trace ***
2022-04-19 15:51:14.312271: E tensorflow/compiler/xla/xla_client/xla_util.cc:88] 
2022-04-19 15:51:14.312277: E tensorflow/compiler/xla/xla_client/xla_util.cc:88] Status: INTERNAL: From /job:tpu_worker/replica:0/task:0:
2022-04-19 15:51:14.312289: E tensorflow/compiler/xla/xla_client/xla_util.cc:88] 2 root error(s) found.
2022-04-19 15:51:14.312300: E tensorflow/compiler/xla/xla_client/xla_util.cc:88]   (0) INTERNAL: stream did not block host until done; was already in an error state
2022-04-19 15:51:14.312310: E tensorflow/compiler/xla/xla_client/xla_util.cc:88] 	 [[{{node XRTExecute}}]]
2022-04-19 15:51:14.312320: E tensorflow/compiler/xla/xla_client/xla_util.cc:88] 	 [[XRTExecute_G15]]
2022-04-19 15:51:14.312329: E tensorflow/compiler/xla/xla_client/xla_util.cc:88]   (1) INTERNAL: stream did not block host until done; was already in an error state
2022-04-19 15:51:14.312338: E tensorflow/compiler/xla/xla_client/xla_util.cc:88] 	 [[{{node XRTExecute}}]]
2022-04-19 15:51:14.312348: E tensorflow/compiler/xla/xla_client/xla_util.cc:88] 0 successful operations.
2022-04-19 15:51:14.312364: E tensorflow/compiler/xla/xla_client/xla_util.cc:88] 0 derived errors ignored.
  0% 1/21250 [00:00<2:20:22,  2.52it/s]Traceback (most recent call last):
  File "train_bert.py", line 71, in <module>
    dataset_default_key='train'
  File "/content/common-crawal-preprocess/model_trainings/procedure_torch/train_model.py", line 69, in main
    trainer_obj.train()
  File "/usr/local/lib/python3.7/dist-packages/transformers/trainer.py", line 1306, in train
    for step, inputs in enumerate(epoch_iterator):
  File "/usr/local/lib/python3.7/dist-packages/torch_xla/distributed/parallel_loader.py", line 34, in __next__
    return self.next()
  File "/usr/local/lib/python3.7/dist-packages/torch_xla/distributed/parallel_loader.py", line 46, in next
    xm.mark_step()
  File "/usr/local/lib/python3.7/dist-packages/torch_xla/core/xla_model.py", line 787, in mark_step
    wait=xu.getenv_as('XLA_SYNC_WAIT', bool, False))
RuntimeError: INTERNAL: From /job:tpu_worker/replica:0/task:0:
2 root error(s) found.
  (0) INTERNAL: stream did not block host until done; was already in an error state
	 [[{{node XRTExecute}}]]
	 [[XRTExecute_G15]]
  (1) INTERNAL: stream did not block host until done; was already in an error state
	 [[{{node XRTExecute}}]]
0 successful operations.
0 derived errors ignored.

What does this message mean?
Note that the error does not appear when I do not specify tpu_num_cores at all.

Thank you! It works properly now.

@kensuke-mi I couldn't reproduce your error. Did you try restarting the runtime?

@Kioto97 You're right, I needed to restart the Colab instance. After reinitializing, the error went away. So, is your training much faster with tpu_num_cores=8? It actually becomes slower in my configuration. I tested it with per_device_train_batch_size=16.

To be honest, I haven't managed to run it on Colab yet; even with a lower batch size I get an error:

process 4 terminated with signal SIGKILL

I suspect this is due to an insufficient amount of RAM. Did you change anything else?
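One thing I still want to try for the RAM issue is torch_xla's MpModelWrapper, which, as far as I understand, loads the model weights once in the parent process and shares them with the forked workers instead of keeping a full copy per process. A rough, untested sketch of what I have in mind (reusing the names from the notebook above):

import torch_xla.core.xla_model as xm
import torch_xla.distributed.xla_multiprocessing as xmp

# Load the weights once in the main process; the wrapper shares them with
# the forked workers instead of each worker calling from_pretrained().
WRAPPED_MODEL = xmp.MpModelWrapper(
    AutoModelForQuestionAnswering.from_pretrained(model_checkpoint)
)

def _mp_fn(rank, flags):
    device = xm.xla_device()
    model = WRAPPED_MODEL.to(device)  # moved to this worker's TPU core
    # ... build the TrainingArguments and Trainer here as above, passing `model` ...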