Struggling with training on TPU using the 'accelerate' library

My code is as follows. Is there anything wrong with it? If not, why does it run so slowly (or freeze) on a Kaggle TPU device…

!pip install cloud-tpu-client==0.10 torch==1.9.0 https://storage.googleapis.com/tpu-pytorch/wheels/torch_xla-1.9-cp37-cp37m-linux_x86_64.whl
!pip install accelerate

from accelerate import Accelerator
import torch_xla.core.xla_model as xm


def training_function():

	dataloader = get_dataloader()  # NLP task, all batches are padded to the same length.
	model = get_model()
	optimizer = get_optimizer(model)

	accelerator = Accelerator()
	model, optimizer, dataloader = accelerator.prepare(model, optimizer, dataloader)

	for _ in range(20):
		for batch in dataloader:
			accelerator.backward(model(**batch).loss)
			optimizer.step()
			optimizer.zero_grad()

			xm.mark_step()  # <- is this needed?


if "__main__" == __name__:
	from accelerate import notebook_launcher

	notebook_launcher(training_function)  # Kaggle, TPU

You don’t need to add xm.mark_step(); this is done automatically by Accelerate.
I don’t see anything wrong with your code. Is the launcher telling you it’s launching training on 8 TPUs?
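
For reference, here is a minimal sketch (not taken from your code) of how I’d double-check the launch from a Kaggle notebook. The num_processes=8 argument and the check_tpu_setup name are just illustrative; the Accelerator has to be created inside the launched function so each process binds to its own TPU core:

from accelerate import Accelerator, notebook_launcher

def check_tpu_setup():
	accelerator = Accelerator()
	# accelerator.print only prints from the main process, so you see one line instead of eight.
	accelerator.print(f"device: {accelerator.device}, num_processes: {accelerator.num_processes}")

notebook_launcher(check_tpu_setup, num_processes=8)  # should report 8 processes on a Kaggle TPU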

Thank you for your answer.
Yes, the launcher tells me it’s launching training on 8 TPUs every time. But why can it only be 1 or 8 TPUs, and not some other number?

That’s on Google; they don’t allow anything else :slight_smile:
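
For context: a Kaggle TPU is a single TPU v3-8 board, i.e. 8 cores, so the launcher can either use one core or spawn one process per core (all 8). If you want to see the cores yourself, here is a quick sketch assuming the torch_xla 1.9 API from your install line:

import torch_xla.core.xla_model as xm

# List the XLA devices visible to this process (the 8 TPU cores on Kaggle).
print(xm.get_xla_supported_devices())

# World size, i.e. number of participating processes: 1 outside the launcher, 8 inside it.
print(xm.xrt_world_size())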