With Accelerate on a Colab TPU, all devices are always xla:0 and none of them is_main_process

Here is the code

So, it seems like a bug, but I’m not sure.

I’m not sure where in the notebook you see that. Could you post a minimal reproducer? Printing the accelerator.device in the training function shows 8 different devices.

I just launched the notebook, and I’m getting this output from my prints:

    print(accelerator.is_main_process)
    print(accelerator.device)

    False
    xla:0
    xla:0
    False
    xla:0
    False
    xla:0
    False
    xla:0
    False
    xla:0
    False
    xla:0

So none of them is the main process and all of them are on xla:0, while I expected xla:0 through xla:7 and exactly one main process.

I also left the outputs in the notebook, so you can see the output from the training loop.

You are right, it looks like it assigns device 0 to every process except one, which gets device 1 (in my example). Not sure what’s going on, I’m investigating.

Is there any news on this topic yet?

I have the same problem, and I think it could be related to wandb (wandb login / wandb.init).
Everything works fine if I comment out that section, but if I leave it in, device 1 gets stuck there every time.

I also tried calling wandb login / wandb.init before using the notebook_launcher, but then I get an error message from WandB because of the new pid…

This is all completely normal according to the PyTorch XLA team. You can’t trust the device, as it’s going to give you xla:0 all the time. But accelerator.process_index will give you the right process index and only one process will have accelerator.is_main_process=True.
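
As a minimal sketch of that point: identify each process by accelerator.process_index and accelerator.is_main_process rather than by accelerator.device. The helper name `describe_process` and the `FakeAccelerator` stand-in below are made up for illustration so the snippet runs without a TPU; the stand-in assumes the main process is the one with index 0. Only `process_index`, `num_processes`, `is_main_process` and `device` are real Accelerator attributes.

```python
from dataclasses import dataclass


def describe_process(accelerator) -> str:
    """Identify a process by its index and main-process flag, not by its device."""
    role = "main" if accelerator.is_main_process else "worker"
    return (
        f"process {accelerator.process_index} of {accelerator.num_processes} "
        f"({role}), device reported as {accelerator.device}"
    )


@dataclass
class FakeAccelerator:
    """Hypothetical stand-in for accelerate.Accelerator, for illustration only."""
    process_index: int
    num_processes: int = 8
    device: str = "xla:0"  # on TPU every process may report the same device

    @property
    def is_main_process(self) -> bool:
        # Assumption for this sketch: the main process is index 0.
        return self.process_index == 0


# Simulate the 8 TPU processes: same device string, distinct process indices.
reports = [describe_process(FakeAccelerator(i)) for i in range(8)]
for line in reports:
    print(line)
```

In a real run you would call `describe_process(accelerator)` with the `Accelerator()` instance created inside the training function that you pass to notebook_launcher.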

That’s true, but why does the wandb login not work when I use the notebook_launcher?
Is there any way to use wandb in a TPU runtime with the accelerator, or do I have to choose between the accelerator and wandb?

My problem is that I don’t even get an error message.
The main process just never gets past the line with the login, while the other cores continue to work normally.

Sorry if this is the wrong topic, but since the error occurred in the same place, I thought this was the right place for my question.

This sounds like an issue you should ask the Wandb people about :slight_smile: