Hi! I am using a GCP v4-16 TPU Pod to train an LLM. I am learning how to do large-scale distributed model training on TPUs, and I have run into a few issues using accelerate on TPU:
- Accelerate fails to recognize the XLA device; when I run the training script without accelerate, I do not have this issue. I have tried setting up a config with `accelerate config`, and it still fails to find the TPU device (see the sketch after this list).
- Does accelerate support multi-worker/multi-node TPU training, similar to `Multi-GPU` in `distributed_type`? I might be wrong here since I haven't used accelerate before - should I manually assign workers to accelerate, or does it handle that by itself?
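To make the first issue more concrete, here is roughly the check I'm running on the pod (a minimal sketch, simplified from the actual training script):

```python
# Minimal repro, simplified from my training script: compare which device each path sees.
import torch_xla.core.xla_model as xm
from accelerate import Accelerator

# Plain torch_xla: this returns an XLA device (e.g. "xla:0") on each TPU VM worker.
device = xm.xla_device()
print("torch_xla device:", device)

# Through accelerate: I expected the same XLA device here,
# but accelerator.device does not pick up the TPU for me.
accelerator = Accelerator()
print("accelerate device:", accelerator.device)
```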
Thanks!