I’m trying to get custom model training working, following “Hugging Face on PyTorch / XLA TPUs”. I can start the process just fine with the command from “Train Your Transformer on Cloud TPUs”.
However, I’m training on a custom 16 GB txt dataset. Each spawned process seems to load and cache the dataset independently, so my 200 GB of disk space fills up quickly and the run fails.
Is there a way to keep only one instance of the dataset (or its cache) on disk, shared across all processes?
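For context, the kind of pattern I was hoping for is a build-once, read-many cache guarded by a lock: the first process does the expensive load and writes a cache, and the others block and then read that cache instead of rebuilding it. Here’s a rough stdlib-only sketch of the idea (`get_dataset` and `build_fn` are just illustrative names of my own, not from the tutorial; a real setup would presumably use something like an XLA barrier or the `datasets` library’s own caching instead):

```python
import fcntl
import json
import os

def get_dataset(cache_path, build_fn):
    """Build the dataset once and share it via an on-disk cache.

    The first process to grab the exclusive lock runs build_fn and
    writes the result to cache_path; every other process blocks on
    the lock and then just reads the cached copy, so the dataset
    exists on disk a single time instead of once per process.
    """
    lock_path = cache_path + ".lock"
    with open(lock_path, "w") as lock:
        fcntl.flock(lock, fcntl.LOCK_EX)  # blocks until we own the lock
        try:
            if not os.path.exists(cache_path):
                # Expensive step: load + tokenize the raw txt corpus.
                data = build_fn()
                with open(cache_path, "w") as f:
                    json.dump(data, f)
            with open(cache_path) as f:
                return json.load(f)
        finally:
            fcntl.flock(lock, fcntl.LOCK_UN)
```

With this, calling `get_dataset` from every worker would only run `build_fn` in one of them; the rest would pick up the cache. Is there a built-in way to get this behavior from the training script itself?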