Model training without downloading data to local storage

Hi. I’m working on language model pretraining with BertForMaskedLM. My data is already tokenized and saved on Google Cloud Storage (GCS) in Arrow format. The total size of the dataset is around 300GB. My compute environment is Google Colab with a TPU accelerator.

When I used a TensorFlow-based codebase, it could stream data (in TFRecord format) directly from GCS to the TPU, meaning it never downloaded the data to local storage. Roughly the kind of pipeline shown in the sketch below.
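For context, this is roughly what the TensorFlow pipeline looked like (the bucket path is just a placeholder): tf.data reads the TFRecord shards straight from the gs:// URL, so nothing is copied to the Colab disk.

```python
import tensorflow as tf

# Placeholder bucket/path; tf.data streams these shards directly from GCS.
files = tf.data.Dataset.list_files("gs://my-bucket/pretraining/*.tfrecord")
dataset = tf.data.TFRecordDataset(files, num_parallel_reads=tf.data.AUTOTUNE)
dataset = dataset.batch(16).prefetch(tf.data.AUTOTUNE)
```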

Now my codebase is PyTorch-based. Is it possible to do the same thing as in TensorFlow, i.e., can a PyTorch-based codebase stream data directly from GCS to the TPU? Since Google Colab’s local storage is quite small, it’s impossible to download the 300GB dataset…
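To make the question concrete, the kind of thing I have in mind is a sketch like the one below: an IterableDataset that opens the Arrow shards on GCS with gcsfs and yields tokenized examples without touching local disk. The bucket path and column names are just placeholders, and I don’t know whether this is a reasonable approach on TPU, which is why I’m asking.

```python
import gcsfs
import pyarrow as pa
import torch
from torch.utils.data import DataLoader, IterableDataset


class GcsArrowDataset(IterableDataset):
    """Streams rows out of Arrow shards stored on GCS, one shard at a time."""

    def __init__(self, shard_paths):
        self.shard_paths = shard_paths  # e.g. ["my-bucket/pretraining/shard-00000.arrow", ...]

    def __iter__(self):
        fs = gcsfs.GCSFileSystem()  # reads over the network, nothing is written to local disk
        for path in self.shard_paths:
            with fs.open(path, "rb") as f:
                # HF datasets .arrow files are typically in the Arrow streaming format;
                # if yours are in the Arrow file format, use pa.ipc.open_file() instead.
                for batch in pa.ipc.open_stream(f):
                    for row in batch.to_pylist():
                        yield {
                            "input_ids": torch.tensor(row["input_ids"], dtype=torch.long),
                            "attention_mask": torch.tensor(row["attention_mask"], dtype=torch.long),
                        }


loader = DataLoader(GcsArrowDataset(["my-bucket/pretraining/shard-00000.arrow"]), batch_size=16)
```

On TPU I assume this plain DataLoader would then be wrapped with torch_xla’s parallel loader (torch_xla.distributed.parallel_loader.MpDeviceLoader) so batches get moved to the XLA device, but I’m not sure whether this streams fast enough to keep the TPU busy.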

Thank you!