Model training without downloading data to local storage

Hi. I’m working on language model pretraining with BertForMaskedLM. My data is already tokenized and saved on Google Cloud Storage (GCS) in Arrow format. The total size of the dataset is around 300GB. My compute environment is Google Colab with a TPU accelerator.

When I used a TensorFlow-based codebase, it could stream data (in TFRecord format) directly from GCS to the TPU, meaning it never downloaded the data to local storage. Roughly the kind of pipeline shown in the sketch below.
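For context, this is roughly what the TensorFlow pipeline looked like (the bucket path is just a placeholder): tf.data reads the TFRecord shards straight from the gs:// URL, so nothing is copied to the Colab disk.

```python
import tensorflow as tf

# Placeholder bucket/path; tf.data streams these shards directly from GCS.
files = tf.data.Dataset.list_files("gs://my-bucket/pretraining/*.tfrecord")
dataset = tf.data.TFRecordDataset(files, num_parallel_reads=tf.data.AUTOTUNE)
dataset = dataset.batch(16).prefetch(tf.data.AUTOTUNE)
```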

Now my codebase is PyTorch-based. Is it possible to do the same thing as in TensorFlow, i.e., can a PyTorch-based codebase stream data directly from GCS to the TPU? Since Google Colab’s local storage is quite small, it’s impossible to download the 300GB dataset…
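To make the question concrete, the kind of thing I have in mind is a sketch like the one below: an IterableDataset that opens the Arrow shards on GCS with gcsfs and yields tokenized examples without touching local disk. The bucket path and column names are just placeholders, and I don’t know whether this is a reasonable approach on TPU, which is why I’m asking.

```python
import gcsfs
import pyarrow as pa
import torch
from torch.utils.data import DataLoader, IterableDataset


class GcsArrowDataset(IterableDataset):
    """Streams rows out of Arrow shards stored on GCS, one shard at a time."""

    def __init__(self, shard_paths):
        self.shard_paths = shard_paths  # e.g. ["my-bucket/pretraining/shard-00000.arrow", ...]

    def __iter__(self):
        fs = gcsfs.GCSFileSystem()  # reads over the network, nothing is written to local disk
        for path in self.shard_paths:
            with fs.open(path, "rb") as f:
                # HF datasets .arrow files are typically in the Arrow streaming format;
                # if yours are in the Arrow file format, use pa.ipc.open_file() instead.
                for batch in pa.ipc.open_stream(f):
                    for row in batch.to_pylist():
                        yield {
                            "input_ids": torch.tensor(row["input_ids"], dtype=torch.long),
                            "attention_mask": torch.tensor(row["attention_mask"], dtype=torch.long),
                        }


loader = DataLoader(GcsArrowDataset(["my-bucket/pretraining/shard-00000.arrow"]), batch_size=16)
```

On TPU I assume this plain DataLoader would then be wrapped with torch_xla’s parallel loader (torch_xla.distributed.parallel_loader.MpDeviceLoader) so batches get moved to the XLA device, but I’m not sure whether this streams fast enough to keep the TPU busy.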

Thank you!