Understanding set_transform

I’ve been working on a side project that uses phonetic English language models for text generation. Since I’m not aware of any existing phonetic English datasets, I’ve been preprocessing existing English text datasets with my phonemization script to give myself enough training data: mainly OSCAR for pretraining the model, and my own small datasets for fine-tuning on specific tasks.

My workflow has been:

  1. downloading the 2.5TB oscar_en shuffled text file
  2. converting it (in chunks) to its phonetic representation and saving the resulting text files to disk
  3. batch-tokenizing those files and saving them to a local HuggingFace dataset, since it takes hours (or days) to tokenize the whole thing at the beginning of a training run (rough sketch of this step below)
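For reference, that tokenize-and-save step currently looks roughly like this (the paths, glob pattern, and tokenizer name below are placeholders, not my actual setup):

```python
from datasets import load_dataset
from transformers import AutoTokenizer

# placeholder name – mine is a custom phoneme-level tokenizer
tokenizer = AutoTokenizer.from_pretrained("my-phoneme-tokenizer")

# load the already-phonemized text files as a plain text dataset
phonemized = load_dataset("text", data_files={"train": "phonemized/*.txt"})["train"]

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=512)

# tokenize once up front and persist the result, so training runs can just load_from_disk()
tokenized = phonemized.map(tokenize, batched=True, remove_columns=["text"])
tokenized.save_to_disk("tokenized_phonemized_oscar")
```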

Even with only 3% of the original OSCAR corpus phonemized, my dataset is already over 1.2TB on disk. I was okay with that – I’m running out of local storage, but I was never going to be able to use the whole OSCAR corpus anyway on my rinky-dink home setup.

But this month has brought two things to HuggingFace – OSCAR in the datasets library, and on-the-fly transforms.

Am I right in understanding that I could load the oscar_en corpus through the HF datasets library, then pass set_transform a function that phonemizes and tokenizes the samples, and the only hit to my disk would be the Arrow cache of the original OSCAR dataset? And that I’d be able to quickly resume training from my most recent checkpoint, since it’d just be loading from that cache?
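In other words, something like this sketch (I’m guessing at the exact OSCAR config name, the tokenizer name is a placeholder, and phonemize() stands in for my own phonemizer script):

```python
from datasets import load_dataset
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("my-phoneme-tokenizer")  # placeholder name

# this should only write the Arrow cache of the raw OSCAR text to disk
oscar = load_dataset("oscar", "unshuffled_deduplicated_en", split="train")

def phonemize_and_tokenize(batch):
    # phonemize() is a stand-in for my own phonemizer script's function
    phonemes = [phonemize(text) for text in batch["text"]]
    return tokenizer(phonemes, truncation=True, max_length=512,
                     padding="max_length", return_tensors="pt")

# applied lazily whenever rows are accessed; nothing extra written to disk
oscar.set_transform(phonemize_and_tokenize)
```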

I imagine it’ll slow the overall training down, and I might not be able to feed my GPUs as quickly as I’d like, but the simplified workflow might be worth the performance hit (especially if I find another bug in my phonemizer script that makes me want to redo everything).

Building off of that, if one wanted to do a BART-style pretraining, would it be possible to start with a single-column text dataset and pass set_transform a function that returns the tokenized original text as the targets and a randomly-masked version of those tokens as the inputs, all on the fly? [Forgive me if this last question is dumb or nonsensical – I have a very limited understanding of seq2seq training / how BART works]
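To make that question concrete, here’s the kind of transform I’m imagining. Note that this uses plain random token masking rather than BART’s actual span-infilling/denoising objectives, and the tokenizer name is just a placeholder:

```python
import torch
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("facebook/bart-base")  # placeholder tokenizer

def mask_transform(batch):
    enc = tokenizer(batch["text"], truncation=True, max_length=512,
                    padding="max_length", return_tensors="pt")
    labels = enc["input_ids"].clone()

    # mask ~15% of non-special tokens in the inputs
    # (simple token masking, not BART's span infilling / sentence permutation)
    special = torch.tensor(
        [tokenizer.get_special_tokens_mask(ids, already_has_special_tokens=True)
         for ids in labels.tolist()],
        dtype=torch.bool)
    mask = (torch.rand(labels.shape) < 0.15) & ~special
    input_ids = labels.clone()
    input_ids[mask] = tokenizer.mask_token_id

    return {"input_ids": input_ids,
            "attention_mask": enc["attention_mask"],
            "labels": labels}

# single_column_dataset.set_transform(mask_transform)  # masking happens on the fly
```

(A nice side effect, if I understand set_transform correctly, would be that the random masking changes every time a sample is accessed, rather than being fixed once at preprocessing time.)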