Understanding set_transform

I’ve been working on a side project that uses phonetic English language models for text generation. Since I’m not aware of any existing phonetic English datasets, I’ve been preprocessing existing English text datasets with my phonemization script to give myself enough training data: mainly OSCAR for pretraining the model, and my own small datasets for fine-tuning on specific tasks.

My workflow has been:

  1. downloading the 2.5TB oscar_en shuffled text file
  2. processing it (in chunks) to its phonetic representation and saving those text files to disk
  3. batch tokenizing those files and saving them to a local HuggingFace dataset, because it takes hours (or days) to tokenize the whole thing at the beginning of a training

Even with only 3% of the original OSCAR corpus phonemized, my dataset is up to over 1.2TB on disk. I was okay with that – I’m running out of local storage, but I was never going to be able to use the whole OSCAR corpus anyway on my rinky-dink home setup.

But this month has brought two things to HuggingFace – OSCAR in the datasets library, and on-the-fly transforms.

Am I right in understanding that I could load the oscar_en corpus from the HF datasets library, then pass to set_transform a function that would phonemize and tokenize the samples, and the only hit to my disk would be the arrow cache of the original OSCAR dataset? And that I would be able to quickly resume training from my most recent checkpoint, since it’d just be loading from that cache?
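
Concretely, I’m picturing something like this rough sketch (the OSCAR config name and tokenizer are just placeholders, and phonemize here is a stand-in for my real script):

```python
from datasets import load_dataset
from transformers import AutoTokenizer

# Only the original OSCAR text gets cached to disk as Arrow files.
# (Config name and tokenizer are placeholders for whatever I end up using.)
dataset = load_dataset("oscar", "unshuffled_deduplicated_en", split="train")
tokenizer = AutoTokenizer.from_pretrained("gpt2")

def phonemize(text):
    # Stand-in for my actual phonemizer script.
    return text

def phonemize_and_tokenize(batch):
    # Called on each batch at __getitem__ time; nothing is written to disk.
    phonemes = [phonemize(text) for text in batch["text"]]
    return tokenizer(phonemes, truncation=True, max_length=512)

dataset.set_transform(phonemize_and_tokenize)

# Accessing an example triggers phonemization + tokenization on the fly.
print(dataset[0].keys())  # e.g. dict_keys(['input_ids', 'attention_mask'])
```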

I imagine it’ll slow the overall training down and I might not be able to feed my GPUs as quickly as I’d like, but the simplified workflow might be worth the performance hit (especially if I find another bug in my phonemizer script that makes me want to redo everything).

Building off of that, if one wanted to do a BART-style pretraining, would it be possible to start with a single-column dataset, and pass to set_transform a function that returns the tokenized original text as the targets and a randomly-masked version of those tokens as the inputs, all on the fly? [Forgive me if this last question is dumb or nonsensical, I have a very limited understanding of seq2seq training / how BART works]

set_transform does not cache the resulting data. Depending on the data/storage you have available, you may want to opt for map instead. Both have a low memory footprint.
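
Roughly, the difference looks like this (a minimal sketch with a dummy transform; the config name is just an example):

```python
from datasets import load_dataset

def process(batch):
    # Dummy stand-in for whatever per-batch processing you need.
    return {"n_chars": [len(text) for text in batch["text"]]}

ds = load_dataset("oscar", "unshuffled_deduplicated_en", split="train")

# map: computed once up front, results cached to Arrow files on disk.
processed = ds.map(process, batched=True)

# set_transform: computed on the fly at __getitem__ time, nothing cached.
ds.set_transform(process)
```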

Great, thanks for confirming that it doesn’t cache to disk. That’s exactly what I was hoping.

I guess now I’ll have to update to the latest master branches and start testing how much on-the-fly tokenization and other data transforms slow my training down.

Indeed if you use set_transform then the resulting phonemized data are created on-the-fly and not stored/cached. Only the original OSCAR data are stored on your disk as an arrow file.

And you’re right about your second point on BART-style pretraining: you can pass a function to set_transform that returns two fields, one that is the original text and one that is a randomly masked version, even if you have only one column in your dataset.
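
For example, something along these lines would work (just a sketch: the masking below is plain random token masking rather than BART’s actual span-infilling noise, and the model name is only a placeholder):

```python
import random
from datasets import Dataset
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("facebook/bart-base")  # placeholder

# Toy single-column dataset standing in for the real corpus.
dataset = Dataset.from_dict({"text": ["the quick brown fox", "jumps over the lazy dog"]})

def random_mask(ids, mask_prob=0.15):
    # Replace some tokens with <mask>; real BART noising is more involved.
    return [tokenizer.mask_token_id if random.random() < mask_prob else i for i in ids]

def noise_transform(batch):
    encoded = tokenizer(batch["text"], truncation=True)
    return {
        "input_ids": [random_mask(ids) for ids in encoded["input_ids"]],  # noisy inputs
        "attention_mask": encoded["attention_mask"],
        "labels": encoded["input_ids"],  # original tokens as targets
    }

dataset.set_transform(noise_transform)

# Both inputs and labels are built on the fly from the single "text" column.
print(dataset[0])
```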

Thanks, @lhoestq! That’s an incredibly cool and useful feature. Can’t wait to play with it.