Understanding set_transform

I’ve been working on a side project that uses phonetic English language models for text generation. Since I’m not aware of any existing phonetic English datasets, I’ve been preprocessing existing English text datasets with my phonemization script to give myself enough training data: mainly OSCAR for pretraining the model, and my own small datasets for fine-tuning on specific tasks.

My workflow has been:

  1. downloading the 2.5TB oscar_en shuffled text file
  2. converting it (in chunks) to its phonetic representation and saving the resulting text files to disk
  3. batch-tokenizing those files and saving them to a local HuggingFace dataset, since it takes hours (or days) to tokenize the whole thing at the beginning of a training run (rough sketch of this step below)
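For reference, that tokenize-and-save step currently looks roughly like this (the paths, glob pattern, and tokenizer name below are placeholders, not my actual setup):

```python
from datasets import load_dataset
from transformers import AutoTokenizer

# placeholder name – mine is a custom phoneme-level tokenizer
tokenizer = AutoTokenizer.from_pretrained("my-phoneme-tokenizer")

# load the already-phonemized text files as a plain text dataset
phonemized = load_dataset("text", data_files={"train": "phonemized/*.txt"})["train"]

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=512)

# tokenize once up front and persist the result, so training runs can just load_from_disk()
tokenized = phonemized.map(tokenize, batched=True, remove_columns=["text"])
tokenized.save_to_disk("tokenized_phonemized_oscar")
```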

Even with only 3% of the original OSCAR corpus phonemized, my dataset is already over 1.2TB on disk. I was okay with that – I’m running out of local storage, but I was never going to be able to use the whole OSCAR corpus anyway on my rinky-dink home setup.

But this month has brought two things to HuggingFace – OSCAR in the datasets library, and on-the-fly transforms.

Am I right in understanding that I could load the oscar_en corpus through the HF datasets library, then pass set_transform a function that phonemizes and tokenizes the samples, and the only hit to my disk would be the Arrow cache of the original OSCAR dataset? And that I’d be able to quickly resume training from my most recent checkpoint, since it’d just be loading from that cache?
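In other words, something like this sketch (I’m guessing at the exact OSCAR config name, the tokenizer name is a placeholder, and phonemize() stands in for my own phonemizer script):

```python
from datasets import load_dataset
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("my-phoneme-tokenizer")  # placeholder name

# this should only write the Arrow cache of the raw OSCAR text to disk
oscar = load_dataset("oscar", "unshuffled_deduplicated_en", split="train")

def phonemize_and_tokenize(batch):
    # phonemize() is a stand-in for my own phonemizer script's function
    phonemes = [phonemize(text) for text in batch["text"]]
    return tokenizer(phonemes, truncation=True, max_length=512,
                     padding="max_length", return_tensors="pt")

# applied lazily whenever rows are accessed; nothing extra written to disk
oscar.set_transform(phonemize_and_tokenize)
```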

I imagine it’ll slow the overall training down, and I might not be able to feed my GPUs as quickly as I’d like, but the simplified workflow might be worth the performance hit (especially if I find another bug in my phonemizer script that makes me want to redo everything).

Building off of that, if one wanted to do a BART-style pretraining, would it be possible to start with a single-column text dataset and pass set_transform a function that returns the tokenized original text as the targets and a randomly-masked version of those tokens as the inputs, all on the fly? [Forgive me if this last question is dumb or nonsensical – I have a very limited understanding of seq2seq training / how BART works]
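To make that question concrete, here’s the kind of transform I’m imagining. Note that this uses plain random token masking rather than BART’s actual span-infilling/denoising objectives, and the tokenizer name is just a placeholder:

```python
import torch
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("facebook/bart-base")  # placeholder tokenizer

def mask_transform(batch):
    enc = tokenizer(batch["text"], truncation=True, max_length=512,
                    padding="max_length", return_tensors="pt")
    labels = enc["input_ids"].clone()

    # mask ~15% of non-special tokens in the inputs
    # (simple token masking, not BART's span infilling / sentence permutation)
    special = torch.tensor(
        [tokenizer.get_special_tokens_mask(ids, already_has_special_tokens=True)
         for ids in labels.tolist()],
        dtype=torch.bool)
    mask = (torch.rand(labels.shape) < 0.15) & ~special
    input_ids = labels.clone()
    input_ids[mask] = tokenizer.mask_token_id

    return {"input_ids": input_ids,
            "attention_mask": enc["attention_mask"],
            "labels": labels}

# single_column_dataset.set_transform(mask_transform)  # masking happens on the fly
```

(A nice side effect, if I understand set_transform correctly, would be that the random masking changes every time a sample is accessed, rather than being fixed once at preprocessing time.)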