Understanding set_transform

lhoestq · February 21, 2021, 7:56pm

Indeed, if you use set_transform, the phonemized data are computed on the fly each time you access the dataset, and they are not stored or cached. Only the original OSCAR data are stored on your disk as an Arrow file.

And you’re right about your second point on BART-style pretraining: you can pass a function to set_transform that returns two fields, one with the original text and one with a randomly masked copy, even if your dataset has only one column.
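To make that concrete, here is a rough sketch of such a function. The column names "text" and "masked_text", the `MASK_TOKEN` value, and the word-level masking are all assumptions for illustration; real BART-style noising works on tokens and spans, so treat this as a stand-in rather than the actual pretraining recipe.

```python
import random

# Hypothetical mask string; in practice use your tokenizer's mask token.
MASK_TOKEN = "<mask>"


def add_masked_text(batch, mask_prob=0.15, rng=random):
    """Batch transform: keep the original text and add a randomly masked copy.

    `batch` is a dict of lists, which is what set_transform passes to the
    function. The "text" column name is an assumption about the dataset.
    """
    originals = batch["text"]
    masked = []
    for text in originals:
        words = text.split()
        # Replace each whitespace-separated word with the mask token
        # with probability mask_prob (a simplification of BART noising).
        masked.append(
            " ".join(MASK_TOKEN if rng.random() < mask_prob else w for w in words)
        )
    # Two output fields from a single input column.
    return {"text": originals, "masked_text": masked}


# Usage, assuming `dataset` is a datasets.Dataset with a single "text" column:
#   dataset.set_transform(add_masked_text)
#   dataset[0]  # both fields are computed at access time, nothing is cached
```

Since the function runs on every access, the masking is re-drawn each time a row is read, so each epoch sees a different masked version of the same text.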
