I’m trying to use the set_transform(…) method of a dataset loaded with load_dataset, together with DataCollatorForLanguageModeling and the Trainer class from the transformers library, to pretrain a model. My dataset is large: the tokenized version does not fit in RAM, and pre-tokenizing with .map() uses far too much disk space (> 500 GB), which is limited in my case. So I need to tokenize on the fly.
While set_transform works as expected when I index the dataset directly, I don’t understand why it fails once I plug it into a DataCollatorForLanguageModeling and a Trainer.
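For reference, my setup looks roughly like this (the checkpoint name, file path, and max_length are just placeholders for illustration):

```python
from datasets import load_dataset
from transformers import (
    AutoModelForMaskedLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForMaskedLM.from_pretrained("bert-base-uncased")

# raw text dataset; nothing is tokenized ahead of time
dataset = load_dataset("text", data_files={"train": "train.txt"})["train"]

def tokenize_on_the_fly(batch):
    # applied lazily on each __getitem__, so no tokenized copy is written to disk
    return tokenizer(batch["text"], truncation=True, max_length=512)

dataset.set_transform(tokenize_on_the_fly)

# dynamic padding + MLM masking happens in the collator
data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.15)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="out"),
    train_dataset=dataset,
    data_collator=data_collator,
)
trainer.train()
```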
hey @tkon3, when you say “it fails” what do you mean exactly? there’s a related thread on set_transform here that might be useful: Understanding set_transform
Ah, I think that’s because the Trainer removes unused columns like text by default (see docs). What happens if you set remove_unused_columns=False in your TrainingArguments?
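Something like this, as a minimal sketch (output_dir is just a placeholder):

```python
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="out",
    # keep the raw "text" column so the on-the-fly transform can still see it;
    # by default the Trainer drops columns that don't match the model's forward signature
    remove_unused_columns=False,
)
```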