Understanding set_transform

Hi @lhoestq! I’m finally getting around to testing some set_transform workflows and I have a question.

I’ve passed a fairly CPU-heavy preprocessing function to set_transform. After about an hour of training, I forced the training to stop and then tried to resume from the last checkpoint.

It’s been over 15 minutes of heavy CPU activity since I resumed, and the training progress indicator is still on step zero. [UPDATE: training finally resumed after 46 minutes] Is it possible that my transform function is being called on every sample as the trainer advances to the last checkpoint step?
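To illustrate what I suspect is happening, here’s a minimal stdlib-only mock of the lazy, on-access behavior I understand `set_transform` to have (the `LazyDataset` class and `heavy_transform` function are hypothetical stand-ins, not the actual `datasets` implementation):

```python
class LazyDataset:
    """Hypothetical mock: applies a transform on every access, never caching,
    which is my understanding of set_transform's on-the-fly semantics."""

    def __init__(self, data, transform):
        self.data = data
        self.transform = transform  # re-applied on each __getitem__

    def __len__(self):
        return len(self.data)

    def __getitem__(self, idx):
        return self.transform(self.data[idx])


calls = 0

def heavy_transform(x):
    # Stand-in for a CPU-heavy preprocessing function.
    global calls
    calls += 1
    return x * 2


ds = LazyDataset(list(range(10)), heavy_transform)

# "Resuming": if the trainer fast-forwards through already-seen samples
# to reach the checkpoint step, each access re-runs the transform.
skipped = [ds[i] for i in range(5)]
print(calls)  # transform ran once per skipped sample
```

If resuming really does iterate over every pre-checkpoint sample like this, that would explain the long delay, since the expensive transform is paid again for data that is never trained on.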

If that is what’s happening, I’m not sure whether the dataset or the trainer is responsible. Is there any way to avoid it?