Set_transform and group_by_length=True

tadf · June 9, 2021, 6:52am

To my understanding, set_transform should apply transformations on the fly so that the GPU can immediately use the result for training.

When I specify group_by_length=True on the trainer, set_transform no longer evaluates lazily: it goes through the whole dataset. My hunch is that it needs to do all the transformations first to be able to group by length.

Is this behavior intended? I think group_by_length should only need to look at one batch (or a smaller subset of the dataset), not the whole dataset.

sgugger · June 9, 2021, 1:12pm

No, group_by_length needs to read all the lengths of the dataset to be able to build batches of similar lengths.
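To illustrate why every length has to be known up front, here is a simplified sketch of length-grouped batching in plain Python. It is not the Trainer's actual code: the function name, the `mega_batch_mult` parameter, and the toy lengths are assumptions made for illustration. The point is that the sort needs a complete `lengths` list, and with a lazy transform, producing that list means running the transform on every example.

```python
import random

def length_grouped_indices(lengths, batch_size, mega_batch_mult=50, seed=0):
    """Hypothetical helper: return indices ordered so that consecutive
    batches contain examples of similar length."""
    rng = random.Random(seed)
    indices = list(range(len(lengths)))
    rng.shuffle(indices)  # keep some randomness between epochs

    mega_size = batch_size * mega_batch_mult
    grouped = []
    for start in range(0, len(indices), mega_size):
        mega_batch = indices[start:start + mega_size]
        # Sorting requires lengths[i] for every index in the dataset.
        mega_batch.sort(key=lambda i: lengths[i], reverse=True)
        grouped.extend(mega_batch)
    return grouped

# Toy data: the full list of lengths must exist before batching starts.
lengths = [5, 120, 7, 118, 6, 119, 8, 121]
order = length_grouped_indices(lengths, batch_size=2, mega_batch_mult=2)
batches = [order[i:i + 2] for i in range(0, len(order), 2)]
```

Each group of `batch_size * mega_batch_mult` shuffled indices is sorted by length, so neighbouring batches end up with similar lengths while the overall order still varies with the seed.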

tadf · June 10, 2021, 2:27am

Is there a way to restrict group_by_length to smaller subsets, without needing to shard the dataset?

sgugger · June 10, 2021, 8:02pm

No, this is not implemented.
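Since this is not built in, one possible workaround is to do the grouping yourself outside the Trainer. The sketch below is only an illustration under assumptions: `windowed_length_batches` and its parameters are hypothetical names, and `get_length` stands in for whatever computes one example's length (for instance, indexing the lazily transformed dataset). It groups by length inside one window at a time, so lengths are computed only for the window currently being batched.

```python
import random

def windowed_length_batches(num_examples, get_length, batch_size, window_size, seed=0):
    """Hypothetical generator: yield batches of indices, grouped by length
    within windows of `window_size` examples instead of the whole dataset."""
    rng = random.Random(seed)
    indices = list(range(num_examples))
    rng.shuffle(indices)

    for start in range(0, num_examples, window_size):
        window = indices[start:start + window_size]
        # get_length is called once per index, and only for this window.
        window.sort(key=get_length, reverse=True)
        for b in range(0, len(window), batch_size):
            yield window[b:b + batch_size]

# Toy usage: record which examples had their length computed.
toy_lengths = [5, 120, 7, 118, 6, 119, 8, 121]
length_calls = []

def toy_get_length(i):
    length_calls.append(i)
    return toy_lengths[i]

batch_iter = windowed_length_batches(
    len(toy_lengths), toy_get_length, batch_size=2, window_size=4
)
first_batch = next(batch_iter)
calls_after_first_batch = len(length_calls)  # only the first window was touched
remaining_batches = list(batch_iter)
```

The trade-off is that a smaller window means less up-front work but looser length matching. Each example is still transformed twice (once for its length, once when the batch is loaded) unless the results are cached.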
