Data sampler based on number of tokens

I’m training the large BART model and I have to set my batch size very low to avoid OOM errors (on a 24 GB RTX 3090). The issue is that my sequence lengths are highly variable: the long ones, close to the 1024-token limit, fit no more than 2 sequences per batch, but many of the shorter ones could be batched 8 or more at a time. Does transformers have a way to batch based on the number of tokens instead of a fixed batch size? (BTW… I’m already using DataCollatorForSeq2Seq and group_by_length=True for dynamic padding of batches.) A rough sketch of what I have in mind is below.
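For illustration, here’s a minimal sketch of the kind of sampler I’m imagining. The class name `MaxTokensBatchSampler` and the `max_tokens` budget are my own invention, not an existing transformers API; it just groups indices so that the padded token count of each batch stays under a budget:

```python
import torch
from torch.utils.data import Sampler


class MaxTokensBatchSampler(Sampler):
    """Yields batches of dataset indices whose padded token count stays
    under a budget, instead of using a fixed number of sequences per batch."""

    def __init__(self, lengths, max_tokens=8192, shuffle=True):
        self.shuffle = shuffle
        # Sort by length so similarly sized sequences share a batch,
        # which keeps padding waste low (same idea as group_by_length).
        order = sorted(range(len(lengths)), key=lambda i: lengths[i])
        self.batches, batch, max_len = [], [], 0
        for idx in order:
            new_max = max(max_len, lengths[idx])
            # Padded cost of a batch = longest sequence * number of sequences.
            # Close the current batch if adding this example would exceed
            # the budget. (A single over-budget sequence still gets its
            # own batch, since the check requires a non-empty batch.)
            if batch and new_max * (len(batch) + 1) > max_tokens:
                self.batches.append(batch)
                batch, new_max = [], lengths[idx]
            batch.append(idx)
            max_len = new_max
        if batch:
            self.batches.append(batch)

    def __iter__(self):
        # Shuffle batch order each epoch; batch composition stays fixed.
        if self.shuffle:
            order = torch.randperm(len(self.batches)).tolist()
        else:
            order = range(len(self.batches))
        return (self.batches[i] for i in order)

    def __len__(self):
        return len(self.batches)
```

With a plain PyTorch DataLoader this would presumably be wired up via `batch_sampler` (here `lengths` is a precomputed list of per-example token counts, and `data_collator` is the DataCollatorForSeq2Seq instance):

```python
from torch.utils.data import DataLoader

loader = DataLoader(
    train_dataset,
    batch_sampler=MaxTokensBatchSampler(lengths, max_tokens=4096),
    collate_fn=data_collator,
)
```

With Trainer, I assume I’d have to subclass it and override `get_train_dataloader` to plug this in, but I’d rather use something built in if it exists.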
