Constructing batches by token count

The default batch size in Hugging Face's TrainingArguments is 8 samples. However, when I fine-tune models like T5 on machine translation tasks, a batch size fixed to a number of samples is not efficient for GPU utilization.

I wonder whether Hugging Face Datasets has a way to construct batches by setting a token budget. For example, if we set the batch size to 3000 tokens, then each batch would contain a variable number of samples whose total token count is around 3000.
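For reference, here is a minimal sketch of the kind of token-budget batching I mean, written as a PyTorch batch sampler. This is not a built-in datasets/transformers feature; the class name `TokenBudgetBatchSampler` and the `max_tokens` parameter are just placeholders for illustration.

```python
# Hypothetical sketch of token-budget batching; not an existing
# datasets/transformers API. `TokenBudgetBatchSampler` and `max_tokens`
# are made-up names used only for illustration.
from torch.utils.data import DataLoader, Sampler


class TokenBudgetBatchSampler(Sampler):
    """Yield lists of example indices whose total token count stays near max_tokens."""

    def __init__(self, lengths, max_tokens=3000):
        self.lengths = lengths      # token length of each example
        self.max_tokens = max_tokens

    def __iter__(self):
        # Group examples of similar length together to reduce padding waste.
        order = sorted(range(len(self.lengths)), key=lambda i: self.lengths[i])
        batch, batch_tokens = [], 0
        for idx in order:
            n = self.lengths[idx]
            if batch and batch_tokens + n > self.max_tokens:
                yield batch
                batch, batch_tokens = [], 0
            batch.append(idx)
            batch_tokens += n
        if batch:
            yield batch

    def __len__(self):
        # Rough estimate of the number of batches.
        return max(1, sum(self.lengths) // self.max_tokens)


# Possible usage with an already-tokenized dataset that has an "input_ids"
# column; `tokenized_dataset`, `tokenizer`, and `model` are assumed to exist.
# from transformers import DataCollatorForSeq2Seq
# lengths = [len(ids) for ids in tokenized_dataset["input_ids"]]
# loader = DataLoader(
#     tokenized_dataset,
#     batch_sampler=TokenBudgetBatchSampler(lengths, max_tokens=3000),
#     collate_fn=DataCollatorForSeq2Seq(tokenizer, model=model),
# )
```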

Can anyone help me? This is urgent!

Thanks a lot ~!


I don't think I understand your inquiry. With Datasets you can set a max token size.
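If that refers to capping the token count of each example during tokenization (rather than per-batch token budgeting), a sketch would look like the following; the column name `source_text` and the toy dataset are placeholders I made up.

```python
# Possible interpretation: truncate each example to a maximum token length
# while tokenizing. The toy dataset and "source_text" column are placeholders.
from datasets import Dataset
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("t5-small")
raw_dataset = Dataset.from_dict(
    {"source_text": ["translate English to German: Hello world."]}
)

def tokenize(batch):
    # Truncate every example to at most 128 tokens.
    return tokenizer(batch["source_text"], truncation=True, max_length=128)

tokenized = raw_dataset.map(tokenize, batched=True)
```

Note that this bounds the length of each individual example, which is different from making every batch contain roughly 3000 tokens in total.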