Constructing batches by token count

The default batch size in Hugging Face's TrainingArguments is 8 samples. However, when I fine-tune models like T5 on machine translation tasks, a batch size fixed to a number of samples is not efficient for GPU utilization.

I wonder whether Hugging Face Datasets has a way to construct batches by setting a token budget. For example, if we set the batch size to 3000 tokens, then each batch would contain a variable number of samples whose total token count is around 3000.
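For reference, here is a minimal sketch of the kind of token-budget batching I mean, written as a PyTorch batch sampler. This is not a built-in datasets/transformers feature; the class name `TokenBudgetBatchSampler` and the `max_tokens` parameter are just placeholders for illustration.

```python
# Hypothetical sketch of token-budget batching; not an existing
# datasets/transformers API. `TokenBudgetBatchSampler` and `max_tokens`
# are made-up names used only for illustration.
from torch.utils.data import DataLoader, Sampler


class TokenBudgetBatchSampler(Sampler):
    """Yield lists of example indices whose total token count stays near max_tokens."""

    def __init__(self, lengths, max_tokens=3000):
        self.lengths = lengths      # token length of each example
        self.max_tokens = max_tokens

    def __iter__(self):
        # Group examples of similar length together to reduce padding waste.
        order = sorted(range(len(self.lengths)), key=lambda i: self.lengths[i])
        batch, batch_tokens = [], 0
        for idx in order:
            n = self.lengths[idx]
            if batch and batch_tokens + n > self.max_tokens:
                yield batch
                batch, batch_tokens = [], 0
            batch.append(idx)
            batch_tokens += n
        if batch:
            yield batch

    def __len__(self):
        # Rough estimate of the number of batches.
        return max(1, sum(self.lengths) // self.max_tokens)


# Possible usage with an already-tokenized dataset that has an "input_ids"
# column; `tokenized_dataset`, `tokenizer`, and `model` are assumed to exist.
# from transformers import DataCollatorForSeq2Seq
# lengths = [len(ids) for ids in tokenized_dataset["input_ids"]]
# loader = DataLoader(
#     tokenized_dataset,
#     batch_sampler=TokenBudgetBatchSampler(lengths, max_tokens=3000),
#     collate_fn=DataCollatorForSeq2Seq(tokenizer, model=model),
# )
```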

Can anyone help me? This is urgent!

Thanks a lot ~!


I don't think I understand your inquiry. With Datasets you can set a max token size.
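If that refers to capping the token count of each example during tokenization (rather than per-batch token budgeting), a sketch would look like the following; the column name `source_text` and the toy dataset are placeholders I made up.

```python
# Possible interpretation: truncate each example to a maximum token length
# while tokenizing. The toy dataset and "source_text" column are placeholders.
from datasets import Dataset
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("t5-small")
raw_dataset = Dataset.from_dict(
    {"source_text": ["translate English to German: Hello world."]}
)

def tokenize(batch):
    # Truncate every example to at most 128 tokens.
    return tokenizer(batch["source_text"], truncation=True, max_length=128)

tokenized = raw_dataset.map(tokenize, batched=True)
```

Note that this bounds the length of each individual example, which is different from making every batch contain roughly 3000 tokens in total.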