Are dynamic padding and smart batching in the library?

my code:

return tokenizer(list(dataset['sentense']),
                 padding=True,
                 truncation=True,
                 max_length=128)

training_args = TrainingArguments(
    output_dir='./results',          # output directory
    save_total_limit=5,              # maximum number of checkpoints to keep
    save_steps=5000,                 # save a checkpoint every 5000 steps
    num_train_epochs=20,             # total number of training epochs
    learning_rate=5e-5,              # learning rate
    per_device_train_batch_size=16,  # batch size per device during training
    per_device_eval_batch_size=16,   # batch size for evaluation
...

Hello, this is how I understand the docs:

If I want to use dynamic padding → padding=True
If I want to pad everything to a fixed length instead → padding='max_length'

Is that right?
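
For example, this is how I read the two options (where sentences stands for the same list(dataset['sentense']) as in my code above):

# padding=True: pad to the longest sequence passed in this call
encodings = tokenizer(sentences, padding=True, truncation=True, max_length=128)

# padding='max_length': always pad every sequence up to max_length
encodings = tokenizer(sentences, padding='max_length', truncation=True, max_length=128)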

I also want to use smart batching.
Does per_device_train_batch_size automatically enable this feature?
If not, I wonder if there is any way to enable smart batching.

Thanks!!


Hi,

This video makes it quite clear: What is dynamic padding? - YouTube

In order to use dynamic padding in combination with the Trainer, you typically postpone the padding: specify only truncation=True when preprocessing the dataset, and then use DataCollatorWithPadding when defining the data loaders, which will dynamically pad each batch.
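
A minimal sketch of that setup (the checkpoint, dataset, and column name below are just placeholders):

from datasets import load_dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          DataCollatorWithPadding, Trainer, TrainingArguments)

checkpoint = "bert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=2)

dataset = load_dataset("glue", "sst2", split="train")

# only truncate during preprocessing; no padding yet
def preprocess(examples):
    return tokenizer(examples["sentence"], truncation=True, max_length=128)

tokenized = dataset.map(preprocess, batched=True)

# the collator pads each batch to the longest sequence in that batch
data_collator = DataCollatorWithPadding(tokenizer=tokenizer)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="./results", per_device_train_batch_size=16),
    train_dataset=tokenized,
    data_collator=data_collator,
)
trainer.train()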


@nielsr
Thanks. What about smart batching? Is there a tutorial video for that?

If by smart batching you mean grouping together samples of the same length, it is implemented in the Trainer by adding group_by_length=True in your TrainingArguments.
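
For example, a minimal sketch (reusing the arguments from the first post; the values are just examples):

training_args = TrainingArguments(
    output_dir='./results',
    per_device_train_batch_size=16,
    group_by_length=True,   # sample batches from groups of examples with similar lengths
)

Combined with DataCollatorWithPadding, this keeps the amount of padding needed in each batch small.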


Just wondering - how would you usually do this without using the Trainer?


@sgugger Hi,
if I use group_by_length=True in my TrainingArguments,
does that mean I am applying smart batching? Please answer me.


Please do not post the same message three times and tag users aggressively like you did. You can always edit your message instead of reposting the same thing.


@sgugger Sorry, I thought no one heard me, like on the other website.
Can you please answer my question?

That is how I understand it.

What if we get this error: "Using pad_token, but it is not set yet." Seems that we must also specify what to pad with; it is not the default 0, I presume. Thanks!

I think you need to set a pad token like this:

tokenizer = AutoTokenizer.from_pretrained(model_name)
# reuse the end-of-sequence token as the padding token (e.g. for GPT-2-style models that define no pad token)
tokenizer.pad_token = tokenizer.eos_token

Yes, I did that, thank you. By eos_token, does that just repeat the last token in the sequence to fill the remaining positions so every sequence in the batch has the same length? So its value would depend on the last token it sees.

@sgugger This is completely off topic but do you think we could implement grouping by length inside a pipeline to prevent slowdowns due to large differences in sequence lengths? This would only be implemented for users that run the pipeline on a Dataset object. I’d be happy to contribute this. What would be an appropriate forum to discuss details?

Might be an interesting idea cc @Narsil

To streamline the discussion on dynamic/smart batching: are there existing libraries that do this well with native PyTorch/TensorFlow?

(Out of curiosity) I also asked this here: pytorch - Dynamic batching and padding batches for NLP in deep learning libraries - Data Science Stack Exchange