Are dynamic padding and smart batching in the library?

my code:

return tokenizer(list(dataset['sentense']),
                 padding=True,
                 truncation=True,
                 max_length=128)

training_args = TrainingArguments(
    output_dir='./results',          # output directory
    save_total_limit=5,              # maximum number of saved checkpoints
    save_steps=5000,                 # save a checkpoint every 5000 steps
    num_train_epochs=20,             # total number of training epochs
    learning_rate=5e-5,              # learning rate
    per_device_train_batch_size=16,  # batch size per device during training
    per_device_eval_batch_size=16,   # batch size per device during evaluation
...

Hello, my understanding of the docs is:

If I want dynamic padding → padding=True
If I want to pad everything to a fixed length → padding='max_length'
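
For example, I think the two options would look something like this (just a sketch based on my snippet above, so 'sentense' is my own column name):

# padding=True: pad to the longest sequence among the inputs passed in this call
encodings = tokenizer(list(dataset['sentense']), padding=True,
                      truncation=True, max_length=128)

# padding='max_length': pad every sequence to max_length (128 here)
encodings = tokenizer(list(dataset['sentense']), padding='max_length',
                      truncation=True, max_length=128)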

Is it right?

I also want to use smart batching.
Does per_device_train_batch_size automatically enable this feature?
If not, is there something that would let me use smart batching?

Thanks!!


Hi,

This video makes it quite clear: What is dynamic padding? - YouTube

In order to use dynamic padding in combination with the Trainer, one typically postpones the padding by only specifying truncation=True when preprocessing the dataset, and then uses DataCollatorWithPadding when defining the data loaders, which will dynamically pad each batch.
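
A minimal sketch of that setup (the model name, the dataset variable and the 'sentense' column are just placeholders taken from your snippet, not a fixed recipe):

from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          DataCollatorWithPadding, Trainer, TrainingArguments)

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

def tokenize_function(examples):
    # Only truncate here; leave the padding to the data collator
    return tokenizer(examples["sentense"], truncation=True, max_length=128)

tokenized_dataset = dataset.map(tokenize_function, batched=True)

# Pads every batch dynamically to the longest sequence in that batch
data_collator = DataCollatorWithPadding(tokenizer=tokenizer)

trainer = Trainer(
    model=AutoModelForSequenceClassification.from_pretrained("bert-base-uncased"),
    args=TrainingArguments(output_dir="./results"),
    train_dataset=tokenized_dataset,
    data_collator=data_collator,
)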


@nielsr
Thanks. What about smart batching? Is there a tutorial video for that too?

If by smart batching you mean grouping together samples of the same length, it is implemented in the Trainer by adding group_by_length=True in your TrainingArguments.
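
For example, a sketch that just extends the TrainingArguments from the first post:

from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir='./results',
    per_device_train_batch_size=16,
    group_by_length=True,   # group samples of similar length into the same batch
)

Combined with dynamic padding (e.g. DataCollatorWithPadding), this keeps the amount of padding per batch small.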


Just wondering - how would you usually do this without using the Trainer?


@sgugger Hi,
if I use group_by_length=True in my TrainingArguments,
does that mean smart batching is applied? Please answer me.


Please do not post the same message three times and tag users aggressively like you did. You can always edit your message instead of reposting the same thing.


@sgugger Sorry, I thought no one heard me, like on the other website.
Can you please answer my question?

I understand it like that.

What if we get this error: "Using pad_token, but it is not set yet."? It seems we must also specify what to pad with. It is not the default 0, I presume. Thanks!

I think you need to set a pad token, like this:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(model_name)
# Reuse the end-of-sequence token as the padding token
tokenizer.pad_token = tokenizer.eos_token

Yes, I did that, thank you. With eos_token, does it just repeat that last token to fill the remaining slots and reach a uniform length across the batch? So its value depends on the last token it sees?