Are dynamic padding and smart batching in the library?

my code:

return tokenizer(list(dataset['sentense']),
                 padding=True,
                 truncation=True,
                 max_length=128)

training_args = TrainingArguments(
    output_dir='./results',          # output directory
    save_total_limit=5,              # maximum number of checkpoints to keep
    save_steps=5000,                 # save a checkpoint every 5000 steps
    num_train_epochs=20,             # total number of training epochs
    learning_rate=5e-5,              # learning rate
    per_device_train_batch_size=16,  # batch size per device during training
    per_device_eval_batch_size=16,   # batch size for evaluation
...

Hello, this is how I understand the docs:

If I want to use dynamic padding → padding=True
If I want to pad everything to a fixed length instead → padding='max_length'

Is that right?
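
For example, this is how I read the two options (where sentences stands for the same list(dataset['sentense']) as in my code above):

# padding=True: pad to the longest sequence passed in this call
encodings = tokenizer(sentences, padding=True, truncation=True, max_length=128)

# padding='max_length': always pad every sequence up to max_length
encodings = tokenizer(sentences, padding='max_length', truncation=True, max_length=128)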

I also want to use smart batching.
Does per_device_train_batch_size automatically enable this feature?
If not, I wonder if there is any way to enable smart batching.

Thanks!!


Hi,

This video makes it quite clear: What is dynamic padding? - YouTube

In order to use dynamic padding in combination with the Trainer, you typically postpone the padding: specify only truncation=True when preprocessing the dataset, and then use DataCollatorWithPadding when defining the data loaders, which will dynamically pad each batch.
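
A minimal sketch of that setup (the checkpoint, dataset, and column name below are just placeholders):

from datasets import load_dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          DataCollatorWithPadding, Trainer, TrainingArguments)

checkpoint = "bert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=2)

dataset = load_dataset("glue", "sst2", split="train")

# only truncate during preprocessing; no padding yet
def preprocess(examples):
    return tokenizer(examples["sentence"], truncation=True, max_length=128)

tokenized = dataset.map(preprocess, batched=True)

# the collator pads each batch to the longest sequence in that batch
data_collator = DataCollatorWithPadding(tokenizer=tokenizer)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="./results", per_device_train_batch_size=16),
    train_dataset=tokenized,
    data_collator=data_collator,
)
trainer.train()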


@nielsr
Thanks. What about smart batching? Is there a tutorial video for that?

If by smart batching you mean grouping together samples of the same length, it is implemented in the Trainer by adding group_by_length=True in your TrainingArguments.
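
For example, a minimal sketch (reusing the arguments from the first post; the values are just examples):

training_args = TrainingArguments(
    output_dir='./results',
    per_device_train_batch_size=16,
    group_by_length=True,   # sample batches from groups of examples with similar lengths
)

Combined with DataCollatorWithPadding, this keeps the amount of padding needed in each batch small.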


Just wondering - how would you usually do this without using the Trainer?


@sgugger Hi,
if I use group_by_length=True in my TrainingArguments,
does that mean I am applying smart batching? Please answer me.


Please do not post the same message three times and tag users aggressively like you did. You can always edit your message instead of reposting the same thing.


@sgugger Sorry, I thought no one heard me, like on the other website.
Can you please answer my question?

That is how I understand it.

What if we get this error: "Using pad_token, but it is not set yet." Seems that we must also specify what to pad with; it is not the default 0, I presume. Thanks!

I think you need to set a pad token like this:

tokenizer = AutoTokenizer.from_pretrained(model_name)
# reuse the end-of-sequence token as the padding token (e.g. for GPT-2-style models that define no pad token)
tokenizer.pad_token = tokenizer.eos_token

Yes, I did that, thank you. By eos_token, does that just repeat the last token in the sequence to fill the remaining positions so every sequence in the batch has the same length? So its value would depend on the last token it sees.

@sgugger This is completely off topic but do you think we could implement grouping by length inside a pipeline to prevent slowdowns due to large differences in sequence lengths? This would only be implemented for users that run the pipeline on a Dataset object. I’d be happy to contribute this. What would be an appropriate forum to discuss details?

Might be an interesting idea cc @Narsil

To streamline the discussion on dynamic/smart batching: are there existing libraries that do this well with native PyTorch/TensorFlow?

(Out of curiosity) I also asked this here: pytorch - Dynamic batching and padding batches for NLP in deep learning libraries - Data Science Stack Exchange