training_args = TrainingArguments(
    output_dir='./results',          # output directory
    save_total_limit=5,              # limit on the total number of saved checkpoints
    save_steps=5000,                 # save a checkpoint every 5000 steps
    num_train_epochs=20,             # total number of training epochs
    learning_rate=5e-5,              # learning rate
    per_device_train_batch_size=16,  # batch size per device during training
    per_device_eval_batch_size=16,   # batch size for evaluation
    ...
Hello, this is how I understand the docs:
If I want to use dynamic padding → padding=True
If I want to pad everything to a fixed length instead → padding='max_length'
Is that right? (Something like the snippet below.)
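Here the checkpoint name is just a placeholder:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
texts = ["a short sentence", "a somewhat longer sentence with more tokens in it"]

# dynamic padding: pad to the longest sequence in this particular batch
batch_dynamic = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")

# fixed padding: always pad (and truncate) to max_length
batch_fixed = tokenizer(
    texts, padding="max_length", truncation=True, max_length=128, return_tensors="pt"
)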
I also want to use smart batching.
Does per_device_train_batch_size automatically enable this feature?
If not, is there another way for me to use smart batching?
In order to use dynamic padding in combination with the Trainer, one typically postpones the padding: only specify truncation=True when preprocessing the dataset, then use DataCollatorWithPadding when defining the data loaders, which will dynamically pad each batch.
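A minimal sketch of that setup (the checkpoint and dataset names are placeholders, not from the original question):

from datasets import load_dataset
from transformers import (
    AutoModelForSequenceClassification,
    AutoTokenizer,
    DataCollatorWithPadding,
    Trainer,
    TrainingArguments,
)

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
dataset = load_dataset("glue", "sst2")

# only truncate here; padding is postponed to the data collator
def preprocess(examples):
    return tokenizer(examples["sentence"], truncation=True)

tokenized = dataset.map(preprocess, batched=True)

trainer = Trainer(
    model=AutoModelForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2),
    args=TrainingArguments(output_dir="./results", per_device_train_batch_size=16),
    train_dataset=tokenized["train"],
    eval_dataset=tokenized["validation"],
    # pads each batch dynamically to the longest sequence in that batch
    data_collator=DataCollatorWithPadding(tokenizer=tokenizer),
)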
If by smart batching you mean grouping together samples of the same length, it is implemented in the Trainer by adding group_by_length=True in your TrainingArguments.
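For example, reusing the TrainingArguments from above:

training_args = TrainingArguments(
    output_dir='./results',
    per_device_train_batch_size=16,
    group_by_length=True,  # batches are drawn from groups of samples with similar lengths
)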
Please do not post the same message three times and tag users aggressively like you did. You can always edit your message instead of reposting the same thing.
What if we get this error: "Using pad_token, but it is not set yet."? It seems we must also specify what to pad with; I presume it is not 0 by default. Thanks!
Yes, I did that, thank you. Regarding eos_token: does it just repeat the last token in the sequence to fill the remaining positions so every sequence in the batch has the same length? In other words, does its value depend on the last token it sees?
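(For context, what I did is roughly this, assuming a GPT-2-style tokenizer with no pad token:)

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token  # padding will then use this token's fixed id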
@sgugger This is completely off topic, but do you think we could implement grouping by length inside a pipeline to prevent slowdowns due to large differences in sequence lengths? This would only be implemented for users that run the pipeline on a Dataset object. I'd be happy to contribute this. What would be an appropriate forum to discuss the details?