Purpose of padding and truncating

aclifton314 · July 21, 2020, 3:03pm

I have read the Preprocessing Data page. I understand what padding and truncating are doing, but I’m not sure I understand the reason for doing either of them. Can anyone help me understand the purpose for doing them? Thanks in advance!

valhalla · July 21, 2020, 3:24pm

Hi @aclifton314,
Padding:
Padding is used to make all examples the same length so that you can pack them into a batch; sequences of uneven length can’t be batched. So if a sequence is shorter than your max length, padding is used to make that sequence longer. Also, some models might expect fixed-length input, so padding helps there too.
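
To make the batching point concrete, here is a minimal plain-Python sketch of right-padding. It is not the tokenizer's own implementation: `pad_batch`, `PAD_ID`, and the token ids are made up for illustration, and the attention mask is included only to show how a model can tell real tokens from padding.

```python
# Illustrative only: plain-Python padding, not the library's implementation.
PAD_ID = 0  # assumed padding token id, chosen for this example


def pad_batch(sequences, pad_id=PAD_ID):
    """Right-pad every sequence to the length of the longest one.

    Returns the padded ids plus an attention mask
    (1 = real token, 0 = padding).
    """
    max_len = max(len(seq) for seq in sequences)
    input_ids, attention_mask = [], []
    for seq in sequences:
        n_pad = max_len - len(seq)
        input_ids.append(list(seq) + [pad_id] * n_pad)
        attention_mask.append([1] * len(seq) + [0] * n_pad)
    return input_ids, attention_mask


# Two "sentences" of different lengths (arbitrary token ids).
batch = [[11, 42, 7], [11, 42, 58, 99, 7]]
ids, mask = pad_batch(batch)
print(ids)   # [[11, 42, 7, 0, 0], [11, 42, 58, 99, 7]]
print(mask)  # [[1, 1, 1, 0, 0], [1, 1, 1, 1, 1]]
```

After padding, every row has the same length, so the batch can be stored as one rectangular tensor.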

Truncation:
Most models have a maximum length defined for them (there are exceptions: models with relative attention can take arbitrarily long sequences). For example, for BERT the max length is 512, so if one of your sequences is longer than that you can’t feed it in directly; you need to truncate it (drop the extra tokens) to make the sequence shorter.
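
As a rough sketch of what truncation does, assuming tokens are dropped from the end of the sequence (which, as far as I know, is the default side in the Hugging Face tokenizers): the `truncate` helper below is hypothetical, and a real tokenizer also has to leave room for special tokens such as [CLS] and [SEP].

```python
def truncate(token_ids, max_length):
    """Keep the first max_length tokens and drop everything after them.

    Nothing is chosen at random: the same input always yields the same output.
    """
    return list(token_ids[:max_length])


tokens = list(range(1, 11))           # a "sentence" of 10 arbitrary token ids
print(truncate(tokens, max_length=4)) # [1, 2, 3, 4]
print(truncate([5, 6], max_length=4)) # [5, 6] (short inputs are left alone)
```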

Hope this helps

aclifton314 · July 21, 2020, 3:27pm

@valhalla crystal clear. Thank you very much!

kasar3 · August 3, 2020, 12:23pm

So if I understand correctly, a sentence that is too long will have some of its tokens deleted, and those extra tokens won’t be used to train the model?
Are these tokens selected randomly, or if max_length=100 will the tokens beyond position 100 be deleted?

Thank you

valhalla · August 3, 2020, 12:44pm

Hi @kasar3, if you enable truncation then the extra tokens (those beyond max_length) will be deleted; they are not selected randomly. This guide explains padding and truncation in detail.

kasar3 · August 3, 2020, 1:25pm

Alright thank you.

I have another question, though I don’t know if this is the right place… For example, in the case of text classification using a BERT model (512 tokens max): let’s say that I have a sentence of 1000 tokens. What should I do to tokenize it the way BERT wants and still be able to classify it later? Should I split it manually into sentences of length < 512?
For example, split it into 2 sentences of 500 tokens while duplicating the classification label of the original sentence?

valhalla · August 3, 2020, 4:50pm

I don’t have a good answer for this, but I have seen people break longer documents into chunks, get the label for each chunk, and then aggregate them.

A second approach: with recent models like Longformer it’s now possible to train models with a large sequence length, so you won’t need to chunk the document.
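
One possible shape for the chunk-and-aggregate idea, as a plain-Python sketch. The helper names, the chunk size of 510 (512 minus two slots for [CLS] and [SEP]), the optional overlap, and majority voting as the aggregation rule are all assumptions for illustration; the per-chunk labels would come from running your classifier on each chunk.

```python
from collections import Counter


def chunk_tokens(token_ids, chunk_size=510, stride=0):
    """Split a long token list into windows of at most chunk_size tokens.

    stride is the number of tokens shared by consecutive chunks (0 = no overlap).
    """
    if not 0 <= stride < chunk_size:
        raise ValueError("stride must satisfy 0 <= stride < chunk_size")
    step = chunk_size - stride
    chunks = []
    for start in range(0, len(token_ids), step):
        chunks.append(token_ids[start:start + chunk_size])
        if start + chunk_size >= len(token_ids):
            break  # this window already reached the end of the document
    return chunks


def majority_label(chunk_labels):
    """Aggregate per-chunk predictions by majority vote."""
    return Counter(chunk_labels).most_common(1)[0][0]


doc = list(range(1000))                       # a 1000-token "document"
print([len(c) for c in chunk_tokens(doc)])    # [510, 490]
print(majority_label(["pos", "neg", "pos"]))  # pos
```

Averaging the per-chunk class probabilities instead of voting is another reasonable choice, especially when there are only two or three chunks.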

@sgugger might have a better answer.

sgugger · August 3, 2020, 5:01pm

More than the labels, I’d get the pooler output of each chunk and then average them before using the classifier. This will require you to change the code of BertForSequenceClassification a little bit.
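
A toy sketch of that idea in plain Python, only to show the order of operations: pool each chunk, average the pooled vectors, then apply the classifier once. The vectors, weights, and helper names are invented; in practice the pooled vectors would be tensors produced by the BERT model for each chunk, and the linear layer would be the model's classification head.

```python
def mean_pool(chunk_vectors):
    """Element-wise mean of the per-chunk pooled vectors."""
    n = len(chunk_vectors)
    return [sum(values) / n for values in zip(*chunk_vectors)]


def linear_classifier(vector, weights, bias):
    """Toy stand-in for a classification head: logits = W @ x + b."""
    return [
        sum(w * x for w, x in zip(row, vector)) + b
        for row, b in zip(weights, bias)
    ]


# Pretend pooled outputs for two chunks of the same document (hidden size 3).
pooled_chunks = [[1.0, 2.0, 3.0], [3.0, 4.0, 5.0]]
doc_vector = mean_pool(pooled_chunks)
print(doc_vector)  # [2.0, 3.0, 4.0]

# Made-up weights for a 2-class head.
weights = [[1.0, 0.0, 0.0], [0.0, 0.0, 1.0]]
bias = [0.0, 0.5]
print(linear_classifier(doc_vector, weights, bias))  # [2.0, 4.5]
```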
