Purpose of padding and truncating

I have read the Preprocessing Data page. I understand what padding and truncating are doing, but I’m not sure I understand the reason for doing either of them. Can anyone help me understand the purpose for doing them? Thanks in advance!

Hi @aclifton314,
padding:
Padding is used to make all examples the same length so that you can pack them into a batch; sequences of uneven length can't be batched together. So if a sequence is shorter than your max length, padding is used to make that sequence longer. Also, some models might expect fixed-length input, so padding helps there too.
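For illustration, here is a rough sketch with the transformers tokenizer API (the checkpoint name and sentences are just placeholders):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

batch = ["A short sentence.", "A somewhat longer sentence with more tokens in it."]

# padding=True pads every sequence up to the length of the longest one in the batch,
# so input_ids and attention_mask come out as one rectangular tensor each.
encoded = tokenizer(batch, padding=True, return_tensors="pt")
print(encoded["input_ids"].shape)    # one (batch_size, max_len_in_batch) tensor
print(encoded["attention_mask"][0])  # trailing 0s mark the padded positions
```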

truncation:
Most models have a max_length defined for them (there are exceptions: models with relative attention can take arbitrarily long sequences). For example, for BERT the max_length is 512, so if one of your sequences is longer than that you can't feed it to the model directly; you need to truncate it (drop the extra tokens) to make the sequence fit.
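A quick sketch of truncation with the same tokenizer API (the 512 limit here is BERT's; other models differ):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

long_text = "word " * 1000  # deliberately longer than BERT's 512-token limit

# truncation=True drops the tokens beyond max_length so the model can accept the input
encoded = tokenizer(long_text, truncation=True, max_length=512)
print(len(encoded["input_ids"]))  # 512
```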

Hope this helps :slight_smile:


@valhalla crystal clear. Thank you very much!

So if I understand right, a sentence that is too long will have some of its tokens deleted, and those extra tokens will not be used to train the future model?
Are these tokens selected randomly, or if max_length=100, are the tokens beyond the first 100 deleted?

Thank you :slight_smile:

Hi @kasar3, if you enable truncation then the extra tokens at the end of the sequence are dropped; they are not selected randomly. This guide explains padding and truncation in detail.
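One way to see this for yourself (the `max_length=10` here is just for illustration):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

text = "one two three four five six seven eight nine ten eleven twelve"

full = tokenizer(text)["input_ids"]
truncated = tokenizer(text, truncation=True, max_length=10)["input_ids"]

# the truncated ids keep the beginning of the full sequence and end with the
# [SEP] special token; everything past max_length is simply dropped
print(full)
print(truncated)
```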


Alright thank you. :+1:

I have another question, I don't know if it's the right place… For example, in the case of text classification using a BERT model (512 tokens max): let's say that I have a sentence of 1000 tokens. What should I do to tokenize it the way BERT wants and still be able to classify it later? Should I split it manually into sequences of length < 512?
For example, split it into 2 sequences of 500 tokens while duplicating the text classification label of the original sentence?

I don't have a good answer for this, but I have seen people breaking longer documents into chunks, getting the prediction for each chunk and then aggregating them.
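Not an official recipe, just a sketch of that chunk-then-aggregate idea, assuming a fine-tuned sequence classification checkpoint (the model name and `num_labels=2` below are placeholders) and averaging the chunk logits:

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)
model.eval()

def classify_long_text(text, chunk_size=510):  # 510 leaves room for [CLS] and [SEP]
    ids = tokenizer(text, add_special_tokens=False)["input_ids"]
    chunk_logits = []
    for start in range(0, len(ids), chunk_size):
        chunk = ids[start:start + chunk_size]
        # re-add the special tokens and run the chunk through the classifier
        input_ids = torch.tensor([tokenizer.build_inputs_with_special_tokens(chunk)])
        with torch.no_grad():
            chunk_logits.append(model(input_ids=input_ids).logits)
    # aggregate the per-chunk predictions, here by averaging the logits
    return torch.cat(chunk_logits).mean(dim=0).argmax().item()
```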

The second approach: with recent models like Longformer it's now possible to train with a much larger sequence length, so you won't need to chunk the document at all.
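For that second approach, something along these lines (the `allenai/longformer-base-4096` checkpoint accepts sequences up to 4096 tokens; `num_labels=2` is again a placeholder):

```python
from transformers import LongformerTokenizerFast, LongformerForSequenceClassification

tokenizer = LongformerTokenizerFast.from_pretrained("allenai/longformer-base-4096")
model = LongformerForSequenceClassification.from_pretrained(
    "allenai/longformer-base-4096", num_labels=2
)

long_document = "your roughly 1000-token document goes here"

# a 1000-token document fits in a single example, no chunking needed
inputs = tokenizer(long_document, truncation=True, max_length=4096, return_tensors="pt")
outputs = model(**inputs)
```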

@sgugger might have a better answer.

Rather than the labels, I'd get the pooler output of each chunk and then average them before using the classifier. This will require you to change the code of BertForSequenceClassification a little bit.
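A rough sketch of that idea, written with the plain BertModel rather than as an actual modification of BertForSequenceClassification (chunk size, checkpoint and `num_labels` are placeholders):

```python
import torch
from transformers import AutoTokenizer, BertModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
bert = BertModel.from_pretrained("bert-base-uncased")
num_labels = 2  # placeholder
classifier = torch.nn.Linear(bert.config.hidden_size, num_labels)

def long_text_logits(text, chunk_size=510):
    ids = tokenizer(text, add_special_tokens=False)["input_ids"]
    pooled = []
    for start in range(0, len(ids), chunk_size):
        chunk = ids[start:start + chunk_size]
        input_ids = torch.tensor([tokenizer.build_inputs_with_special_tokens(chunk)])
        # pooler_output is the pooled [CLS] representation of each chunk
        pooled.append(bert(input_ids=input_ids).pooler_output)
    # average the chunk representations, then classify the averaged vector
    return classifier(torch.cat(pooled).mean(dim=0))
```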
