How does padding in the Hugging Face tokenizer work?

I tried the following tokenization example:

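from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained(MODEL_TYPE, do_lower_case=True)
# A batch containing a single sentence (the original used a one-element tuple)
sent = ["I hate this. Not that."]
_tokenized = tokenizer(sent, padding=True, max_length=20, truncation=True)
# Decode and measure the first (and only) sequence in the batch
print(tokenizer.decode(_tokenized['input_ids'][0]))
print(len(_tokenized['input_ids'][0]))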

The output was:

[CLS] i hate this. not that. [SEP]
9

Notice the max_length=20 argument passed to the tokenizer. How can I make the BERT tokenizer append 11 [PAD] tokens to this sentence so that its total length is 20?

You need to change padding to "max_length". With padding=True (equivalent to padding="longest"), the tokenizer pads every sequence to the length of the longest sequence in the batch, while truncation=True cuts sequences longer than max_length down to max_length. In your example the batch contains only one sentence, so there is no padding (the only sentence is also the longest one). The sentence is also shorter than max_length, so there is no truncation either.
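
To make the difference concrete, here is a rough pure-Python sketch of how the two padding modes behave. The helper pad_batch is hypothetical and only imitates the semantics described above; it is not the library's implementation and it ignores details such as attention masks. With the real tokenizer, the equivalent change should simply be passing padding="max_length" instead of padding=True.

```python
PAD = "[PAD]"


def pad_batch(batch, padding=False, max_length=None, truncation=False):
    """Hypothetical helper imitating the tokenizer's padding semantics."""
    if truncation and max_length is not None:
        # Naive truncation; the real tokenizer also keeps special tokens intact
        batch = [seq[:max_length] for seq in batch]

    if padding is True or padding == "longest":
        # Pad to the longest sequence in this batch
        target = max(len(seq) for seq in batch)
    elif padding == "max_length":
        if max_length is None:
            raise ValueError("padding='max_length' needs max_length in this sketch")
        # Pad to a fixed length, regardless of what is in the batch
        target = max_length
    else:
        # padding=False: leave every sequence as it is
        return [list(seq) for seq in batch]

    return [list(seq) + [PAD] * (target - len(seq)) for seq in batch]


# The 9 tokens from the question, written out by hand
tokens = ["[CLS]", "i", "hate", "this", ".", "not", "that", ".", "[SEP]"]

longest = pad_batch([tokens], padding=True, max_length=20, truncation=True)
fixed = pad_batch([tokens], padding="max_length", max_length=20, truncation=True)

print(len(longest[0]))      # 9: a batch of one is already at its longest length
print(len(fixed[0]))        # 20
print(fixed[0].count(PAD))  # 11
```

With a batch of one, "longest" has nothing to pad against, which is why the original call returned 9 tokens.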

Great, thanks! It worked.
But how can one know that padding accepts the string value "max_length"? I went through both tokenizer pages, tokenizer and BertTokenizer, and neither states that padding accepts string values like "max_length". Now I am left guessing what else it might accept and where I can find the whole list.

Am I not reading carefully enough? Can you please tell me how to navigate the docs to find this information, or link to the page that states it?

Well, you can call help on an object or a specific method to see more information. For instance, help(tokenizer.__call__) will display the documentation of the method that you're using in your example. It's the safest bet, in my opinion. The implementation of that method is inherited from PreTrainedTokenizerBase, so the related docs are on the PreTrainedTokenizerBase page rather than on the BertTokenizer page.
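
As a small illustration of why this works, the sketch below uses two made-up classes (BaseTokenizer and ChildTokenizer are hypothetical stand-ins, not Transformers classes) to show that the docstring of an inherited method is reachable from the child instance, and how to find which class actually defines it. The same inspect.getdoc call should work on a real tokenizer's __call__, where, as far as I know, the documented values for padding are True or "longest", "max_length", and False or "do_not_pad".

```python
import inspect


class BaseTokenizer:
    """Hypothetical base class that owns the documented __call__."""

    def __call__(self, text, padding=False):
        """Tokenize text.

        padding: True or "longest", "max_length", or False / "do_not_pad".
        """
        return text.split()


class ChildTokenizer(BaseTokenizer):
    """Hypothetical subclass that does not override __call__."""


tok = ChildTokenizer()

# The docstring is inherited together with the method
doc = inspect.getdoc(tok.__call__)
print(doc)

# Walk the method resolution order to find the class that defines __call__
owner = next(cls.__name__ for cls in type(tok).__mro__ if "__call__" in vars(cls))
print(owner)  # BaseTokenizer
```

Knowing the owning class tells you which documentation page to open when the child class page does not list the parameter.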

That said, I do agree with you that not seeing this information on the child classes' pages can be quite confusing.

Yes and no.
Keeping inherited members out of a child class's docs might reduce clutter.
But listing them also serves a purpose: you can see at a glance everything that is available on a class without navigating the whole class hierarchy. A simple switch to show or hide inherited members would be useful. I believe Javadoc follows this approach.