How does padding in the Hugging Face tokenizer work?

RajSingh333 · November 22, 2021, 5:10pm

I tried the following tokenization example:

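from transformers import BertTokenizer

MODEL_TYPE = "bert-base-uncased"  # assumed checkpoint; the original post does not name one
tokenizer = BertTokenizer.from_pretrained(MODEL_TYPE, do_lower_case=True)
sent = ["I hate this. Not that."]  # a batch containing a single sentence
_tokenized = tokenizer(sent, padding=True, max_length=20, truncation=True)
print(tokenizer.decode(_tokenized['input_ids'][0]))  # decode with the same tokenizer object
print(len(_tokenized['input_ids'][0]))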

The output was:

[CLS] i hate this. not that. [SEP]
9

Notice the parameter passed to the tokenizer: max_length=20. How can I make the BERT tokenizer append 11 [PAD] tokens to this sentence so that its total length is 20?

adorkin · November 22, 2021, 6:38pm

You need to change padding to "max_length". With padding=True (equivalent to padding="longest"), the tokenizer pads every sentence to the length of the longest sentence in the batch, while truncation=True cuts sentences longer than max_length down to max_length. In your example the batch contains only one sentence, so there’s nothing to pad (the only sentence is also the longest one). That sentence is shorter than max_length, so there’s no truncation either.
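
To make the difference between the two settings concrete, here is a minimal plain-Python sketch of the padding and truncation logic described above. The helper pad_batch is hypothetical and is not part of transformers; the token ids are made up apart from the usual BERT values for [CLS] (101), [SEP] (102) and [PAD] (0), which are assumed here.

```python
PAD_ID = 0  # assumed id of [PAD], as in the standard BERT vocabularies


def pad_batch(batch, padding, max_length=None, pad_id=PAD_ID):
    """Hypothetical helper mimicking the padding/truncation behaviour described above."""
    # Truncation: cut every sequence down to max_length.
    # (A real tokenizer also keeps the trailing [SEP]; this sketch just slices.)
    if max_length is not None:
        batch = [ids[:max_length] for ids in batch]

    if padding in (True, "longest"):
        # Pad up to the longest sequence in this batch.
        target = max(len(ids) for ids in batch)
    elif padding == "max_length":
        # Pad up to a fixed length, regardless of the batch contents.
        if max_length is None:
            raise ValueError("padding='max_length' needs max_length in this sketch")
        target = max_length
    else:
        # False / "do_not_pad": leave the sequences as they are.
        return batch

    return [ids + [pad_id] * (target - len(ids)) for ids in batch]


# Nine ids standing in for "[CLS] i hate this . not that . [SEP]"
one_sentence = [[101, 1, 2, 3, 4, 5, 6, 4, 102]]

print(len(pad_batch(one_sentence, True, max_length=20)[0]))          # 9
print(len(pad_batch(one_sentence, "max_length", max_length=20)[0]))  # 20
```

With the real tokenizer, the corresponding call for the example in the question would be tokenizer(sent, padding="max_length", max_length=20, truncation=True).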

RajSingh333 · November 22, 2021, 7:28pm

Great thanks!!! It worked.
But how can one know that padding accepts the string value max_length? I went through both of the tokenizer pages: tokenizer and BertTokenizer. Neither of these pages states that padding accepts string values like max_length. Now I am left guessing what else it might accept and where I can find the whole list.

Am I not reading carefully? Can you please tell me how to navigate the docs to find this information, or link to the page that states it?

adorkin · November 22, 2021, 8:58pm

Well, you can call help on an object or a specific method to see more info. For instance, help(tokenizer.__call__) will display the documentation on the method that you’re using in your example. It’s the safest bet, in my opinion. However, the implementation of the method is inherited from PreTrainedTokenizerBase and accordingly the related docs can be found here.

That said, I do agree with you that not seeing this info on the child classes’ pages may be quite confusing.
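
The inheritance point can be checked from Python itself. Below is a small sketch using two hypothetical stand-in classes (BaseTokenizer and ChildTokenizer are made up for illustration and are not the transformers classes): it shows how to find out which class actually defines a method, and that the docstring reachable through the child class is the one written on the base class.

```python
import inspect


class BaseTokenizer:
    """Hypothetical stand-in for PreTrainedTokenizerBase."""

    def __call__(self, text, padding=False):
        """Tokenize text.

        padding: True or 'longest', 'max_length', False or 'do_not_pad'.
        """
        return text.split()


class ChildTokenizer(BaseTokenizer):
    """Hypothetical stand-in for BertTokenizer; it defines no __call__ of its own."""


# __qualname__ names the class in which the method is actually defined.
defining_class = ChildTokenizer.__call__.__qualname__.split(".")[0]
print(defining_class)

# inspect.getdoc returns the docstring text that help() would display.
print(inspect.getdoc(ChildTokenizer.__call__))
```

With transformers installed, the same check on BertTokenizer.__call__ should point at PreTrainedTokenizerBase, which tells you whose documentation page to open.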

RajSingh333 · November 22, 2021, 10:30pm

Yes and no.
Keeping inherited members out of a child class’s docs might reduce clutter.
But showing them also serves a purpose: you can see at a glance everything available on a class without navigating the whole class hierarchy. A simple switch to show/hide inherited members would be useful… I guess Javadoc follows this approach.
