BERT uses a subword tokenizer (WordPiece), so the maximum length corresponds to 512 subword tokens. In the example below, the input sentence has eight words, but the tokenizer generates a sequence of nine tokens. The extra token comes from the word outstandingly, which the subword tokenizer splits into two tokens: outstanding and ##ly.
>>> from transformers import AutoTokenizer
>>> tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
>>> tokenizer.tokenize("The outstandingly big cat sat on the mat")
['the', 'outstanding', '##ly', 'big', 'cat', 'sat', 'on', 'the', 'mat']
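Note that tokenize returns only the subword tokens. When the tokenizer is called directly on the text, it also prepends the special token [CLS] and appends [SEP], so the encoded sequence is two tokens longer than the tokenized one. As a short continuation of the example above (assuming the same tokenizer object), the 512-token limit is exposed as model_max_length, and passing truncation=True clips longer inputs to that limit:

>>> encoding = tokenizer("The outstandingly big cat sat on the mat", truncation=True)
>>> tokenizer.convert_ids_to_tokens(encoding["input_ids"])
['[CLS]', 'the', 'outstanding', '##ly', 'big', 'cat', 'sat', 'on', 'the', 'mat', '[SEP]']
>>> tokenizer.model_max_length
512

Because the special tokens count toward the limit, only 510 subword tokens of actual text fit in a single BERT input.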