Is the 512 limit in BERT counted at the token level or the character level?

BERT uses a subword tokenizer (WordPiece), so the maximum length corresponds to 512 subword tokens, not characters or whole words. (Note that the special tokens [CLS] and [SEP], which BERT adds around every sequence, also count toward that limit.) See the example below: the input sentence has eight words, but the tokenizer produces nine tokens. The extra token comes from the word outstandingly, which the subword tokenizer splits into two pieces: outstanding and ##ly.

>>> from transformers import AutoTokenizer

>>> tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
>>> tokenizer.tokenize("The outstandingly big cat sat on the mat")
['the', 'outstanding', '##ly', 'big', 'cat', 'sat', 'on', 'the', 'mat']
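To make the splitting behavior concrete, here is a minimal sketch of WordPiece's greedy longest-match-first algorithm, using a tiny hypothetical vocabulary (the real bert-base-uncased vocabulary has ~30k entries; this is an illustration, not the library's implementation):

```python
def wordpiece_tokenize(word, vocab):
    """Greedy longest-match-first subword split, WordPiece-style.

    Repeatedly take the longest prefix of the remaining characters
    that is in the vocabulary; non-initial pieces get a '##' prefix.
    """
    tokens = []
    start = 0
    while start < len(word):
        end = len(word)
        piece = None
        while start < end:
            sub = word[start:end]
            if start > 0:
                sub = "##" + sub  # continuation pieces are marked with ##
            if sub in vocab:
                piece = sub
                break
            end -= 1  # shrink the candidate and try again
        if piece is None:
            return ["[UNK]"]  # no piece matched: the whole word is unknown
        tokens.append(piece)
        start = end
    return tokens

# Tiny toy vocabulary, assumed for this example only
vocab = {"the", "outstanding", "##ly", "big", "cat", "sat", "on", "mat"}
print(wordpiece_tokenize("outstandingly", vocab))  # ['outstanding', '##ly']
```

This is why the token count can exceed the word count: a single out-of-vocabulary word contributes as many tokens as the pieces needed to cover it, and all of those pieces count toward the 512-token limit.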