Is BERT's 512-token limit at the token or character level?

I am really puzzled about whether BERT's 512-token limit is counted at the character level or the word level.

Most of the resources I've seen say it's word level.

Would appreciate it if someone could fill me in.


BERT’s tokenizer, WordPiece, is a subword tokenizer.

See the examples in Summary of the tokenizers


BERT uses a subword tokenizer (WordPiece), so the maximum length corresponds to 512 subword tokens. See the example below, in which the input sentence has eight words, but the tokenizer generates a sequence with length equal to nine. Note that the extra token in this example is due to the word outstandingly, which the subword tokenizer represents using two tokens: outstanding and ##ly.

>>> from transformers import AutoTokenizer

>>> tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
>>> tokenizer.tokenize("The outstandingly big cat sat on the mat")
['the', 'outstanding', '##ly', 'big', 'cat', 'sat', 'on', 'the', 'mat']
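One more detail worth knowing: `tokenizer.tokenize` returns only the subword tokens, but when you encode an input end to end, the tokenizer also adds the special tokens `[CLS]` and `[SEP]`, and those two count toward the 512 limit as well. A minimal check (assuming `transformers` is installed):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

# tokenize() returns only the subword tokens (9 for this sentence)...
tokens = tokenizer.tokenize("The outstandingly big cat sat on the mat")
print(len(tokens))  # 9

# ...while a full encode also adds [CLS] and [SEP], for 11 ids total.
ids = tokenizer("The outstandingly big cat sat on the mat")["input_ids"]
print(len(ids))  # 11
print(tokenizer.convert_ids_to_tokens(ids))  # starts with [CLS], ends with [SEP]
```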

Thank you for the replies!

So let's say I have a 512-word sentence as input, and some words get tokenized into multiple pieces, like 'outstandingly' → 'outstanding', '##ly'.

I'm assuming some words at the end of the sentence will be truncated after tokenization?
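Whether anything gets cut depends on how you call the tokenizer: by default nothing is truncated (the tokenizer will happily return more than 512 tokens, and the model would then fail), and you opt in with `truncation=True`. A quick sketch of both behaviors (assuming `transformers` is installed, and using a word that is a single WordPiece token):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

# 600 words, each of which is a single WordPiece token in this vocab.
long_text = " ".join(["cat"] * 600)

# Without truncation the tokenizer returns every subword token,
# even though BERT itself can only handle 512 of them.
untruncated = tokenizer(long_text)["input_ids"]
print(len(untruncated))  # 602: 600 subwords + [CLS] + [SEP]

# With truncation=True the sequence is cut down to max_length,
# dropping tokens from the end (a final [SEP] is still kept).
truncated = tokenizer(long_text, truncation=True, max_length=512)["input_ids"]
print(len(truncated))  # 512
```

So yes: with truncation enabled, it is the subword tokens past the limit that are dropped, which in practice means the words at the end of your input.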