Is the 512 limit in BERT counted at the token level or the character level?

BERT uses a subword tokenizer (WordPiece), so the maximum length corresponds to 512 subword tokens, not characters or whole words. (Note that the special tokens [CLS] and [SEP], which BERT adds around every sequence, also count toward that limit.) See the example below: the input sentence has eight words, but the tokenizer produces nine tokens. The extra token comes from the word outstandingly, which the subword tokenizer splits into two pieces: outstanding and ##ly.

>>> from transformers import AutoTokenizer

>>> tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
>>> tokenizer.tokenize("The outstandingly big cat sat on the mat")
['the', 'outstanding', '##ly', 'big', 'cat', 'sat', 'on', 'the', 'mat']
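To make the splitting behavior concrete, here is a minimal sketch of WordPiece's greedy longest-match-first algorithm, using a tiny hypothetical vocabulary (the real bert-base-uncased vocabulary has ~30k entries; this is an illustration, not the library's implementation):

```python
def wordpiece_tokenize(word, vocab):
    """Greedy longest-match-first subword split, WordPiece-style.

    Repeatedly take the longest prefix of the remaining characters
    that is in the vocabulary; non-initial pieces get a '##' prefix.
    """
    tokens = []
    start = 0
    while start < len(word):
        end = len(word)
        piece = None
        while start < end:
            sub = word[start:end]
            if start > 0:
                sub = "##" + sub  # continuation pieces are marked with ##
            if sub in vocab:
                piece = sub
                break
            end -= 1  # shrink the candidate and try again
        if piece is None:
            return ["[UNK]"]  # no piece matched: the whole word is unknown
        tokens.append(piece)
        start = end
    return tokens

# Tiny toy vocabulary, assumed for this example only
vocab = {"the", "outstanding", "##ly", "big", "cat", "sat", "on", "mat"}
print(wordpiece_tokenize("outstandingly", vocab))  # ['outstanding', '##ly']
```

This is why the token count can exceed the word count: a single out-of-vocabulary word contributes as many tokens as the pieces needed to cover it, and all of those pieces count toward the 512-token limit.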