Hi,
I am puzzled about whether BERT's 512-token limit is counted at the character level or the word level.
All the resources I've found say it's word level.
I would appreciate it if someone could fill me in.
Best,
Mosh
BERT's tokenizer, WordPiece, is a subword tokenizer.
See the example in Summary of the tokenizers.
BERT uses a subword tokenizer (WordPiece), so the maximum length corresponds to 512 subword tokens. See the example below, in which the input sentence has eight words but the tokenizer generates a sequence of length nine. The extra token comes from the word "outstandingly", which the subword tokenizer represents with two tokens: "outstanding" and "##ly".
>>> from transformers import AutoTokenizer
>>> tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
>>> tokenizer.tokenize("The outstandingly big cat sat on the mat")
['the', 'outstanding', '##ly', 'big', 'cat', 'sat', 'on', 'the', 'mat']
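One more detail worth noting (not stated above, but easy to verify): the 512-token budget also covers the special tokens [CLS] and [SEP] that BERT adds around the input. Continuing the same session, a minimal sketch to check the full sequence length:
>>> encoded = tokenizer("The outstandingly big cat sat on the mat")
>>> tokenizer.convert_ids_to_tokens(encoded["input_ids"])
['[CLS]', 'the', 'outstanding', '##ly', 'big', 'cat', 'sat', 'on', 'the', 'mat', '[SEP]']
>>> len(encoded["input_ids"])  # 9 subword tokens plus [CLS] and [SEP]
11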
Thank you for the replies, guys.
So let's say I have a 512-word sentence as input, and some of the words get tokenized into multiple pieces, like 'outstandingly' → 'outstanding', '##ly'.
So I'm assuming some words at the end of the sentence will be truncated after tokenization?
Best,
Mosh
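To see this concretely, here is a minimal sketch (truncation=True and max_length are standard Hugging Face tokenizer arguments; the repeated-word input is just a synthetic example). With truncation enabled, the tokenizer keeps only the first 512 subword tokens, special tokens included, and drops the rest from the end:
>>> from transformers import AutoTokenizer
>>> tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
>>> text = " ".join(["outstandingly"] * 512)  # 512 words, each split into two subwords
>>> encoded = tokenizer(text, truncation=True, max_length=512)
>>> len(encoded["input_ids"])  # capped at 512, with [CLS] and [SEP] counted
512
>>> tokenizer.convert_ids_to_tokens(encoded["input_ids"][-3:])  # the trailing words were dropped
['outstanding', '##ly', '[SEP]']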