Hi,
I am puzzled about whether BERT's 512-token limit is counted at the character level or the word level.
All the resources I've found say it's word level.
I would appreciate it if someone could fill me in.
Best,
Mosh
BERT's tokenizer, WordPiece, is a subword tokenizer.
See the example in Summary of the tokenizers.
BERT uses a subword tokenizer (WordPiece), so the maximum length corresponds to 512 subword tokens. See the example below, in which the input sentence has eight words but the tokenizer generates a sequence of length nine. The extra token comes from the word "outstandingly", which the subword tokenizer represents with two tokens: "outstanding" and "##ly".
>>> from transformers import AutoTokenizer
>>> tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
>>> tokenizer.tokenize("The outstandingly big cat sat on the mat")
['the', 'outstanding', '##ly', 'big', 'cat', 'sat', 'on', 'the', 'mat']
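One more detail worth noting (not stated above, but easy to verify): the 512-token budget also covers the special tokens [CLS] and [SEP] that BERT adds around the input. Continuing the same session, a minimal sketch to check the full sequence length:
>>> encoded = tokenizer("The outstandingly big cat sat on the mat")
>>> tokenizer.convert_ids_to_tokens(encoded["input_ids"])
['[CLS]', 'the', 'outstanding', '##ly', 'big', 'cat', 'sat', 'on', 'the', 'mat', '[SEP]']
>>> len(encoded["input_ids"])  # 9 subword tokens plus [CLS] and [SEP]
11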
Thank you for the replies, guys.
So let's say I have a 512-word sentence as input, and some of the words get tokenized into multiple pieces, like 'outstandingly' → 'outstanding', '##ly'.
So I'm assuming some words at the end of the sentence will be truncated after tokenization?
Best,
Mosh
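To see this concretely, here is a minimal sketch (truncation=True and max_length are standard Hugging Face tokenizer arguments; the repeated-word input is just a synthetic example). With truncation enabled, the tokenizer keeps only the first 512 subword tokens, special tokens included, and drops the rest from the end:
>>> from transformers import AutoTokenizer
>>> tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
>>> text = " ".join(["outstandingly"] * 512)  # 512 words, each split into two subwords
>>> encoded = tokenizer(text, truncation=True, max_length=512)
>>> len(encoded["input_ids"])  # capped at 512, with [CLS] and [SEP] counted
512
>>> tokenizer.convert_ids_to_tokens(encoded["input_ids"][-3:])  # the trailing words were dropped
['outstanding', '##ly', '[SEP]']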