If I want to create a tokenizer from scratch, should the special tokens use square brackets with uppercase letters, or angle brackets with lowercase letters?
That is, should cls_token be [CLS] or <cls>?
BertTokenizerFast uses the former (for example [CLS], [SEP], and [MASK]),
while many other examples on the Internet use the latter.
It could be anything you want it to be since you’re making the tokenizer from scratch.
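To make that concrete, here is a minimal sketch using the `tokenizers` library. It trains a tiny word-level tokenizer twice, once with each naming style. The helper `build_tokenizer` and the two-sentence corpus are made up for illustration; only the special-token strings differ between the two runs.

```python
from tokenizers import Tokenizer
from tokenizers.models import WordLevel
from tokenizers.pre_tokenizers import Whitespace
from tokenizers.trainers import WordLevelTrainer

# Toy corpus, purely illustrative.
CORPUS = ["hello world", "hello tokenizer"]


def build_tokenizer(special_tokens, unk_token):
    """Hypothetical helper: train a word-level tokenizer with the given special tokens."""
    tokenizer = Tokenizer(WordLevel(unk_token=unk_token))
    tokenizer.pre_tokenizer = Whitespace()
    # The trainer registers these strings as special tokens, whatever they look like.
    trainer = WordLevelTrainer(special_tokens=special_tokens)
    tokenizer.train_from_iterator(CORPUS, trainer=trainer)
    return tokenizer


# BERT-style naming: square brackets, uppercase.
bracket_tok = build_tokenizer(["[PAD]", "[UNK]", "[CLS]", "[SEP]", "[MASK]"], "[UNK]")

# Alternative naming: angle brackets, lowercase.
angle_tok = build_tokenizer(["<pad>", "<unk>", "<cls>", "<sep>", "<mask>"], "<unk>")

print(bracket_tok.encode("[CLS] hello world [SEP]").tokens)
print(angle_tok.encode("<cls> hello world <sep>").tokens)
```

Either way, each special token is kept as a single token instead of being split on its brackets. What matters is that the strings are used consistently and are unlikely to occur in your normal text.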
Thanks. Also, are bos_token and
eos_token necessary? In BERT's implementation, these two tokens are not defined.
They're not needed for BERT, which does not generate text autoregressively. So it depends on your application, I guess.
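As a rough illustration of why an autoregressive model wants an end-of-sequence token, here is a toy decoding loop in plain Python. The "model" is just a hypothetical lookup table (`NEXT_TOKEN`), and the token strings `<bos>` and `<eos>` are arbitrary choices, not anything a library requires.

```python
BOS = "<bos>"
EOS = "<eos>"

# Hypothetical stand-in for a language model: maps the previous token to the next one.
NEXT_TOKEN = {BOS: "hello", "hello": "world", "world": EOS}


def generate(max_new_tokens=10):
    """Greedy decoding that stops when the model emits EOS or the budget runs out."""
    tokens = [BOS]
    for _ in range(max_new_tokens):
        next_token = NEXT_TOKEN[tokens[-1]]
        tokens.append(next_token)
        if next_token == EOS:
            # Without an EOS token, the only stopping rule would be max_new_tokens.
            break
    return tokens


print(generate())  # ['<bos>', 'hello', 'world', '<eos>']
```

An encoder-only model like BERT never runs a loop like this, which is why its tokenizer gets by with [CLS] and [SEP] alone.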