Should cls_token be [CLS] or <cls>?

raptorkwok · October 11, 2023, 7:48am

If I want to create a tokenizer from scratch, should the special tokens use square brackets with uppercase letters, or angle brackets with lowercase letters?

That is, should cls_token be [CLS] or <cls>?

BertTokenizerFast uses the former one:

github.com

huggingface/transformers/blob/599db139f921f3af535052c860cb685cadae6fcd/src/transformers/tokenization_bert.py#L547

          def __init__(
              self,
              vocab_file,
              do_lower_case=True,
              do_basic_tokenize=True,
              never_split=None,
              unk_token="[UNK]",
              sep_token="[SEP]",
              pad_token="[PAD]",
              cls_token="[CLS]",
              mask_token="[MASK]",
              tokenize_chinese_chars=True,
              max_length=None,
              pad_to_max_length=False,
              stride=0,
              truncation_strategy="longest_first",
              add_special_tokens=True,
              **kwargs
          ):
              super(BertTokenizerFast, self).__init__(

while many other examples on the Internet use the latter.

Sandy1857 · October 11, 2023, 8:34am

It could be anything you want it to be since you’re making the tokenizer from scratch.
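
To make that concrete, here is a minimal sketch using the `tokenizers` library, assuming a toy word-level model and a two-sentence corpus. The helper name `build_tokenizer` and the corpus are made up for illustration. The same code is run once with bracket-style and once with angle-style special tokens; the library treats them as plain strings either way.

```python
from tokenizers import Tokenizer
from tokenizers.models import WordLevel
from tokenizers.pre_tokenizers import Whitespace
from tokenizers.processors import TemplateProcessing
from tokenizers.trainers import WordLevelTrainer


def build_tokenizer(corpus, cls_token, sep_token, unk_token, pad_token):
    """Hypothetical helper: train a tiny word-level tokenizer whose
    special tokens are whatever strings the caller passes in."""
    tokenizer = Tokenizer(WordLevel(unk_token=unk_token))
    tokenizer.pre_tokenizer = Whitespace()

    # Special tokens are registered with the trainer so they get vocabulary ids.
    trainer = WordLevelTrainer(
        special_tokens=[pad_token, unk_token, cls_token, sep_token],
        show_progress=False,
    )
    tokenizer.train_from_iterator(corpus, trainer=trainer)

    # BERT-style layout: <cls> sentence <sep>, using the chosen spellings.
    tokenizer.post_processor = TemplateProcessing(
        single=f"{cls_token} $A {sep_token}",
        special_tokens=[
            (cls_token, tokenizer.token_to_id(cls_token)),
            (sep_token, tokenizer.token_to_id(sep_token)),
        ],
    )
    return tokenizer


# Toy corpus, assumed purely for the example.
corpus = ["hello world", "hello tokenizer"]

bracket_tok = build_tokenizer(corpus, "[CLS]", "[SEP]", "[UNK]", "[PAD]")
angle_tok = build_tokenizer(corpus, "<cls>", "<sep>", "<unk>", "<pad>")

bracket_tokens = bracket_tok.encode("hello world").tokens
angle_tokens = angle_tok.encode("hello world").tokens
print(bracket_tokens)
print(angle_tokens)
```

The only thing that matters is consistency: the strings registered with the trainer, used in the post-processor, and later passed as `cls_token=...` and so on to a `transformers` tokenizer wrapper should all match.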

raptorkwok · October 11, 2023, 8:44am

Thanks. Also, are bos_token and eos_token necessary? In BERT's implementation, these two tokens are not defined.

Sandy1857 · October 11, 2023, 9:23am

They're not needed for BERT, which does not generate text autoregressively. So it depends on your application, I guess.
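
As a rough illustration of why an autoregressive model does want them, here is a toy greedy decoding loop. `TOY_NEXT` is a hypothetical lookup table standing in for a real language model, and the token spellings `<bos>` / `<eos>` are arbitrary choices. The point is only that generation needs something to start from and a signal for when to stop.

```python
BOS, EOS = "<bos>", "<eos>"

# Hypothetical stand-in for a language model: maps the last token
# to the "most likely" next token.
TOY_NEXT = {BOS: "hello", "hello": "world", "world": EOS}


def generate(max_new_tokens=10):
    """Greedy decoding sketch: start from BOS, stop once EOS is produced."""
    tokens = [BOS]  # bos_token gives the model a first token to condition on
    for _ in range(max_new_tokens):
        next_token = TOY_NEXT[tokens[-1]]
        tokens.append(next_token)
        if next_token == EOS:  # eos_token marks the end of the sequence
            break
    return tokens


generated = generate()
print(generated)
```

An encoder-only model like BERT never runs a loop like this, so its `[CLS]` and `[SEP]` tokens cover everything it needs.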
