I used the ESM2 model and tried to add a new token using the code below. But the added token is always assigned special token despite with the special_tokens=False option. I tested the code on Bert models and everything is ok. Could be ESM2 specific.
model_checkpoint = “facebook/esm2_t6_8M_UR50D”
model = AutoModelForMaskedLM.from_pretrained(model_checkpoint)
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)
num_added_toks = tokenizer.add_tokens([‘J’],special_tokens=False)
print(“We have added”, num_added_toks, “tokens”)
model.resize_token_embeddings(len(tokenizer))
The vocab output is below:
<bound method EsmTokenizer.get_vocab of EsmTokenizer(name_or_path=‘facebook/esm2_t6_8M_UR50D’, vocab_size=33, model_max_length=1024, is_fast=False, padding_side=‘right’, truncation_side=‘right’, special_tokens={‘eos_token’: ‘’, ‘unk_token’: ‘’, ‘pad_token’: ‘’, ‘cls_token’: ‘’, ‘mask_token’: ‘’, ‘additional_special_tokens’: [‘J’]}, clean_up_tokenization_spaces=True), added_tokens_decoder={
0: AddedToken(“”, rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
1: AddedToken(“”, rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
2: AddedToken(“”, rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
3: AddedToken(“”, rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
32: AddedToken(“”, rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
33: AddedToken(“J”, rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
}>