Hello,
I have added custom tokens (emojis) to my tokenizer. This is the code I used to add the new tokens:
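from transformers import AutoModelForMaskedLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)  # assumed: tokenizer loaded from the same checkpoint as the model
model = AutoModelForMaskedLM.from_pretrained(model_checkpoint)
num_added_toks = tokenizer.add_tokens(['👏'])
print('We have added', num_added_toks, 'tokens')
model.resize_token_embeddings(len(tokenizer))  # Note: resize_token_embeddings expects the full size of the new vocabulary, i.e. len(tokenizer)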
Output:
We have added 1 tokens
Embedding(50270, 768)
However, when I tokenize a phrase with this code:
print(tokenizer.tokenize('Congrats 👏'))
I get this output, with a strange 'Ġ' symbol:
['Cong', 'rats', 'Ġ', '👏']
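For context, the 'Ġ' looks like the character that GPT-2/RoBERTa-style byte-level BPE tokenizers use to represent a space. Below is a minimal sketch of that byte-to-character table, assuming my checkpoint uses such a tokenizer (I have not confirmed this). The function `bytes_to_unicode` here is my own standalone reimplementation for illustration, not a call into the library.

```python
def bytes_to_unicode():
    """Sketch of a GPT-2 style byte-to-character table (illustrative reimplementation)."""
    # Bytes that are already printable characters map to themselves.
    printable = (
        list(range(ord("!"), ord("~") + 1))
        + list(range(ord("¡"), ord("¬") + 1))
        + list(range(ord("®"), ord("ÿ") + 1))
    )
    byte_values = printable[:]
    code_points = printable[:]
    shift = 0
    # Every other byte (space, control characters, ...) is shifted to 256 + n
    # so that it gets a visible stand-in character.
    for b in range(256):
        if b not in printable:
            byte_values.append(b)
            code_points.append(256 + shift)
            shift += 1
    return {b: chr(c) for b, c in zip(byte_values, code_points)}


table = bytes_to_unicode()
space_marker = table[ord(" ")]
print(repr(space_marker))  # prints 'Ġ'
```

If that assumption holds, the 'Ġ' in my output would just be the space between 'Congrats' and the emoji, shown as its own token.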