Hi All,
My goal is to add a set of starting tokens to a pre-trained AlbertTokenizerFast.
In the ALBERT pre-trained vocab (a SentencePiece model), every word-initial token is preceded by the meta-symbol ▁ (e.g. ▁hamburger).
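To illustrate, here's a minimal sketch (assuming the albert-base-v2 checkpoint; the exact pieces depend on the checkpoint):

from transformers import AlbertTokenizerFast

tokenizer = AlbertTokenizerFast.from_pretrained("albert-base-v2")
print(tokenizer.tokenize("This hamburger tastes great"))
# word-initial pieces carry the meta-symbol, e.g. ['▁this', '▁ham', 'burger', ...]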
I tried adding tokens prefixed with the meta-symbol:

from transformers import AddedToken

new_tokens = [AddedToken("▁hamburger"), AddedToken("▁pizza")]
num_added_tokens = tokenizer.add_tokens(new_tokens)
However, as this forum post shows, the input text to AddedToken is treated literally, so manually adding the meta-symbol prefix doesn't achieve the desired effect.
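A quick check makes this visible (continuing from the snippet above):

print(tokenizer.tokenize("This hamburger tastes great"))
# the raw input never contains a literal ▁ character, so "▁hamburger"
# never matches, and "hamburger" still splits into regular SentencePiece pieces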
Instead, I tried using the single_word and lstrip parameters:
new_tokens = [AddedToken("hamburger", single_word=True, lstrip=True), AddedToken("pizza", single_word=True, lstrip=True)]
num_added_tokens = tokenizer.add_tokens(new_tokens)
This approach successfully encodes the new tokens; hamburger is now mapped to token id 30001:
tokenizer.encode('This hamburger tastes great')
>> [2, 15, 30001, 53, 8, 345, 3]
However, when I try to decode these ids, no space appears between 'This' and 'hamburger':
tokenizer.decode([2, 15, 30001, 53, 8, 345, 3])
>> 'Thishamburger tastes great'
I was wondering if anybody had any thoughts about how to fix this.
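In case it helps, here's a workaround sketch I've been experimenting with: rebuild the text from tokens, re-attaching the meta-symbol to added tokens so the decoder restores the missing space. I'm not sure this is the intended way to handle it:

# ids from the example above
ids = [2, 15, 30001, 53, 8, 345, 3]
tokens = tokenizer.convert_ids_to_tokens(ids)
added = set(tokenizer.get_added_vocab())      # added tokens like "hamburger"
specials = set(tokenizer.all_special_tokens)  # [CLS], [SEP], etc.

# prepend ▁ to added tokens so they decode with a leading space,
# and drop the special tokens
fixed = ["▁" + t if t in added else t for t in tokens if t not in specials]
print(tokenizer.convert_tokens_to_string(fixed).strip())
# should give something like 'This hamburger tastes great'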