My goal is to add a set of word-start tokens to a pre-trained AlbertTokenizerFast.
In the Albert pre-trained vocab (a SentencePiece model), every word-initial token is prefixed with the meta-symbol ▁ (e.g. ▁hamburger).
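For context, the ▁ convention can be sketched in plain Python. This is a simplified model of SentencePiece's "metaspace" substitution, not the actual library code:

```python
# Simplified sketch of SentencePiece's "metaspace" convention:
# spaces are replaced with the meta-symbol ▁ before tokenization,
# and decoding reverses the substitution.

META = "\u2581"  # the ▁ meta-symbol

def to_metaspace(text: str) -> str:
    """Prepend the meta-symbol and replace each space with it."""
    return META + text.replace(" ", META)

def from_metaspace(pieces: str) -> str:
    """Invert: turn meta-symbols back into spaces, drop the leading one."""
    return pieces.replace(META, " ").lstrip(" ")

encoded = to_metaspace("This hamburger tastes great")
# encoded == "▁This▁hamburger▁tastes▁great"
assert from_metaspace(encoded) == "This hamburger tastes great"
```

So a space only survives round-tripping if some piece carries a ▁ for it, which is why I expected the prefixed tokens below to work.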
I first tried adding tokens prefixed with the meta-symbol:
new_tokens = [AddedToken("▁hamburger"), AddedToken("▁pizza")]
num_added_tokens = tokenizer.add_tokens(new_tokens)
However, as this forum post shows, the text passed to AddedToken is treated literally, so manually adding the meta-symbol prefix doesn't achieve the desired effect.
Instead, I tried using the single_word parameter:
new_tokens = [
    AddedToken("hamburger", single_word=True, lstrip=True),
    AddedToken("pizza", single_word=True, lstrip=True),
]
num_added_tokens = tokenizer.add_tokens(new_tokens)
This successfully encodes the new tokens, with hamburger mapped to token id 30001:

tokenizer('This hamburger tastes great')
>> [2, 15, 30001, 53, 8, 345, 3]
However, when I decode these ids, no space appears between “This” and “hamburger”:

tokenizer.decode([2, 15, 30001, 53, 8, 345, 3])
>> 'Thishamburger tastes great'
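My working theory (an assumption on my part, not something I've confirmed in the tokenizers source): with lstrip=True the match consumes the whitespace to the left of the added token, and since the added token itself carries no ▁ prefix, nothing re-inserts that space at decode time. A stdlib sketch of that decode path, using a simplified metaspace-style decoder:

```python
# Hypothetical sketch of why the space disappears on decode.
# Vocab pieces carry the ▁ meta-symbol, which the decoder turns back into
# a space; an added token matched with lstrip=True has swallowed the
# preceding space and carries no ▁ of its own.

META = "\u2581"  # the ▁ meta-symbol

def decode(pieces):
    """Concatenate pieces, turning each ▁ into a space (Metaspace-style)."""
    return "".join(pieces).replace(META, " ").lstrip(" ")

# What I believe the tokenizer produces: "hamburger" has no ▁ prefix.
observed = [META + "This", "hamburger", META + "tastes", META + "great"]
print(decode(observed))   # reproduces the missing space

# What would round-trip correctly: a ▁-prefixed piece for "hamburger".
desired = [META + "This", META + "hamburger", META + "tastes", META + "great"]
print(decode(desired))
```

Under this model, the fix would need the added token to either carry the ▁ itself or have the decoder restore the stripped space, which is exactly what I haven't found a supported way to do.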
I was wondering if anybody had any thoughts about how to fix this.