Hello,
I am working with a pretrained tokenizer (MiriUll/gpt2-wechsel-german_easy) that has bos_token and eos_token set. However, even after adding a custom post-processing step, these special tokens do not show up in the tokenization output.
Please have a look at the following code:
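Roughly the following (a sketch of the setup; the exact snippet may have looked slightly different):

from transformers import AutoTokenizer
from tokenizers.processors import TemplateProcessing

tokenizer = AutoTokenizer.from_pretrained("MiriUll/gpt2-wechsel-german_easy")

# attach a custom post-processing template that should add BOS/EOS around every sequence
tokenizer.post_processor = TemplateProcessing(
    single=tokenizer.bos_token + " $A " + tokenizer.eos_token,
    special_tokens=[
        (tokenizer.eos_token, tokenizer.eos_token_id),
        (tokenizer.bos_token, tokenizer.bos_token_id),
    ],
)

print(tokenizer("Heute")["input_ids"])  # no BOS/EOS ids in the output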
I would expect the tokenizer to encode the input as [50258, 7155, 50257], i.e., “<|BOS|> Heute <|EOS|>”.
Where is the error here? How can I tell the tokenizer to add these special tokens so that the model can learn to predict the EOS token?
After quite some time, I was able to solve this problem.
So in general, the problem was that the post_processor functionality only exists for tokenizers from the tokenizers library, not for those from the transformers library. However, Python allows adding new attributes to an object at almost any time, so I did not get an error message telling me that post_processor was unknown (and therefore never used).
So if you want to add the custom post-processing anyway, you can do it like this:
from transformers import AutoTokenizer, GPT2TokenizerFast
from tokenizers import Tokenizer
from tokenizers.processors import TemplateProcessing

model_string = "MiriUll/gpt2-wechsel-german_easy"
bos = '<|bos|>'
eos = '<|eos|>'
pad = '<|pad|>'
special_tokens_dict = {'eos_token': eos, 'bos_token': bos, 'pad_token': pad}

tokenizer_orig = AutoTokenizer.from_pretrained(model_string)  # transformers library
tokenizer_orig.add_special_tokens(special_tokens_dict)  # with this, you don't have to manually define the new tokens' ids

tokenizer = Tokenizer.from_pretrained(model_string)  # tokenizers library
tokenizer.post_processor = TemplateProcessing(
    single=bos + " $A " + eos,
    special_tokens=[(eos, tokenizer_orig.eos_token_id), (bos, tokenizer_orig.bos_token_id)],
)

tokenizer = GPT2TokenizerFast(tokenizer_object=tokenizer)  # transformers library again, but now with post-processing
num_added_toks = tokenizer.add_special_tokens(special_tokens_dict)
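With this in place, encoding should now include the special tokens (the exact IDs depend on the base vocabulary, but for this checkpoint they should match the ones from the question):

print(tokenizer("Heute")["input_ids"])  # expected: [50258, 7155, 50257], i.e. bos, "Heute", eos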
P.S. remember to also update the model’s embedding size afterwards, e.g. with this command:
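For a transformers model that would be resize_token_embeddings (assuming model is the loaded PreTrainedModel):

model.resize_token_embeddings(len(tokenizer))  # grow the embedding matrix to cover the newly added special tokens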
The transformers library tokenizer has a tokenizer._tokenizer attribute, which is a tokenizer from the tokenizers library. So you only need to change the post-processor of this object.
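A minimal sketch of that variant (assuming the checkpoint loads as a fast tokenizer with bos/eos already set, as in the question):

from transformers import AutoTokenizer
from tokenizers.processors import TemplateProcessing

tokenizer = AutoTokenizer.from_pretrained("MiriUll/gpt2-wechsel-german_easy")

# same template as before, but assigned to the backend tokenizers.Tokenizer that is actually used
tokenizer._tokenizer.post_processor = TemplateProcessing(
    single=tokenizer.bos_token + " $A " + tokenizer.eos_token,
    special_tokens=[
        (tokenizer.eos_token, tokenizer.eos_token_id),
        (tokenizer.bos_token, tokenizer.bos_token_id),
    ],
)

print(tokenizer("Heute")["input_ids"])  # the BOS/EOS ids now appear in the output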