GPT2Tokenizer not adding bos/eos tokens

Hello,
I am working with a pretrained tokenizer (MiriUll/gpt2-wechsel-german_easy on the Hugging Face Hub) that has bos_token and eos_token set. However, even after adding a custom post-processor, it does not add these special tokens to the tokenization output.
Please have a look at the following code:

from transformers import AutoTokenizer
from tokenizers.processors import TemplateProcessing

tokenizer = AutoTokenizer.from_pretrained("MiriUll/gpt2-wechsel-german_easy")
print(tokenizer.eos_token, tokenizer.eos_token_id)  # prints "<|EOS|> 50257" as expected
bos = tokenizer.bos_token
eos = tokenizer.eos_token
tokenizer.post_processor = TemplateProcessing(
    single=bos + " $A " + eos,
    pair=bos + " $A $B " + eos,
    special_tokens=[(eos, tokenizer.eos_token_id), (bos, tokenizer.bos_token_id)],
)
input_text = "Heute"
encoding = tokenizer(input_text, return_tensors="pt", add_special_tokens=True)["input_ids"]
print(encoding)  # prints tensor([[7155]]) -- no bos/eos token added

I would expect the tokenizer to encode the input as [50258, 7155, 50257], i.e., “<|BOS|> Heute <|EOS|>”.
Where is the error here? How can I tell the tokenizer to add these special tokens so that the model can learn to predict the eos token?

After quite some time, I was able to solve this problem.
In general, the problem was that the post_processor attribute only exists on tokenizers from the tokenizers library, not on those from the transformers library. However, Python lets you attach new attributes to an object at almost any time, so I never got an error message telling me that post_processor was unknown (and therefore never used).
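
You can see this silent failure in a minimal sketch (the string assignment is hypothetical, using the same model as above): Python happily attaches the new attribute, but nothing in transformers ever reads it, so the encoding stays unchanged.

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("MiriUll/gpt2-wechsel-german_easy")
tokenizer.post_processor = "anything at all"  # no error: Python simply creates a new, unused attribute
print(tokenizer("Heute")["input_ids"])  # still prints the ids without bos/eos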

So if you want to add the custom post-processing anyway, you can do it like this:

from tokenizers import Tokenizer
from tokenizers.processors import TemplateProcessing
from transformers import AutoTokenizer, GPT2TokenizerFast

model_string = "MiriUll/gpt2-wechsel-german_easy"
bos = '<|bos|>'
eos = '<|eos|>'
pad = '<|pad|>'
special_tokens_dict = {'eos_token': eos, 'bos_token': bos, 'pad_token': pad}

tokenizer_orig = AutoTokenizer.from_pretrained(model_string)  # transformers library
tokenizer_orig.add_special_tokens(special_tokens_dict)  # with this, you don't have to define the new tokens' ids manually

tokenizer = Tokenizer.from_pretrained(model_string)  # tokenizers library
tokenizer.post_processor = TemplateProcessing(
    single=bos + " $A " + eos,
    special_tokens=[(eos, tokenizer_orig.eos_token_id), (bos, tokenizer_orig.bos_token_id)],
)

tokenizer = GPT2TokenizerFast(tokenizer_object=tokenizer)  # transformers library again, but now with post-processing
num_added_toks = tokenizer.add_special_tokens(special_tokens_dict)
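
With the post-processor in place, encoding a sample input should now be wrapped in the special tokens. A quick sanity check (the exact ids depend on the vocabulary):

encoding = tokenizer("Heute")["input_ids"]
print(encoding)                    # e.g. [bos_id, 7155, eos_id]
print(tokenizer.decode(encoding))  # e.g. "<|bos|> Heute <|eos|>"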

P.S. Remember to also update the model's embedding size afterwards, e.g. with this command:

model.resize_token_embeddings(len(tokenizer))
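
For example, with a causal language model loaded from the same checkpoint (a sketch, assuming the model_string and tokenizer from above):

from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(model_string)
model.resize_token_embeddings(len(tokenizer))  # grow the embedding matrix to cover the newly added special tokens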

I found an easier way to achieve the same thing.

A tokenizer from the transformers library keeps its underlying tokenizers-library tokenizer in the tokenizer._tokenizer attribute, so you only need to change the post-processor of that object.

from transformers import AutoTokenizer
from tokenizers.processors import TemplateProcessing

model_string = "MiriUll/gpt2-wechsel-german_easy"
tokenizer = AutoTokenizer.from_pretrained(model_string)
tokenizer._tokenizer.post_processor = TemplateProcessing(
    single=tokenizer.bos_token + " $A " + tokenizer.eos_token,
    special_tokens=[(tokenizer.eos_token, tokenizer.eos_token_id), (tokenizer.bos_token, tokenizer.bos_token_id)],
)
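
A quick check that the special tokens now appear (a sketch; the exact ids depend on the model):

ids = tokenizer("Heute")["input_ids"]
assert ids[0] == tokenizer.bos_token_id
assert ids[-1] == tokenizer.eos_token_id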

It really worked. Thanks.