How to efficiently tokenize unknown tokens in GPT2

I am trying to train a dialog system using GPT2. For tokenization, I am using the following configuration to add the special tokens.

from transformers import (
    AdamW,
    AutoConfig,
    AutoTokenizer,
    PreTrainedModel,
    PreTrainedTokenizer,
    get_linear_schedule_with_warmup,
)

SPECIAL_TOKENS = {
    "bos_token": "<|endoftext|>",
    "eos_token": "<|endoftext|>",
    "pad_token": "[PAD]",
    "additional_special_tokens": ["[SYS]", "[USR]", "[KG]", "[SUB]", "[PRED]", "[OBJ]", "[TRIPLE]", "[SEP]", "[Q]","[DOM]"]
}
tokenizer = AutoTokenizer.from_pretrained(args.model_name_or_path)
tokenizer.add_special_tokens(SPECIAL_TOKENS)
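
As a quick sanity check (a minimal sketch using the tokenizer configured above), the added markers themselves do receive their own new ids:

print(tokenizer.convert_tokens_to_ids(["[SUB]", "[PRED]"]))
# e.g. [50261, 50262] here, i.e. ids beyond GPT2's original vocabulary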

Next, when I tokenize a sequence (a dialog utterance) and later convert it into ids, some of the special tokens in my sequence get mapped to the unknown token, so their ids become the same as those of bos and eos: all three map to <|endoftext|>, as in GPT2's source code.

Here is a working example -

tokenized_sequence = ['[PRED]', 'name', '[SUB]', 'frankie_and_bennys', '[PRED]', 'address', '[SUB]', 'cambridge_leisure_park_clifton_way_cherry_hinton', '[PRED]', 'area', '[SUB]', 'south', '[PRED]', 'food', '[SUB]', 'italian', '[PRED]', 'phone', '[SUB]', '01223_412430', '[PRED]', 'pricerange', '[SUB]', 'expensive', '[PRED]', 'postcode', '[SUB]', 'cb17dy']
special_tokens = ['frankie_and_bennys','cambridge_leisure_park_clifton_way_cherry_hinton','italian','postcode', 'cb17dy']
tokens_to_ids = [50262, 3672, 50261, 50256, 50262, 21975, 50261, 50256, 50262, 20337, 50261, 35782, 50262, 19425, 50261, 50256, 50262, 4862, 50261, 50256, 50262, 50256, 50261, 22031, 50262, 50256, 50261, 50256]
ids_to_tokens = [PRED]name[SUB]<|endoftext|>[PRED]address[SUB]<|endoftext|>[PRED]area[SUB]south[PRED]food[SUB]<|endoftext|>[PRED]phone[SUB]<|endoftext|>[PRED]<|endoftext|>[SUB]expensive[PRED]<|endoftext|>[SUB]<|endoftext|>

As you can see, the tokens in special_tokens are being mapped to the id 50256 (that is, to <|endoftext|>), so the model fails to see and learn these important tokens and hence generates very poor and often hallucinated responses.
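
A quick way to confirm this programmatically (a minimal sketch, reusing the tokenizer and tokenized_sequence from above) is to compare each token's id against the unk id:

unk_id = tokenizer.convert_tokens_to_ids(tokenizer.unk_token)  # 50256 for GPT2
unknown = [t for t in tokenized_sequence if tokenizer.convert_tokens_to_ids(t) == unk_id]
print(unknown)  # contains 'frankie_and_bennys', 'cb17dy', ...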

What could be a quick and efficient fix for this issue?

Note - I have a large set of such special tokens in my corpus.
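
For reference, the obvious workaround I can see is to register every such entity string as an ordinary added token and resize the model's embedding matrix, roughly as sketched below (the model class and the entity_tokens list are just for illustration). With a large set of entities this adds one embedding row per entity, which is why I am asking for something quicker and more efficient.

from transformers import AutoModelForCausalLM

entity_tokens = ["frankie_and_bennys", "cambridge_leisure_park_clifton_way_cherry_hinton", "cb17dy"]  # ... large list in my case
tokenizer.add_tokens(entity_tokens)            # returns the number of tokens actually added
model = AutoModelForCausalLM.from_pretrained(args.model_name_or_path)
model.resize_token_embeddings(len(tokenizer))  # make room for the new ids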