I am trying to train a dialog system using GPT-2. For tokenization, I am using the following configuration to add the special tokens.
from transformers import (
    AdamW,
    AutoConfig,
    AutoTokenizer,
    PreTrainedModel,
    PreTrainedTokenizer,
    get_linear_schedule_with_warmup,
)
SPECIAL_TOKENS = {
    "bos_token": "<|endoftext|>",
    "eos_token": "<|endoftext|>",
    "pad_token": "[PAD]",
    "additional_special_tokens": ["[SYS]", "[USR]", "[KG]", "[SUB]", "[PRED]", "[OBJ]",
                                  "[TRIPLE]", "[SEP]", "[Q]", "[DOM]"],
}
tokenizer = AutoTokenizer.from_pretrained(args.model_name_or_path)
tokenizer.add_special_tokens(SPECIAL_TOKENS)
Next, when I tokenize a sequence (a dialog utterance) and later convert it into ids, some of the special tokens in my sequence get mapped to the unknown token, whose id is the same as bos and eos, since all of them map to <|endoftext|> in GPT-2's source code.
Here is a working example:
tokenized_sequence = ['[PRED]', 'name', '[SUB]', 'frankie_and_bennys', '[PRED]', 'address', '[SUB]', 'cambridge_leisure_park_clifton_way_cherry_hinton', '[PRED]', 'area', '[SUB]', 'south', '[PRED]', 'food', '[SUB]', 'italian', '[PRED]', 'phone', '[SUB]', '01223_412430', '[PRED]', 'pricerange', '[SUB]', 'expensive', '[PRED]', 'postcode', '[SUB]', 'cb17dy']
special_tokens = ['frankie_and_bennys','cambridge_leisure_park_clifton_way_cherry_hinton','italian','postcode', 'cb17dy']
tokens_to_ids = [50262, 3672, 50261, 50256, 50262, 21975, 50261, 50256, 50262, 20337, 50261, 35782, 50262, 19425, 50261, 50256, 50262, 4862, 50261, 50256, 50262, 50256, 50261, 22031, 50262, 50256, 50261, 50256]
ids_to_tokens = [PRED]name[SUB]<|endoftext|>[PRED]address[SUB]<|endoftext|>[PRED]area[SUB]south[PRED]food[SUB]<|endoftext|>[PRED]phone[SUB]<|endoftext|>[PRED]<|endoftext|>[SUB]expensive[PRED]<|endoftext|>[SUB]<|endoftext|>
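The two mappings above are produced roughly as follows (a minimal sketch of just the conversion step; ids_to_tokens is shown joined into a single string above, and the rest of the preprocessing is omitted):
# Sketch of the conversion step only
tokens_to_ids = tokenizer.convert_tokens_to_ids(tokenized_sequence)  # tokens -> ids
ids_to_tokens = tokenizer.convert_ids_to_tokens(tokens_to_ids)       # ids -> tokens, for inspection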
As you can see, the special_tokens are being mapped to the id 50256 (that is, to <|endoftext|>), so the model fails to see and learn these important tokens and hence generates very poor and often hallucinated responses.
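For completeness, this is how I check which tokens fall back to the unk id (a small sketch; GPT-2's unk token is also <|endoftext|>, so it shares id 50256 with bos and eos):
# Sketch: list the pre-split tokens that map to the unk id
unk_id = tokenizer.unk_token_id   # 50256 for GPT-2
unknown_tokens = [tok for tok, i in zip(tokenized_sequence, tokens_to_ids) if i == unk_id]
print(unknown_tokens)             # entity values such as 'frankie_and_bennys', 'cb17dy', ...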
What could be a quick and efficient fix for this issue?
Note - I have a large set of such special tokens in my corpus.