I am trying to add an extra dimension to the word embeddings of a pre-trained Hugging Face BERT model. The extra column represents an extra label. For example, if the original embedding of the word "dog" were [1,1,1,1,1,1,1], I might append a special column holding the label index 2 to represent 'noun', so the new embedding becomes [1,1,1,1,1,1,1,2]. I would then feed this new input [1,1,1,1,1,1,1,2] into the BERT model. How can I do this with Hugging Face?
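For concreteness, here is a rough sketch of what I have in mind, assuming I bypass input_ids and feed vectors directly through the inputs_embeds argument. Since appending a column makes the vectors 769-dimensional, I made up a linear layer that projects them back to BERT's hidden size of 768; the label values and positions are placeholders too:

import torch
import torch.nn as nn
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased")

inputs = tokenizer("The dog barks", return_tensors="pt")
# Look up the pre-trained word embeddings: shape (1, seq_len, 768).
word_embeds = model.get_input_embeddings()(inputs["input_ids"])

# One extra column per token holding the label index (2 = 'noun' here);
# "dog" sits at position 2 ([CLS] at 0, "the" at 1). Values are illustrative.
seq_len = inputs["input_ids"].shape[1]
labels = torch.zeros(1, seq_len, 1)
labels[0, 2, 0] = 2.0

augmented = torch.cat([word_embeds, labels], dim=-1)  # (1, seq_len, 769)

# Made-up projection back to 768 dimensions so BERT accepts the input.
project = nn.Linear(769, 768)
outputs = model(inputs_embeds=project(augmented),
                attention_mask=inputs["attention_mask"])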
There is something called tokenizer.add_special_tokens that extends the original vocabulary with new tokens. However, I want to combine the embedding from the original vocabulary with the embedding of the new label token. For example, I want the BERT model to understand that "dog" is a noun by connecting the embedding of "dog" to the embedding of "noun" (I sketch one attempt at this after the example below). Should I even change the input word embeddings of a pre-trained model? Or should I somehow strengthen the attention between "dog" and "noun" in the middle layers?
Here is an example of using tokenizer.add_special_tokens:
from transformers import GPT2Model, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained('gpt2')
model = GPT2Model.from_pretrained('gpt2')

# Register a new CLS-style special token and grow the embedding matrix to match.
special_tokens_dict = {'cls_token': '<CLS>'}
num_added_toks = tokenizer.add_special_tokens(special_tokens_dict)
print('We have added', num_added_toks, 'tokens')
model.resize_token_embeddings(len(tokenizer))
assert tokenizer.cls_token == '<CLS>'
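And here is my rough attempt at the "connect dog to noun" idea mentioned above: register POS tags as new tokens, resize the embedding matrix, and add the tag embedding to the word embedding (summing keeps the vectors at 768 dimensions, unlike concatenation). The tag token names and the alignment with "dog" are just assumptions for illustration:

import torch
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased")

# Register made-up POS-tag tokens and grow the embedding matrix to match.
tokenizer.add_tokens(["[NOUN]", "[VERB]"])
model.resize_token_embeddings(len(tokenizer))

inputs = tokenizer("The dog barks", return_tensors="pt")
embed = model.get_input_embeddings()
word_embeds = embed(inputs["input_ids"]).clone()  # (1, seq_len, 768)

# Add the (randomly initialized) [NOUN] embedding to the position of "dog"
# (index 2: [CLS] at 0, "the" at 1).
noun_id = tokenizer.convert_tokens_to_ids("[NOUN]")
word_embeds[0, 2] += embed(torch.tensor([noun_id]))[0]

outputs = model(inputs_embeds=word_embeds,
                attention_mask=inputs["attention_mask"])

Is either of these a sensible way to do it, given that the new embedding rows start out random and the pre-trained weights never saw such inputs?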