I am trying to add an extra dimension to the word embeddings of a pre-trained Hugging Face BERT model. The extra column represents an extra label. For example, if the original embedding of the word "dog" were [1,1,1,1,1,1,1], I might append a special column holding the label index 2 to represent 'noun', so the new embedding becomes [1,1,1,1,1,1,1,2]. I would then feed this new input [1,1,1,1,1,1,1,2] into the BERT model. How can I do this with Hugging Face?
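For concreteness, here is a rough sketch of what I have in mind, assuming I bypass input_ids and feed vectors directly through the inputs_embeds argument. Since appending a column makes the vectors 769-dimensional, I made up a linear layer that projects them back to BERT's hidden size of 768; the label values and positions are placeholders too:

import torch
import torch.nn as nn
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased")

inputs = tokenizer("The dog barks", return_tensors="pt")
# Look up the pre-trained word embeddings: shape (1, seq_len, 768).
word_embeds = model.get_input_embeddings()(inputs["input_ids"])

# One extra column per token holding the label index (2 = 'noun' here);
# "dog" sits at position 2 ([CLS] at 0, "the" at 1). Values are illustrative.
seq_len = inputs["input_ids"].shape[1]
labels = torch.zeros(1, seq_len, 1)
labels[0, 2, 0] = 2.0

augmented = torch.cat([word_embeds, labels], dim=-1)  # (1, seq_len, 769)

# Made-up projection back to 768 dimensions so BERT accepts the input.
project = nn.Linear(769, 768)
outputs = model(inputs_embeds=project(augmented),
                attention_mask=inputs["attention_mask"])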
There is something called tokenizer.add_special_tokens that extends the original vocabulary with new tokens. However, I want to combine the embedding from the original vocabulary with the embedding of the new label token. For example, I want the BERT model to understand that "dog" is a noun by connecting the embedding of "dog" to the embedding of "noun" (I sketch one attempt at this after the example below). Should I even change the input word embeddings of a pre-trained model? Or should I somehow strengthen the attention between "dog" and "noun" in the middle layers?
Here is an example of using tokenizer.add_special_tokens:
from transformers import GPT2Model, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained('gpt2')
model = GPT2Model.from_pretrained('gpt2')

# Register a new CLS-style special token and grow the embedding matrix to match.
special_tokens_dict = {'cls_token': '<CLS>'}
num_added_toks = tokenizer.add_special_tokens(special_tokens_dict)
print('We have added', num_added_toks, 'tokens')
model.resize_token_embeddings(len(tokenizer))
assert tokenizer.cls_token == '<CLS>'
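And here is my rough attempt at the "connect dog to noun" idea mentioned above: register POS tags as new tokens, resize the embedding matrix, and add the tag embedding to the word embedding (summing keeps the vectors at 768 dimensions, unlike concatenation). The tag token names and the alignment with "dog" are just assumptions for illustration:

import torch
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased")

# Register made-up POS-tag tokens and grow the embedding matrix to match.
tokenizer.add_tokens(["[NOUN]", "[VERB]"])
model.resize_token_embeddings(len(tokenizer))

inputs = tokenizer("The dog barks", return_tensors="pt")
embed = model.get_input_embeddings()
word_embeds = embed(inputs["input_ids"]).clone()  # (1, seq_len, 768)

# Add the (randomly initialized) [NOUN] embedding to the position of "dog"
# (index 2: [CLS] at 0, "the" at 1).
noun_id = tokenizer.convert_tokens_to_ids("[NOUN]")
word_embeds[0, 2] += embed(torch.tensor([noun_id]))[0]

outputs = model(inputs_embeds=word_embeds,
                attention_mask=inputs["attention_mask"])

Is either of these a sensible way to do it, given that the new embedding rows start out random and the pre-trained weights never saw such inputs?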