How to use additional input features for NER?

Actually no, because the pre-trained tokenizer only knows tokens, not tokens + POS tags. A better way to do this is to create an additional input to the model (besides input_ids and token_type_ids) called pos_tag_ids, for which you add an additional embedding layer (nn.Embedding). That way, you can sum the embeddings of the tokens, the token types and the POS tags. Let’s illustrate this for a pre-trained BERT model:

We first have to modify the BertEmbeddings class. In short, we’ll add an embedding layer for the POS tags:

class BertEmbeddings(nn.Module):
    """Construct the embeddings from word, position, token_type and POS tag embeddings."""

    def __init__(self, config):
        super().__init__()
        self.word_embeddings = nn.Embedding(config.vocab_size, config.hidden_size, padding_idx=config.pad_token_id)
        self.position_embeddings = nn.Embedding(config.max_position_embeddings, config.hidden_size)
        self.token_type_embeddings = nn.Embedding(config.type_vocab_size, config.hidden_size)

        # new: one learnable embedding vector of size hidden_size per POS tag
        self.pos_tag_embeddings = nn.Embedding(max_number_of_pos_tags, config.hidden_size)

        (...)
  
    def forward(
        self, input_ids=None, pos_tag_ids=None, token_type_ids=None, position_ids=None, inputs_embeds=None, past_key_values_length=0
    ):
        if input_ids is not None:
            input_shape = input_ids.size()
        else:
            input_shape = inputs_embeds.size()[:-1]

        seq_length = input_shape[1]

        if position_ids is None:
            position_ids = self.position_ids[:, past_key_values_length : seq_length + past_key_values_length]

        if token_type_ids is None:
            token_type_ids = torch.zeros(input_shape, dtype=torch.long, device=self.position_ids.device)

        # if no POS tags are provided, default to tag id 0 (mirrors the token_type_ids behaviour above)
        if pos_tag_ids is None:
            pos_tag_ids = torch.zeros(input_shape, dtype=torch.long, device=self.position_ids.device)

        if inputs_embeds is None:
            inputs_embeds = self.word_embeddings(input_ids)
        token_type_embeddings = self.token_type_embeddings(token_type_ids)
        pos_tag_embeddings = self.pos_tag_embeddings(pos_tag_ids)

        # sum the word, token type and POS tag embeddings
        embeddings = inputs_embeds + token_type_embeddings + pos_tag_embeddings
        if self.position_embedding_type == "absolute":
            position_embeddings = self.position_embeddings(position_ids)
            embeddings += position_embeddings
        embeddings = self.LayerNorm(embeddings)
        embeddings = self.dropout(embeddings)
        return embeddings

The max_number_of_pos_tags is the total number of unique POS tags we have (this might be 20 for example, with NNP being one of them), also called the “vocabulary size” of the embedding layer. The config.hidden_size is the size of the embedding vector that we want to learn for each POS tag (768 by default for BERT-base). We would also need to modify the forward pass of BertModel a bit so it accepts the additional input pos_tag_ids:

def forward(
    self,
    input_ids=None,
    attention_mask=None,
    token_type_ids=None,
    pos_tag_ids=None,
    position_ids=None,
    head_mask=None,
    inputs_embeds=None,
    encoder_hidden_states=None,
    encoder_attention_mask=None,
    past_key_values=None,
    use_cache=None,
    output_attentions=None,
    output_hidden_states=None,
    return_dict=None,
):

    (...)

    # pass the POS tag ids on to the (modified) embedding layer
    embedding_output = self.embeddings(
        input_ids=input_ids,
        position_ids=position_ids,
        token_type_ids=token_type_ids,
        pos_tag_ids=pos_tag_ids,
        inputs_embeds=inputs_embeds,
        past_key_values_length=past_key_values_length,
    )

    (...)
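
To make those two dimensions concrete, here is a minimal, standalone sketch of what the new embedding layer computes (the tag count of 20 and the example ids are just assumptions):

import torch
import torch.nn as nn

max_number_of_pos_tags = 20  # assumed "vocabulary size": 20 unique POS tags
hidden_size = 768            # hidden size of BERT-base

pos_tag_embeddings = nn.Embedding(max_number_of_pos_tags, hidden_size)

pos_tag_ids = torch.tensor([[3, 7, 7, 12]])   # shape (batch_size=1, sequence_length=4)
print(pos_tag_embeddings(pos_tag_ids).shape)  # torch.Size([1, 4, 768])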

Now that we have modified the model (modeling_bert.py), let’s move on to providing actual inputs to the model. An additional complexity of BERT-like models is that they rely on subword tokens rather than words. A word like “Arizona” might be tokenized into [“Ari”, “##zona”], which means we also have to provide POS tags at the token level. And just like each token is turned into an integer (input_ids), each POS tag has to be turned into a corresponding integer (pos_tag_ids) before we can feed it to the model. So we need to keep a dictionary that maps each POS tag to a corresponding integer.
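
A quick sketch of such a mapping (the tag set here is just a hypothetical example):

# hypothetical tag set; in practice you would collect all POS tags occurring in your dataset
pos_tag_vocab = ["NNP", "VNP"]
tag2id = {tag: idx for idx, tag in enumerate(pos_tag_vocab)}
id2tag = {idx: tag for tag, idx in tag2id.items()}

print(tag2id)  # {'NNP': 0, 'VNP': 1}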

For simplicity, let’s stick with just those two POS tags, NNP and VNP, and map them to the integers 0 and 1, respectively (these become the pos_tag_ids). The vocabulary size of our POS tag embedding layer is then only 2. Let’s now provide an example sentence to the model:

import torch
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
text = "She sells"
# if we tokenize it, this becomes:
encoding = tokenizer(text, return_tensors="pt")  # a dictionary with keys 'input_ids', 'token_type_ids' and 'attention_mask'
# the resulting tokens are: [CLS], she, sells, [SEP]

# we add the pos_tag_ids to the dictionary:
# "She" -> NNP (0) and "sells" -> VNP (1); the special tokens [CLS] and [SEP] also
# need an id, here we simply reuse 0 (in practice you may want a dedicated index for them)
tag2id = {"NNP": 0, "VNP": 1}
encoding['pos_tag_ids'] = torch.tensor([[0, tag2id["NNP"], tag2id["VNP"], 0]])

# next, we can provide this to our modified BertModel:
from transformers import BertModel

# note: the weights of the new pos_tag_embeddings layer are randomly initialized
# and have to be learned during fine-tuning
model = BertModel.from_pretrained("bert-base-uncased")
outputs = model(**encoding)

Note that the code above assumes that each word is turned into a single token, which is not always the case. Suppose the word Arizona is tokenized into [“Ari”, “##zona”]; then both subword tokens should get the POS tag id of that word, e.g. pos_tag_ids [0, 0].
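
A minimal sketch of that word-to-token alignment, assuming a fast tokenizer (so that word_ids() is available) and the hypothetical two-tag vocabulary from above:

import torch
from transformers import BertTokenizerFast

tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")
tag2id = {"NNP": 0, "VNP": 1}

words = ["She", "sells"]
word_pos_tags = ["NNP", "VNP"]  # one (hypothetical) POS tag per word

encoding = tokenizer(words, is_split_into_words=True, return_tensors="pt")

token_pos_tag_ids = []
for word_idx in encoding.word_ids(batch_index=0):
    if word_idx is None:
        # special tokens ([CLS], [SEP]) get a default id; here we reuse 0
        token_pos_tag_ids.append(0)
    else:
        # every subword token inherits the POS tag id of the word it belongs to
        token_pos_tag_ids.append(tag2id[word_pos_tags[word_idx]])

encoding["pos_tag_ids"] = torch.tensor([token_pos_tag_ids])
# encoding can now be passed to the modified model: model(**encoding)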
