How to use additional input features for NER?

Actually no, because the pre-trained tokenizer only knows tokens, not tokens + POS tags. A better way to do this is to create an additional input to the model (besides input_ids and token_type_ids) called pos_tag_ids, for which you add an additional embedding layer (nn.Embedding). That way, you can sum the embeddings of the tokens, the token types and the POS tags. Let’s illustrate this for a pre-trained BERT model:

We first have to modify the BertEmbeddings class. In short, we’ll add an embedding layer for the POS tags:

class BertEmbeddings(nn.Module):
    """Construct the embeddings from word, position, token_type and POS tag embeddings."""

    def __init__(self, config):
        super().__init__()
        self.word_embeddings = nn.Embedding(config.vocab_size, config.hidden_size, padding_idx=config.pad_token_id)
        self.position_embeddings = nn.Embedding(config.max_position_embeddings, config.hidden_size)
        self.token_type_embeddings = nn.Embedding(config.type_vocab_size, config.hidden_size)

        # new: one learnable embedding vector of size hidden_size per POS tag
        self.pos_tag_embeddings = nn.Embedding(max_number_of_pos_tags, config.hidden_size)

        (...)
  
    def forward(
        self, input_ids=None, pos_tag_ids=None, token_type_ids=None, position_ids=None, inputs_embeds=None, past_key_values_length=0
    ):
        if input_ids is not None:
            input_shape = input_ids.size()
        else:
            input_shape = inputs_embeds.size()[:-1]

        seq_length = input_shape[1]

        if position_ids is None:
            position_ids = self.position_ids[:, past_key_values_length : seq_length + past_key_values_length]

        if token_type_ids is None:
            token_type_ids = torch.zeros(input_shape, dtype=torch.long, device=self.position_ids.device)

        # if no POS tags are provided, default to tag id 0 (mirrors the token_type_ids behaviour above)
        if pos_tag_ids is None:
            pos_tag_ids = torch.zeros(input_shape, dtype=torch.long, device=self.position_ids.device)

        if inputs_embeds is None:
            inputs_embeds = self.word_embeddings(input_ids)
        token_type_embeddings = self.token_type_embeddings(token_type_ids)
        pos_tag_embeddings = self.pos_tag_embeddings(pos_tag_ids)

        # sum the word, token type and POS tag embeddings
        embeddings = inputs_embeds + token_type_embeddings + pos_tag_embeddings
        if self.position_embedding_type == "absolute":
            position_embeddings = self.position_embeddings(position_ids)
            embeddings += position_embeddings
        embeddings = self.LayerNorm(embeddings)
        embeddings = self.dropout(embeddings)
        return embeddings

The max_number_of_pos_tags is the total number of unique POS tags we have (this might be 20 for example, with NNP being one of them), also called the “vocabulary size” of the embedding layer. The config.hidden_size is the size of the embedding vector that we want to learn for each POS tag (768 by default for BERT-base). We would also need to modify the forward pass of BertModel a bit so it accepts the additional input pos_tag_ids:

def forward(
    self,
    input_ids=None,
    attention_mask=None,
    token_type_ids=None,
    pos_tag_ids=None,
    position_ids=None,
    head_mask=None,
    inputs_embeds=None,
    encoder_hidden_states=None,
    encoder_attention_mask=None,
    past_key_values=None,
    use_cache=None,
    output_attentions=None,
    output_hidden_states=None,
    return_dict=None,
):

    (...)

    # pass the POS tag ids on to the (modified) embedding layer
    embedding_output = self.embeddings(
        input_ids=input_ids,
        position_ids=position_ids,
        token_type_ids=token_type_ids,
        pos_tag_ids=pos_tag_ids,
        inputs_embeds=inputs_embeds,
        past_key_values_length=past_key_values_length,
    )

    (...)
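
To make those two dimensions concrete, here is a minimal, standalone sketch of what the new embedding layer computes (the tag count of 20 and the example ids are just assumptions):

import torch
import torch.nn as nn

max_number_of_pos_tags = 20  # assumed "vocabulary size": 20 unique POS tags
hidden_size = 768            # hidden size of BERT-base

pos_tag_embeddings = nn.Embedding(max_number_of_pos_tags, hidden_size)

pos_tag_ids = torch.tensor([[3, 7, 7, 12]])   # shape (batch_size=1, sequence_length=4)
print(pos_tag_embeddings(pos_tag_ids).shape)  # torch.Size([1, 4, 768])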

Now that we have modified the model (modeling_bert.py), let’s move on to providing actual inputs to the model. An additional complexity of BERT-like models is that they rely on subword tokens rather than words. A word like “Arizona” might be tokenized into [“Ari”, “##zona”], which means we also have to provide POS tags at the token level. And just like each token is turned into an integer (input_ids), each POS tag has to be turned into a corresponding integer (pos_tag_ids) before we can feed it to the model. So we need to keep a dictionary that maps each POS tag to a corresponding integer.
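
A quick sketch of such a mapping (the tag set here is just a hypothetical example):

# hypothetical tag set; in practice you would collect all POS tags occurring in your dataset
pos_tag_vocab = ["NNP", "VNP"]
tag2id = {tag: idx for idx, tag in enumerate(pos_tag_vocab)}
id2tag = {idx: tag for tag, idx in tag2id.items()}

print(tag2id)  # {'NNP': 0, 'VNP': 1}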

For simplicity, let’s stick with just those two POS tags, NNP and VNP, and map them to the integers 0 and 1, respectively (these become the pos_tag_ids). The vocabulary size of our POS tag embedding layer is then only 2. Let’s now provide an example sentence to the model:

import torch
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
text = "She sells"
# if we tokenize it, this becomes:
encoding = tokenizer(text, return_tensors="pt")  # a dictionary with keys 'input_ids', 'token_type_ids' and 'attention_mask'
# the resulting tokens are: [CLS], she, sells, [SEP]

# we add the pos_tag_ids to the dictionary:
# "She" -> NNP (0) and "sells" -> VNP (1); the special tokens [CLS] and [SEP] also
# need an id, here we simply reuse 0 (in practice you may want a dedicated index for them)
tag2id = {"NNP": 0, "VNP": 1}
encoding['pos_tag_ids'] = torch.tensor([[0, tag2id["NNP"], tag2id["VNP"], 0]])

# next, we can provide this to our modified BertModel:
from transformers import BertModel

# note: the weights of the new pos_tag_embeddings layer are randomly initialized
# and have to be learned during fine-tuning
model = BertModel.from_pretrained("bert-base-uncased")
outputs = model(**encoding)

Note that the code above assumes that each word is turned into a single token, which is not always the case. Suppose the word Arizona is tokenized into [“Ari”, “##zona”]; then both subword tokens should get the POS tag id of that word, e.g. pos_tag_ids [0, 0].
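
A minimal sketch of that word-to-token alignment, assuming a fast tokenizer (so that word_ids() is available) and the hypothetical two-tag vocabulary from above:

import torch
from transformers import BertTokenizerFast

tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")
tag2id = {"NNP": 0, "VNP": 1}

words = ["She", "sells"]
word_pos_tags = ["NNP", "VNP"]  # one (hypothetical) POS tag per word

encoding = tokenizer(words, is_split_into_words=True, return_tensors="pt")

token_pos_tag_ids = []
for word_idx in encoding.word_ids(batch_index=0):
    if word_idx is None:
        # special tokens ([CLS], [SEP]) get a default id; here we reuse 0
        token_pos_tag_ids.append(0)
    else:
        # every subword token inherits the POS tag id of the word it belongs to
        token_pos_tag_ids.append(tag2id[word_pos_tags[word_idx]])

encoding["pos_tag_ids"] = torch.tensor([token_pos_tag_ids])
# encoding can now be passed to the modified model: model(**encoding)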
