How to use additional input features for NER?

Actually, my tutorial above was a bit simplistic (I can't seem to edit it anymore). Let's take a more realistic example. Suppose you have a list of words like ["My", "name", "is", "Niels"], with corresponding POS tags [DET, NOUN, AUX, PROPN]. Here's how to prepare the additional input features:

from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

words = ["My", "name", "is", "Niels"]
pos_tags = ["DET", "NOUN", "AUX", "PROPN"]

tag2id = {'NA': 0, 'DET': 1, 'NOUN': 2, 'AUX': 3, 'PROPN': 4}
id2tag = {v: k for k, v in tag2id.items()}

tokens = []
pos_tag_tokens = []
for word, tag in zip(words, pos_tags):
  # tokenize the word
  word_tokens = tokenizer.tokenize(word)
  tokens.extend(word_tokens)
  # copy the POS tag for all word tokens
  pos_tag_tokens.extend([tag for _ in range(len(word_tokens))])

# Truncation: account for [CLS] and [SEP] with "- 2". 
special_tokens_count = 2 
max_seq_length = 512
if len(tokens) > max_seq_length - special_tokens_count:
    tokens = tokens[: (max_seq_length - special_tokens_count)]
    pos_tag_tokens = pos_tag_tokens[: (max_seq_length - special_tokens_count)]

# add special tokens + corresponding POS tags
tokens = [tokenizer.cls_token] + tokens + [tokenizer.sep_token]
pos_tag_tokens = ['NA'] + pos_tag_tokens + ['NA']

# create input_ids + attention_mask
input_ids = tokenizer.convert_tokens_to_ids(tokens)
attention_mask = [1] * len(input_ids)
print(pos_tag_tokens)
pos_tag_ids = [tag2id[tag] for tag in pos_tag_tokens]

# padding up to max_seq_length
padding_length = max_seq_length - len(input_ids)
input_ids += [tokenizer.pad_token_id] * padding_length
attention_mask += [0] * padding_length
pos_tag_ids += [0] * padding_length
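
As a quick sanity check, all three feature sequences should now have the same length:

# all features must be padded to the same length so they can be batched together
assert len(input_ids) == len(attention_mask) == len(pos_tag_ids) == max_seq_length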

print(tokenizer.convert_ids_to_tokens(input_ids))
print(pos_tag_ids)

In reality, we also need POS tag IDs for the special tokens ([CLS], [SEP] and [PAD]). I'm setting the POS tag ID for these to 0, which means "NA" (not applicable). Moreover, a single word can be tokenized into several subword tokens, hence we must repeat the word-level features for each token of a given word.
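
To make the subword point concrete (the exact split depends on the tokenizer's vocabulary, so the output below is only illustrative):

word_tokens = tokenizer.tokenize("Niels")
print(word_tokens)  # the tokenizer may split this into multiple subword tokens
# the word-level tag is repeated once per subword token
print(["PROPN" for _ in word_tokens])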

Now we can give this as input to the model. One caveat: the standard BertForTokenClassification does not accept a pos_tag_ids argument, so we need a custom model that takes the extra features into account.
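
Here's a minimal sketch of such a model, assuming we embed the POS tag IDs and concatenate that embedding with BERT's output before the token classification head. The class name, embedding dimension and dropout value are placeholders I chose for illustration, not an official Transformers API:

import torch
import torch.nn as nn
from transformers import BertModel

class BertForTokenClassificationWithPOS(nn.Module):
    def __init__(self, num_labels, num_pos_tags=5, pos_emb_dim=32):
        super().__init__()
        self.bert = BertModel.from_pretrained("bert-base-uncased")
        # learned embedding for the POS tag ids prepared above (num_pos_tags = len(tag2id))
        self.pos_embedding = nn.Embedding(num_pos_tags, pos_emb_dim)
        self.dropout = nn.Dropout(0.1)
        # classify each token from [BERT hidden state ; POS embedding]
        self.classifier = nn.Linear(self.bert.config.hidden_size + pos_emb_dim, num_labels)

    def forward(self, input_ids, attention_mask, pos_tag_ids, labels=None):
        outputs = self.bert(input_ids=input_ids, attention_mask=attention_mask)
        sequence_output = outputs.last_hidden_state        # (batch, seq_len, hidden_size)
        pos_embeds = self.pos_embedding(pos_tag_ids)       # (batch, seq_len, pos_emb_dim)
        features = torch.cat([sequence_output, pos_embeds], dim=-1)
        logits = self.classifier(self.dropout(features))   # (batch, seq_len, num_labels)
        if labels is not None:
            loss_fct = nn.CrossEntropyLoss()
            loss = loss_fct(logits.view(-1, logits.size(-1)), labels.view(-1))
            return loss, logits
        return logits

With this model defined, the forward pass looks like this: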

import torch

# instantiate the custom model sketched above
model = BertForTokenClassificationWithPOS(num_labels=9)  # e.g. 9 BIO labels for CoNLL-2003-style NER

input_ids = torch.tensor(input_ids).unsqueeze(0) # batch size of 1
attention_mask = torch.tensor(attention_mask).unsqueeze(0) # batch size of 1
pos_tag_ids = torch.tensor(pos_tag_ids).unsqueeze(0) # batch size of 1

outputs = model(input_ids=input_ids, attention_mask=attention_mask, pos_tag_ids=pos_tag_ids)
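
Here, outputs are the token-level logits of shape (1, max_seq_length, num_labels); if you also pass labels=..., the sketch above returns a (loss, logits) tuple you can backpropagate through during training.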