Actually no, because the pre-trained tokenizer only knows tokens, not tokens + POS tags. A better way to do this would be to create an additional input to the model (besides input_ids and token_type_ids) called pos_tag_ids, for which you can add an additional embedding layer (nn.Embedding). That way, you can sum the embeddings of the tokens, the token types and the POS tags. Let’s illustrate this for a pre-trained BERT model:
We first have to modify the BertEmbeddings class. In short, we’ll add an embedding layer for the POS tags:
class BertEmbeddings(nn.Module):
    """Construct the embeddings from word, position, token_type and POS tag embeddings."""

    def __init__(self, config):
        super().__init__()
        self.word_embeddings = nn.Embedding(config.vocab_size, config.hidden_size, padding_idx=config.pad_token_id)
        self.position_embeddings = nn.Embedding(config.max_position_embeddings, config.hidden_size)
        self.token_type_embeddings = nn.Embedding(config.type_vocab_size, config.hidden_size)
        # new: one learnable embedding vector per POS tag
        self.pos_tag_embeddings = nn.Embedding(max_number_of_pos_tags, config.hidden_size)
        (...)

    def forward(
        self, input_ids=None, pos_tag_ids=None, token_type_ids=None, position_ids=None, inputs_embeds=None, past_key_values_length=0
    ):
        if input_ids is not None:
            input_shape = input_ids.size()
        else:
            input_shape = inputs_embeds.size()[:-1]

        seq_length = input_shape[1]

        if position_ids is None:
            position_ids = self.position_ids[:, past_key_values_length : seq_length + past_key_values_length]

        if token_type_ids is None:
            token_type_ids = torch.zeros(input_shape, dtype=torch.long, device=self.position_ids.device)

        if inputs_embeds is None:
            inputs_embeds = self.word_embeddings(input_ids)
        token_type_embeddings = self.token_type_embeddings(token_type_ids)
        # new: look up the POS tag embeddings
        pos_tag_embeddings = self.pos_tag_embeddings(pos_tag_ids)

        # sum the word, token type and POS tag embeddings
        embeddings = inputs_embeds + token_type_embeddings + pos_tag_embeddings
        if self.position_embedding_type == "absolute":
            position_embeddings = self.position_embeddings(position_ids)
            embeddings += position_embeddings
        embeddings = self.LayerNorm(embeddings)
        embeddings = self.dropout(embeddings)
        return embeddings
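As a quick standalone sanity check (this is not part of modeling_bert.py; the values 20 and 768 and the token ids below are just made-up examples), the POS tag embedding layer maps a (batch_size, seq_len) tensor of pos_tag_ids to a (batch_size, seq_len, hidden_size) tensor, the same shape as the word embeddings, so the two can simply be summed:

import torch
import torch.nn as nn

# example values: 20 distinct POS tags, hidden size 768 (BERT-base)
pos_tag_embeddings = nn.Embedding(20, 768)
word_embeddings = nn.Embedding(30522, 768)  # 30522 = bert-base-uncased vocabulary size

input_ids = torch.tensor([[2016, 15187]])  # arbitrary token ids, shape (batch_size=1, seq_len=2)
pos_tag_ids = torch.tensor([[0, 1]])       # one POS tag id per token, same shape

summed = word_embeddings(input_ids) + pos_tag_embeddings(pos_tag_ids)
print(summed.shape)  # torch.Size([1, 2, 768])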
The max_number_of_pos_tags is the total number of unique POS tags we have (it might be 20 for example, with NNP being one of them), also called the “vocabulary size” of the embedding layer. The config.hidden_size is the size of the embedding vector that we want to learn for each POS tag (which is 768 by default for BERT-base). We would also need to modify the forward pass of BertModel a bit to add the additional input pos_tag_ids:
def forward(
    self,
    input_ids=None,
    attention_mask=None,
    token_type_ids=None,
    pos_tag_ids=None,  # new input
    position_ids=None,
    head_mask=None,
    inputs_embeds=None,
    encoder_hidden_states=None,
    encoder_attention_mask=None,
    past_key_values=None,
    use_cache=None,
    output_attentions=None,
    output_hidden_states=None,
    return_dict=None,
):
    (...)

    embedding_output = self.embeddings(
        input_ids=input_ids,
        position_ids=position_ids,
        token_type_ids=token_type_ids,
        pos_tag_ids=pos_tag_ids,  # new: forward the POS tag ids to BertEmbeddings
        inputs_embeds=inputs_embeds,
        past_key_values_length=past_key_values_length,
    )

    (...)
Now that we have modified the model (modeling_bert.py), let’s move on to providing actual inputs to the model. An additional complexity of BERT-like models is that they rely on subword tokens rather than words, so a word like “Arizona” might be tokenized into [“Ari”, “##zona”]. This means that we will also have to provide POS tags at the token level. Similar to how each token is turned into an integer (input_ids), we will also have to turn each POS tag into a corresponding integer (pos_tag_ids) in order to provide it to the model. So we would actually need to keep a dictionary that maps each POS tag to a corresponding integer.
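For illustration, here is a minimal sketch of such a mapping (the tag list is a made-up example; in practice you would use whatever tag set your POS tagger produces). Its size is also what we plugged in as max_number_of_pos_tags in the embedding layer above:

# hypothetical tag set produced by your POS tagger
unique_pos_tags = ["NNP", "VNP", "DT", "JJ"]

tag2id = {tag: idx for idx, tag in enumerate(unique_pos_tags)}
# {'NNP': 0, 'VNP': 1, 'DT': 2, 'JJ': 3}

max_number_of_pos_tags = len(tag2id)  # the "vocabulary size" of the POS tag embedding layer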
For simplicity, let’s assume that we only have two POS tags, namely NNP and VNP. We create corresponding integers (pos_tag_ids) for them, for example [0, 1]. So the vocabulary size of our POS tag embedding layer is only 2. Let’s now provide an example sentence to the model:
import torch
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

text = "She sells"
# if we tokenize it, this creates a dictionary with keys 'input_ids', 'token_type_ids' and 'attention_mask':
encoding = tokenizer(text, return_tensors="pt")

# we add the pos_tag_ids to the dictionary;
# note that the tokenizer also adds [CLS] and [SEP], so the sequence is [CLS] she sells [SEP]
# and we need one POS tag id per token (here we simply reuse id 0 for the special tokens)
pos_tags = ["NNP", "VNP"]
encoding["pos_tag_ids"] = torch.tensor([[0, 0, 1, 0]])

# next, we can provide this to our modified BertModel:
from transformers import BertModel

model = BertModel.from_pretrained("bert-base-uncased")
outputs = model(**encoding)
Note that the code above assumes that each word is turned into a single token, which is not always the case. Suppose for example that the word Arizona (tagged NNP, id 0) is tokenized into [“Ari”, “##zona”]; we would then repeat the tag id for each subword piece, giving pos_tag_ids [0, 0] for those two tokens.
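One possible way to automate this repetition, sketched below, is to use the word_ids() method of a fast tokenizer (BertTokenizerFast); the words, tags and special-token handling here are just illustrative assumptions, reusing the two-tag mapping from the example above (NNP → 0, VNP → 1):

import torch
from transformers import BertTokenizerFast

tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")

# hypothetical word-level annotations
words = ["Arizona", "sells"]
word_pos_tags = ["NNP", "VNP"]
tag2id = {"NNP": 0, "VNP": 1}

encoding = tokenizer(words, is_split_into_words=True, return_tensors="pt")

# word_ids() maps every token back to the word it came from (None for [CLS] and [SEP])
pos_tag_ids = [
    tag2id[word_pos_tags[word_idx]] if word_idx is not None else 0  # reuse id 0 for special tokens
    for word_idx in encoding.word_ids()
]
encoding["pos_tag_ids"] = torch.tensor([pos_tag_ids])
# if "Arizona" is split into two subword pieces, both pieces get tag id 0,
# e.g. [CLS] ari ##zona sells [SEP] -> [[0, 0, 0, 1, 0]]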