How to use additional input features for NER?

Wondering if you have the final implementation available?

Hello, I’m also looking for how to add POS features as additional input embeddings. Could you please tell me how to modify the class? Should I modify the .py file directly or there is an API? Thank you.

1 Like

Hi f3n1Xx, I’m still stuck at the same problem and it’s also for a biological problem. There is no code in the link you provided. Would you mind posting here how you added additional features. If you can’t release the entire code could you just show a simplified example please? Thank you!

1 Like

follow the nielsr, I write the code, hope to help you. @barruch @SilasCheung

from transformers.models.bert.modeling_bert import BertEncoder, BertPooler, BertEmbeddings, BaseModelOutputWithPoolingAndCrossAttentions
from transformers import BertConfig, BertModel
import torch
from torch import nn

class BertEmbeddingsV2(BertEmbeddings):
    def __init__(self, config):
        self.word_embeddings = nn.Embedding(config.vocab_size, config.hidden_size, padding_idx=config.pad_token_id)
        self.position_embeddings = nn.Embedding(config.max_position_embeddings, config.hidden_size)
        self.token_type_embeddings = nn.Embedding(config.type_vocab_size, config.hidden_size)
        max_number_of_pos_tags = 2
        self.pos_tag_embeddings = nn.Embedding(max_number_of_pos_tags, config.hidden_size)
        # self.LayerNorm is not snake-cased to stick with TensorFlow model variable name and be able to load
        # any TensorFlow checkpoint file
        self.LayerNorm = nn.LayerNorm(config.hidden_size, eps=config.layer_norm_eps)
        self.dropout = nn.Dropout(config.hidden_dropout_prob)

        # position_ids (1, len position emb) is contiguous in memory and exported when serialized
        self.register_buffer("position_ids", torch.arange(config.max_position_embeddings).expand((1, -1)))
        self.position_embedding_type = getattr(config, "position_embedding_type", "absolute")

    def forward(self, input_ids=None, pos_tag_ids=None, token_type_ids=None, position_ids=None, inputs_embeds=None, past_key_values_length=0):
        if input_ids is not None:
            input_shape = input_ids.size()
            input_shape = inputs_embeds.size()[:-1]

        seq_length = input_shape[1]

        if position_ids is None:
            position_ids = self.position_ids[:, past_key_values_length : seq_length + past_key_values_length]

        if token_type_ids is None:
            token_type_ids = torch.zeros(input_shape, dtype=torch.long, device=self.position_ids.device)

        if inputs_embeds is None:
            inputs_embeds = self.word_embeddings(input_ids)
        token_type_embeddings = self.token_type_embeddings(token_type_ids)
        pos_tag_embeddings = self.pos_tag_embeddings(pos_tag_ids)

        embeddings = inputs_embeds + token_type_embeddings + pos_tag_embeddings
        if self.position_embedding_type == "absolute":
            position_embeddings = self.position_embeddings(position_ids)
            embeddings += position_embeddings
        embeddings = self.LayerNorm(embeddings)
        embeddings = self.dropout(embeddings)
        return embeddings

class BertModelV2(BertModel):
    The model can behave as an encoder (with only self-attention) as well as a decoder, in which case a layer of
    cross-attention is added between the self-attention layers, following the architecture described in `Attention is
    all you need <>`__ by Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit,
    Llion Jones, Aidan N. Gomez, Lukasz Kaiser and Illia Polosukhin.
    To behave as an decoder the model needs to be initialized with the :obj:`is_decoder` argument of the configuration
    set to :obj:`True`. To be used in a Seq2Seq model, the model needs to initialized with both :obj:`is_decoder`
    argument and :obj:`add_cross_attention` set to :obj:`True`; an :obj:`encoder_hidden_states` is then expected as an
    input to the forward pass.

    def __init__(self, config, add_pooling_layer=True):
        self.config = config

        self.embeddings = BertEmbeddingsV2(config)
        self.encoder = BertEncoder(config)

        self.pooler = BertPooler(config) if add_pooling_layer else None

    def forward(
        encoder_hidden_states  (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, sequence_length, hidden_size)`, `optional`):
            Sequence of hidden-states at the output of the last layer of the encoder. Used in the cross-attention if
            the model is configured as a decoder.
        encoder_attention_mask (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, sequence_length)`, `optional`):
            Mask to avoid performing attention on the padding token indices of the encoder input. This mask is used in
            the cross-attention if the model is configured as a decoder. Mask values selected in ``[0, 1]``:
            - 1 for tokens that are **not masked**,
            - 0 for tokens that are **masked**.
        past_key_values (:obj:`tuple(tuple(torch.FloatTensor))` of length :obj:`config.n_layers` with each tuple having 4 tensors of shape :obj:`(batch_size, num_heads, sequence_length - 1, embed_size_per_head)`):
            Contains precomputed key and value hidden states of the attention blocks. Can be used to speed up decoding.
            If :obj:`past_key_values` are used, the user can optionally input only the last :obj:`decoder_input_ids`
            (those that don't have their past key value states given to this model) of shape :obj:`(batch_size, 1)`
            instead of all :obj:`decoder_input_ids` of shape :obj:`(batch_size, sequence_length)`.
        use_cache (:obj:`bool`, `optional`):
            If set to :obj:`True`, :obj:`past_key_values` key value states are returned and can be used to speed up
            decoding (see :obj:`past_key_values`).
        output_attentions = output_attentions if output_attentions is not None else self.config.output_attentions
        output_hidden_states = (
            output_hidden_states if output_hidden_states is not None else self.config.output_hidden_states
        return_dict = return_dict if return_dict is not None else self.config.use_return_dict

        if self.config.is_decoder:
            use_cache = use_cache if use_cache is not None else self.config.use_cache
            use_cache = False

        if input_ids is not None and inputs_embeds is not None:
            raise ValueError("You cannot specify both input_ids and inputs_embeds at the same time")
        elif input_ids is not None:
            input_shape = input_ids.size()
            batch_size, seq_length = input_shape
        elif inputs_embeds is not None:
            input_shape = inputs_embeds.size()[:-1]
            batch_size, seq_length = input_shape
            raise ValueError("You have to specify either input_ids or inputs_embeds")

        device = input_ids.device if input_ids is not None else inputs_embeds.device

        # past_key_values_length
        past_key_values_length = past_key_values[0][0].shape[2] if past_key_values is not None else 0

        if attention_mask is None:
            attention_mask = torch.ones(((batch_size, seq_length + past_key_values_length)), device=device)
        if token_type_ids is None:
            token_type_ids = torch.zeros(input_shape, dtype=torch.long, device=device)

        # We can provide a self-attention mask of dimensions [batch_size, from_seq_length, to_seq_length]
        # ourselves in which case we just need to make it broadcastable to all heads.
        extended_attention_mask: torch.Tensor = self.get_extended_attention_mask(attention_mask, input_shape, device)

        # If a 2D or 3D attention mask is provided for the cross-attention
        # we need to make broadcastable to [batch_size, num_heads, seq_length, seq_length]
        if self.config.is_decoder and encoder_hidden_states is not None:
            encoder_batch_size, encoder_sequence_length, _ = encoder_hidden_states.size()
            encoder_hidden_shape = (encoder_batch_size, encoder_sequence_length)
            if encoder_attention_mask is None:
                encoder_attention_mask = torch.ones(encoder_hidden_shape, device=device)
            encoder_extended_attention_mask = self.invert_attention_mask(encoder_attention_mask)
            encoder_extended_attention_mask = None

        # Prepare head mask if needed
        # 1.0 in head_mask indicate we keep the head
        # attention_probs has shape bsz x n_heads x N x N
        # input head_mask has shape [num_heads] or [num_hidden_layers x num_heads]
        # and head_mask is converted to shape [num_hidden_layers x batch x num_heads x seq_length x seq_length]
        head_mask = self.get_head_mask(head_mask, self.config.num_hidden_layers)

        embedding_output = self.embeddings(
        encoder_outputs = self.encoder(
        sequence_output = encoder_outputs[0]
        pooled_output = self.pooler(sequence_output) if self.pooler is not None else None

        if not return_dict:
            return (sequence_output, pooled_output) + encoder_outputs[1:]

        return BaseModelOutputWithPoolingAndCrossAttentions(

if __name__ == "__main__":
    from transformers import BertTokenizer

    tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
    text = "She sells"
    # if we tokenize it, this becomes:
    encoding = tokenizer(text, return_tensors="pt") # this creates a dictionary with keys 'input_ids' etc.
    # we add the pos_tag_ids to the dictionary
    # pos_tags = [NNP, VNP]
    encoding['pos_tag_ids'] = torch.tensor([[0, 1, 1, 0]])

    # next, we can provide this to our modified BertModel:
    config = BertConfig()
    model = BertModelV2.from_pretrained("bert-base-uncased", config=config)

    outputs = model(**encoding)

I write the code below, hope to help

OK guys, Let me explain you my issue very clearly. I have two arrays lists. One is list input sentence and other is a list of categories belong to respective sentence.

comment = ["I hate the food","Service is bad","Service is bad but food is good","[![Food in delicious][1]][1]"]
customer_name = ["umar","adil","cameron","Daniel"]

Now its mapping is like this

enter image description here

This is just a sample dataset. I have thousands of records in my dataset.

Actually I am doing sentiment analysis using BERT Model. But now I want to add customer_name in as an additional feature in my BERT Model

Right now this is how I am tokenising my input sentence and generating input_ids and attention_masks

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
input_ids = []
attention_masks = []
for sentence in df['Comment'].tolist():
    dictionary = tokenizer.encode_plus(
                        add_special_tokens = True,
                        max_length = 512,
                        pad_to_max_length = True,
                        return_attention_mask = True,
                        return_tensors = 'pt',
    # encode_plus returns a dictionary 

You can clearly see that I am passing the comment as an input and this is function is return input_ids and attention_masks of each comment and I will create the input_ids layer and attention masks layer and feed it into my BERT Model. But now I want to customer name as an additional feature in my BERT Model.

I asked few experts before and they told me that traditional way is to hold an embedding for each category and concatenate (or any other method to combine features) it with the contextualized output of BERT before you feed it to the classification layer.

I searched about it on internet but this is what I found

# batch with two sentences (i.e. the citation text you have already used) 
i = t(["paper title 1", "paper title 2"], padding=True, return_tensors="pt")

# We assume that the first sentence (i.e. paper title 1) belongs to category 23 and the second sentence to category 42
# You probably want to use a dictionary in your own code 
i["categorical_feature_ids"] = torch.tensor([23,42])

Because I am not an expert in ML, so I have big confusion here

Where we will give our additional feature(Customer Name) as an input. For example, we can see that we are giving our input(Comment) in tokenize method of Bert tokenizer. But what about other features(Customer Name). It’s ok that we can created extra embedding layer for each feature but how will we inject those features (as an input) in our model.

Above piece of code is not enough for me to understand.

Could this also be applied to classification using BERT? I want to classify review text as “fake” or “real” using BERT but want to add additional features such as POS tagging.

Hi @margootje123 ,

yes you can for instance first embed the text using BERT (get the last_hidden_state as this serves as a good text representation), and then concatenate that with additional custom features.

Regarding POS tags, you can use a one-hot encoding or a new nn.Embedding layer and provide those to the classification head.