Improving NER with BERT by also performing POS tagging

Hi everyone,
I’m fine-tuning BERT to perform a NER task.
I’m wondering: if I also fine-tune the same BERT model on a POS tagging task, could the performance on the NER task be improved?
To clarify the question:

import torch.nn as nn
import transformers

import config  # project config module holding BASE_MODEL_PATH (defined elsewhere)


class EntityModel(nn.Module):
    def __init__(self, num_tag, num_pos):
        super(EntityModel, self).__init__()
        self.num_tag = num_tag
        self.num_pos = num_pos
        self.bert = transformers.BertModel.from_pretrained(config.BASE_MODEL_PATH)
        self.bert_drop_1 = nn.Dropout(0.3)
        self.bert_drop_2 = nn.Dropout(0.3)
        # two task-specific heads on top of the shared BERT encoder
        self.out_tag = nn.Linear(768, self.num_tag)
        self.out_pos = nn.Linear(768, self.num_pos)

    def forward(self, ids, mask, token_type_ids, target_pos, target_tag):
        # return_dict=False so the call returns a (sequence_output, pooled_output) tuple
        o1, _ = self.bert(
            ids, attention_mask=mask, token_type_ids=token_type_ids, return_dict=False
        )

        bo_tag = self.bert_drop_1(o1)
        bo_pos = self.bert_drop_2(o1)

        tag = self.out_tag(bo_tag)  # (batch, seq_len, num_tag)
        pos = self.out_pos(bo_pos)  # (batch, seq_len, num_pos)

        # loss_fn is a token-level loss defined outside this class
        loss_tag = loss_fn(tag, target_tag, mask, self.num_tag)
        loss_pos = loss_fn(pos, target_pos, mask, self.num_pos)

        # equal-weight average of the two task losses
        loss = (loss_tag + loss_pos) / 2

        return tag, pos, loss

This is my model. It takes target_pos, target_tag and the train data (the sentences), which are the same for both tasks.
For example:

Sentence #,Word,POS,Tag
Sentence: 1,Thousands,NNS,O
,of,IN,O
,demonstrators,NNS,O
,have,VBP,O
,marched,VBN,O
,through,IN,O
,London,NNP,B-geo
,to,TO,O
,protest,VB,O
,the,DT,O
,war,NN,O
,in,IN,O
,Iraq,NNP,B-geo
,and,CC,O
,demand,VB,O
,the,DT,O
,withdrawal,NN,O
,of,IN,O
,British,JJ,B-gpe
,troops,NNS,O
,from,IN,O
,that,DT,O
,country,NN,O
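
Here the "Sentence #" column is only filled on the first word of each sentence. For illustration, a minimal sketch of how rows like these could be grouped back into per-sentence word/POS/tag lists (the file name ner_dataset.csv is just a placeholder):

import pandas as pd

# placeholder file name; assumes the CSV layout shown above
df = pd.read_csv("ner_dataset.csv")

# "Sentence #" is only present on the first row of each sentence,
# so forward-fill it before grouping
df["Sentence #"] = df["Sentence #"].ffill()

grouped = df.groupby("Sentence #", sort=False)
sentences = grouped["Word"].apply(list).tolist()
pos_labels = grouped["POS"].apply(list).tolist()
tag_labels = grouped["Tag"].apply(list).tolist()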

Two feed-forward layers are used for the POS and NER predictions respectively, but the token representations come from the same BERT model.
Then, the total loss is computed as the average of the losses of the two individual tasks.
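loss_fn is defined outside the model. As a minimal sketch, assuming it is a token-level cross-entropy that ignores padded positions (the actual implementation may differ), it could look like this:

import torch
import torch.nn as nn

def loss_fn(output, target, mask, num_labels):
    # token-level cross-entropy that only scores positions where mask == 1
    lfn = nn.CrossEntropyLoss()
    active_loss = mask.view(-1) == 1
    active_logits = output.view(-1, num_labels)
    # padded positions get ignore_index so they do not contribute to the loss
    active_labels = torch.where(
        active_loss,
        target.view(-1),
        torch.tensor(lfn.ignore_index).type_as(target),
    )
    return lfn(active_logits, active_labels)

Averaging the two losses gives both tasks equal weight; a weighted sum would be the usual knob to turn if one task should matter more than the other.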

So, my question is:
Could the NER accuracy/precision/recall etc. be improved by performing the POS task too?
Many thanks in advance :slight_smile:


It’s a reasonable assumption. I remember a paper from Sebastian Ruder showing that multi-task learners perform better on downstream tasks, so I would expect this to give better results.
You need to experiment to be sure though :wink:


Thanks for your answer @sgugger.
Could you send me the link to that paper?
Yeah, testing is always the best solution, but in my case experimenting with both solutions would take a lot of effort to change the code and do the labeling :sweat_smile:, so I would like to get some evidence first and then decide.

I don’t remember the name of the paper, sorry.