Improving NER with BERT by also performing POS tagging

Hi everyone,
I’m fine-tuning BERT to perform a NER task.
I’m wondering: if I also fine-tune the same BERT model on a POS tagging task, could the performance on the NER task be improved?
To clarify the question:

import torch.nn as nn
import transformers

import config  # project config module holding BASE_MODEL_PATH (defined elsewhere)


class EntityModel(nn.Module):
    def __init__(self, num_tag, num_pos):
        super(EntityModel, self).__init__()
        self.num_tag = num_tag
        self.num_pos = num_pos
        self.bert = transformers.BertModel.from_pretrained(config.BASE_MODEL_PATH)
        self.bert_drop_1 = nn.Dropout(0.3)
        self.bert_drop_2 = nn.Dropout(0.3)
        # two task-specific heads on top of the shared BERT encoder
        self.out_tag = nn.Linear(768, self.num_tag)
        self.out_pos = nn.Linear(768, self.num_pos)

    def forward(self, ids, mask, token_type_ids, target_pos, target_tag):
        # return_dict=False so the call returns a (sequence_output, pooled_output) tuple
        o1, _ = self.bert(
            ids, attention_mask=mask, token_type_ids=token_type_ids, return_dict=False
        )

        bo_tag = self.bert_drop_1(o1)
        bo_pos = self.bert_drop_2(o1)

        tag = self.out_tag(bo_tag)  # (batch, seq_len, num_tag)
        pos = self.out_pos(bo_pos)  # (batch, seq_len, num_pos)

        # loss_fn is a token-level loss defined outside this class
        loss_tag = loss_fn(tag, target_tag, mask, self.num_tag)
        loss_pos = loss_fn(pos, target_pos, mask, self.num_pos)

        # equal-weight average of the two task losses
        loss = (loss_tag + loss_pos) / 2

        return tag, pos, loss

This is my model. It takes target_pos, target_tag and the train data (the sentences), which are the same for both tasks.
For example:

Sentence #,Word,POS,Tag
Sentence: 1,Thousands,NNS,O
,of,IN,O
,demonstrators,NNS,O
,have,VBP,O
,marched,VBN,O
,through,IN,O
,London,NNP,B-geo
,to,TO,O
,protest,VB,O
,the,DT,O
,war,NN,O
,in,IN,O
,Iraq,NNP,B-geo
,and,CC,O
,demand,VB,O
,the,DT,O
,withdrawal,NN,O
,of,IN,O
,British,JJ,B-gpe
,troops,NNS,O
,from,IN,O
,that,DT,O
,country,NN,O
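
Here the "Sentence #" column is only filled on the first word of each sentence. For illustration, a minimal sketch of how rows like these could be grouped back into per-sentence word/POS/tag lists (the file name ner_dataset.csv is just a placeholder):

import pandas as pd

# placeholder file name; assumes the CSV layout shown above
df = pd.read_csv("ner_dataset.csv")

# "Sentence #" is only present on the first row of each sentence,
# so forward-fill it before grouping
df["Sentence #"] = df["Sentence #"].ffill()

grouped = df.groupby("Sentence #", sort=False)
sentences = grouped["Word"].apply(list).tolist()
pos_labels = grouped["POS"].apply(list).tolist()
tag_labels = grouped["Tag"].apply(list).tolist()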

Two feed-forward layers are used for the POS and NER predictions respectively, but the token representations come from the same BERT model.
Then, the total loss is computed as the average of the losses of the two individual tasks.
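loss_fn is defined outside the model. As a minimal sketch, assuming it is a token-level cross-entropy that ignores padded positions (the actual implementation may differ), it could look like this:

import torch
import torch.nn as nn

def loss_fn(output, target, mask, num_labels):
    # token-level cross-entropy that only scores positions where mask == 1
    lfn = nn.CrossEntropyLoss()
    active_loss = mask.view(-1) == 1
    active_logits = output.view(-1, num_labels)
    # padded positions get ignore_index so they do not contribute to the loss
    active_labels = torch.where(
        active_loss,
        target.view(-1),
        torch.tensor(lfn.ignore_index).type_as(target),
    )
    return lfn(active_logits, active_labels)

Averaging the two losses gives both tasks equal weight; a weighted sum would be the usual knob to turn if one task should matter more than the other.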

So, my question is:
Could the NER accuracy/precision/recall etc. be improved by performing the POS task too?
Many thanks in advance :slight_smile:


It’s a reasonable assumption. I remember a paper from Sebastian Ruder showing that multi-task learners perform better on downstream tasks, so I would expect this to give better results.
You need to experiment to be sure though :wink:


Thanks for your answer @sgugger.
Could you send me the link to that paper?
Yeah, testing is always the best solution, but in my case experimenting with both solutions would take a lot of effort to change the code and do the labeling :sweat_smile:, so I would like to get some evidence first and then decide.

I don’t remember the name of the paper, sorry.