Bert ner classifier

yucheng · April 29, 2021, 9:02am

hi,
I fine-tuned BERT on an NER task, and Hugging Face adds a linear classifier on top of the model. I want to know more details about the classifier architecture, e.g. fully connected + softmax…

thank you for your help

sbmaruf · April 29, 2021, 3:47pm

Hi! Can you be a little bit more specific about your query?

Just to give you a head start,

In general, NER is a sequence labeling (a.k.a token classification) problem.
The additional thing to consider for NER is that when a word is split into multiple tokens by a BPE or SentencePiece-like model, you use the first token as the reference token that you want to predict. Since all the tokens are connected via self-attention, you won’t have a problem from not predicting the remaining sub-word tokens of a word. In PyTorch, you can skip computing the loss for those tokens (see the ignore_index argument) by giving them -100 as a label (life is so easy with PyTorch).
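The first-token labeling and the -100 masking described above can be sketched roughly as follows. This is only an illustration: the helper align_labels and the toy word_ids list are hypothetical names, and it assumes you already have, for each sub-token, the index of the word it came from (None for special tokens).

```python
import torch
import torch.nn.functional as F

IGNORE_INDEX = -100  # default ignore_index of PyTorch's cross-entropy loss


def align_labels(word_labels, word_ids):
    """Label only the first sub-token of each word; mask the rest with -100.

    word_ids maps each sub-token to the index of its source word, with None
    for special tokens such as [CLS] and [SEP]. (Hypothetical helper.)
    """
    aligned, previous = [], None
    for word_id in word_ids:
        if word_id is None or word_id == previous:
            # Special token, or a continuation piece of the previous word.
            aligned.append(IGNORE_INDEX)
        else:
            # First sub-token of a new word carries the word's label.
            aligned.append(word_labels[word_id])
        previous = word_id
    return aligned


# Toy example: 3 words, the second one split into 2 sub-tokens.
word_labels = [0, 1, 0]
word_ids = [None, 0, 1, 1, 2, None]
labels = torch.tensor(align_labels(word_labels, word_ids))

torch.manual_seed(0)
logits = torch.randn(len(word_ids), 2)  # (sequence_length, num_labels)

# Positions labeled -100 contribute nothing to the loss.
loss = F.cross_entropy(logits, labels, ignore_index=IGNORE_INDEX)
```

Only the three first-sub-token positions take part in the loss here; the special tokens and the continuation piece are skipped.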

Apart from that, I didn’t find any additional complexity in training an NER model.

Some other implementation details you need to check:

One important note: as far as I remember (please verify), the CoNLL German or Dutch dataset has 2-3 long sentences in the test set. Sequence labeling doesn’t work like sentiment analysis: every token needs a prediction, so you need to make sure your sentence is not truncated by the max_sequence_len argument of the language model’s tokenizer. Otherwise, you will see a small discrepancy in your test F1 score. An easy hack for this problem is to divide the sentence into smaller parts, predict them one by one, and finally merge the predictions.
IMO, self-attention and a CRF layer are theoretically different, but in practice some of the problems that a CRF solved in earlier models can also be solved by self-attention (because it creates a fully connected graph over the tokens). So using softmax is preferable to a CRF layer.
The scores that the original BERT paper reported are not reproducible and not comparable with most papers, since they used document-level NER fine-tuning.
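The split-and-merge hack for over-long sentences could look roughly like this. It is a minimal sketch: predict_long_sentence and dummy_predict are hypothetical names, the predictor is a stand-in for a real model, and cutting at fixed positions may split an entity across two chunks, which a real implementation might want to handle (for example with overlapping windows).

```python
def predict_long_sentence(tokens, predict_fn, max_len):
    """Split tokens into pieces of at most max_len, label each, then merge.

    predict_fn is assumed to return exactly one label per input token.
    """
    labels = []
    for start in range(0, len(tokens), max_len):
        chunk = tokens[start:start + max_len]
        chunk_labels = predict_fn(chunk)
        if len(chunk_labels) != len(chunk):
            raise ValueError("predict_fn must return one label per token")
        labels.extend(chunk_labels)
    return labels


def dummy_predict(chunk):
    # Stand-in for a real model: tags capitalized tokens as entities.
    return ["B-ENT" if tok[:1].isupper() else "O" for tok in chunk]


tokens = "Angela Merkel visited Paris last week".split()
labels = predict_long_sentence(tokens, dummy_predict, max_len=4)
```

The merged output has one label per original token, so the evaluation script sees the full sentence rather than a truncated one.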

If you still have questions about the architecture, you can follow this:

You only have to replace the hierarchical RNN with a transformer as the encoder.

You can check the following papers for more info:

Beto, Bentz, Becas: The Surprising Cross-Lingual Effectiveness of BERT - ACL Anthology
https://arxiv.org/pdf/2007.07683.pdf
https://arxiv.org/pdf/2004.12440.pdf
https://arxiv.org/pdf/2004.13240.pdf (beware, this is my publication. I may be biased, still not accepted)

Please let me know if you have more queries.

yucheng · April 30, 2021, 2:57am

Thank you very much for your explanation. I learned a lot from it.

I printed my model and found that it has a classifier. I want to know what its architecture is.

sbmaruf · April 30, 2021, 3:14am

As you can see, the classifier is a single dense layer.
If you are using BertForSequenceClassification, it is probably defined here: transformers/modeling_bert.py at b29eb247d39b56d903ea36c4f6c272a7bb0c0b4c · huggingface/transformers · GitHub

If you are using BertForTokenClassification, it is defined here:

https://github.com/huggingface/transformers/blob/b29eb247d39b56d903ea36c4f6c272a7bb0c0b4c/src/transformers/models/bert/modeling_bert.py#L1648


class BertForTokenClassification(BertPreTrainedModel):

    _keys_to_ignore_on_load_unexpected = [r"pooler"]

    def __init__(self, config):
        super().__init__(config)
        self.num_labels = config.num_labels

        self.bert = BertModel(config, add_pooling_layer=False)
        self.dropout = nn.Dropout(config.hidden_dropout_prob)
        self.classifier = nn.Linear(config.hidden_size, config.num_labels)

        self.init_weights()

    @add_start_docstrings_to_model_forward(BERT_INPUTS_DOCSTRING.format("batch_size, sequence_length"))
    @add_code_sample_docstrings(
        tokenizer_class=_TOKENIZER_FOR_DOC,
        checkpoint=_CHECKPOINT_FOR_DOC,
        output_type=TokenClassifierOutput,
        config_class=_CONFIG_FOR_DOC,
    )
    # (the excerpt ends here; the forward method follows in the source file)

To set the number of labels, change num_labels in your config.

yucheng · May 3, 2021, 7:21am

Hi,
Can I think of self.classifier = nn.Linear(config.hidden_size, config.num_labels) as a fully connected layer?
The input dimension is config.hidden_size and the output dimension is config.num_labels, as shown.

sbmaruf · May 3, 2021, 11:54am

Yes, this is just a linear layer.
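To make the shapes concrete, here is a standalone sketch of such a head in plain PyTorch. TokenClassificationHead is a hypothetical class that mirrors the dropout and classifier attributes from the excerpt above, and the sizes 768 and 9 are assumed values for illustration only. Note that there is no softmax module in the head itself: the linear layer outputs raw logits, and the softmax is applied implicitly by the cross-entropy loss during training or explicitly at inference if you want probabilities.

```python
import torch
from torch import nn


class TokenClassificationHead(nn.Module):
    """Dropout followed by one linear layer, applied to every token."""

    def __init__(self, hidden_size, num_labels, dropout_prob=0.1):
        super().__init__()
        self.dropout = nn.Dropout(dropout_prob)
        self.classifier = nn.Linear(hidden_size, num_labels)

    def forward(self, hidden_states):
        # hidden_states: (batch_size, sequence_length, hidden_size)
        return self.classifier(self.dropout(hidden_states))


# Assumed sizes: hidden_size=768, num_labels=9.
head = TokenClassificationHead(hidden_size=768, num_labels=9).eval()

hidden_states = torch.randn(2, 16, 768)  # stand-in for the encoder output
logits = head(hidden_states)             # (2, 16, 9): one score per label per token
probs = logits.softmax(dim=-1)           # optional: per-token label probabilities
predictions = logits.argmax(dim=-1)      # (2, 16): predicted label id per token
```

nn.Linear acts on the last dimension, so the same weight matrix is shared across all token positions.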
