BERT NER classifier

I fine-tuned BERT on an NER task, and Hugging Face adds a linear classifier on top of the model. I would like to know more details about the classifier architecture, e.g. fully connected + softmax…

Thank you for your help.

Hi! Can you be a little bit more specific about your query?

Just to give you a head start,

In general, NER is a sequence labeling (a.k.a. token classification) problem.
The additional thing you have to consider for NER is that, for a word split into multiple sub-tokens by a BPE- or SentencePiece-style tokenizer, you use the first sub-token as the reference token whose label you predict. Since all the tokens are connected via self-attention, skipping prediction for the remaining sub-tokens of a word is not a problem. In PyTorch you can exclude those tokens from the loss (see the ignore_index argument of CrossEntropyLoss) by giving them the label -100 (life is so easy with PyTorch :wink: ).
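As a sketch of that label-alignment trick (a hypothetical helper; it assumes you have a word_ids-style mapping from sub-tokens to words, as produced by Hugging Face fast tokenizers):

```python
def align_labels(word_ids, word_labels, ignore_index=-100):
    """Give each first sub-token its word's label; continuation sub-tokens
    and special tokens (word id None) get ignore_index so the loss skips them."""
    aligned = []
    prev = None
    for wid in word_ids:
        if wid is None:            # special token such as [CLS] or [SEP]
            aligned.append(ignore_index)
        elif wid != prev:          # first sub-token of a word: keep the label
            aligned.append(word_labels[wid])
        else:                      # continuation sub-token (e.g. "##piece"): ignore
            aligned.append(ignore_index)
        prev = wid
    return aligned

# Toy example: 4 words, where word 3 is split into two sub-tokens,
# with special tokens at both ends (hypothetical label ids).
word_ids = [None, 0, 1, 2, 3, 3, None]
word_labels = [1, 2, 0, 0]
print(align_labels(word_ids, word_labels))
# [-100, 1, 2, 0, 0, -100, -100]
```

With these labels, CrossEntropyLoss(ignore_index=-100) computes the loss only over the first sub-token of each word.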

Apart from that, I didn't find any additional complexity in training an NER model.

Some other implementation details you need to check:

  1. One important note: as far as I remember (please verify), in the CoNLL German and Dutch datasets there are 2-3 very long sentences in the test set. Sequence labeling doesn't work like sentiment analysis: you need to make sure your sentence is not truncated by the max_sequence_len argument of the language model's tokenizer. Otherwise you will see a small discrepancy in your test F1 score. An easy hack for this problem is to divide the sentence into smaller parts, predict them one by one, and finally merge the predictions.
  2. IMO self-attention and a CRF layer are theoretically different, but in practice some of the problems that a CRF solved in earlier models can also be handled by self-attention (because both create a fully connected graph over the tokens). So a plain softmax is preferable to a CRF layer.
  3. The scores reported in the original BERT paper are not reproducible and not comparable with most papers, since they used document-level context for NER fine-tuning.
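The split-and-merge hack from point 1 can be sketched like this (predict_fn stands in for an arbitrary model call; the function name is hypothetical):

```python
def predict_long_sentence(tokens, predict_fn, max_len=512):
    """Split an over-long token list into chunks no longer than max_len,
    predict each chunk separately, and merge the per-token predictions."""
    preds = []
    for start in range(0, len(tokens), max_len):
        chunk = tokens[start:start + max_len]
        preds.extend(predict_fn(chunk))
    return preds

# Toy predict_fn that tags every token "O": 1000 tokens in, 1000 tags out.
tags = predict_long_sentence(["w%d" % i for i in range(1000)],
                             lambda chunk: ["O"] * len(chunk), max_len=512)
print(len(tags))  # 1000
```

Note the trade-off: tokens near a chunk boundary lose attention context from the other chunk, which is why this is a hack rather than the preferred solution.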

If you still have questions about the architecture you can follow this,

you only have to replace the hierarchical RNN with a Transformer as the encoder.

You can check the following papers for more info:

  1. Beto, Bentz, Becas: The Surprising Cross-Lingual Effectiveness of BERT - ACL Anthology
  2. (beware, this is my publication. I may be biased; it is still not accepted)

Please let me know if you have more queries.

Thank you very much for your explanation. I learned a lot from it.

I printed my model, and I see that it has a classifier. I want to know what its architecture is.

As you can see, the classifier is a single dense layer.
If you are using BertForSequenceClassification, it is probably defined here: transformers/ at b29eb247d39b56d903ea36c4f6c272a7bb0c0b4c · huggingface/transformers · GitHub

If you are using BertForTokenClassification, it is defined here,

For setting the number of labels, change the num_labels variable in your config.

Can I think of self.classifier = nn.Linear(config.hidden_size, config.num_labels) as a fully-connected layer, where the input dimension is config.hidden_size and the output dimension is config.num_labels, as shown?

Yes, this is just a linear layer.