Fine-tuned GPT-2 model performs very poorly on a token classification task

Environment info

  • transformers version: 4.17.0.dev0
  • Platform: Linux-5.4.144+-x86_64-with-Ubuntu-18.04-bionic
  • Python version: 3.7.12
  • PyTorch version (GPU?): 1.10.0+cu111 (True)
  • Tensorflow version (GPU?): 2.7.0 (True)
  • Flax version (CPU?/GPU?/TPU?): not installed (NA)
  • Jax version: not installed
  • JaxLib version: not installed
  • Using GPU in script?:
  • Using distributed or parallel set-up in script?:



Model I am using: gpt2

The problem arises when using:

  • the official example scripts: run_ner.py

The task I am working on is:

  • Token classification on the ncbi_disease dataset

To reproduce

--model_name_or_path gpt2
--dataset_name ncbi_disease
--output_dir /tmp/test-ner
The F1 score after fine-tuning for 3 epochs is only 0.6114, whereas roberta-base achieves an F1 score of 0.8358 on the evaluation dataset.
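For context on what that number measures: the run_ner script reports entity-level F1 (via the seqeval library), where a predicted entity only counts as correct if both its span and its type match the gold annotation exactly. A minimal sketch of that computation, using made-up spans and labels purely for illustration:

```python
# Hypothetical gold and predicted entities, each as (start, end, type).
# These spans are invented for illustration; the real script derives
# them from BIO tag sequences and scores them with seqeval.
gold = {(0, 2, "Disease"), (5, 6, "Disease")}
pred = {(0, 2, "Disease"), (4, 6, "Disease")}  # second span is off by one

tp = len(gold & pred)               # exact span+type matches only
precision = tp / len(pred)
recall = tp / len(gold)
f1 = 2 * precision * recall / (precision + recall)
print(f1)  # one of two predictions is correct -> F1 = 0.5
```

This strictness is worth keeping in mind when comparing scores: a model that gets entity boundaries slightly wrong is penalized as heavily as one that misses the entity entirely.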

Hi @Dhanachandra , thanks for starting this thread.

I would expect GPT models to perform worse than, for example, a RoBERTa model on token classification (aka NER) because of how GPT models are trained. From what I understand, GPT models are causal LMs, meaning they are trained on next-token prediction. RoBERTa, on the other hand, is a masked LM, meaning it is trained to predict masked tokens in a text. The difference is that GPT models don’t have “access” to the tokens on the right, whereas RoBERTa does. That makes GPT models great for text generation, but probably not as good for token classification.
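The difference in "access" comes down to the attention mask each architecture uses. A tiny sketch (plain Python, no library code) of the two mask shapes for a 4-token sequence:

```python
# Attention masks for a 4-token sequence, 1 = "may attend to".
# A causal LM (GPT-style) lets position i attend only to positions <= i;
# a masked LM (RoBERTa-style) lets every position attend everywhere.
n = 4
causal = [[1 if j <= i else 0 for j in range(n)] for i in range(n)]
bidirectional = [[1] * n for _ in range(n)]

for row in causal:
    print(row)
# [1, 0, 0, 0]
# [1, 1, 0, 0]
# [1, 1, 1, 0]
# [1, 1, 1, 1]
```

Reading the causal mask row by row: the first token never sees anything to its right, so its representation (and hence its NER label) is computed with no right context at all, which is exactly what hurts per-token labeling.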

Is there a particular reason you need to use a GPT model for your NER problem, or could you also use other models like RoBERTa?


Hi @marshmellow77, thanks for your quick response. Actually, my work involves finding the best language model suited for multiple tasks like NER, relation extraction, document classification, and text summarization.

Hm, I’m a bit sceptical. These are quite different tasks and should probably be tackled by different models. Is there a particular reason you need one model to rule them all?

For example, encoder-only models (BERT et al) still dominate research and industry on NLU tasks such as text classification, named entity recognition, and question answering, while decoder-only models (GPT et al) are exceptionally good at predicting the next word in a sequence and are thus mostly used for text generation tasks.

If you really want one model to rule them all, I’d suggest looking at encoder-decoder models (e.g. T5, BART, BigBird).

Really appreciate your suggestions.
The reason I am looking for a single model that can handle all the problems mentioned above is that I have to use the model in a different domain (i.e., clinical). I will further pre-train the model with domain-specific data, and since this pre-training process is very expensive, I am looking for a single model for all the problems. Also, I have already worked with BERT, pre-training it on a domain-specific dataset, but the results were not promising.

Will look into the encoder-decoder models that you are suggesting.