Fine-tuned GPT2 model performs very poorly on token classification task

Environment info

  • transformers version: 4.17.0.dev0
  • Platform: Linux-5.4.144+-x86_64-with-Ubuntu-18.04-bionic
  • Python version: 3.7.12
  • PyTorch version (GPU?): 1.10.0+cu111 (True)
  • Tensorflow version (GPU?): 2.7.0 (True)
  • Flax version (CPU?/GPU?/TPU?): not installed (NA)
  • Jax version: not installed
  • JaxLib version: not installed
  • Using GPU in script?:
  • Using distributed or parallel set-up in script?:

Models:

Information

Model I am using: gpt2

The problem arises when using:

  • the official example scripts: run_ner.py

The task I am working on is:
Token classification on the ncbi_disease dataset

To reproduce

python run_ner.py \
  --model_name_or_path gpt2 \
  --dataset_name ncbi_disease \
  --output_dir /tmp/test-ner \
  --do_train \
  --do_eval
The F1 score after fine-tuning for 3 epochs is only 0.6114, whereas roberta-base achieves an F1 score of 0.8358 on the evaluation dataset.
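For reference, here is a minimal sketch of the equivalent setup via the Python API (not the full run_ner.py pipeline; num_labels=3 is an assumption matching the ncbi_disease tag set O / B-Disease / I-Disease), including the GPT2-specific tokenizer tweaks that token classification needs:

```python
# Minimal sketch: load GPT2 for token classification.
from transformers import AutoTokenizer, AutoModelForTokenClassification

# GPT2's BPE tokenizer needs add_prefix_space=True to handle the
# pre-tokenized (word-split) inputs used by NER datasets, and it has no
# padding token by default, so the EOS token is reused for padding.
tokenizer = AutoTokenizer.from_pretrained("gpt2", add_prefix_space=True)
tokenizer.pad_token = tokenizer.eos_token

# num_labels=3 assumes the ncbi_disease tag set (O, B-Disease, I-Disease).
model = AutoModelForTokenClassification.from_pretrained("gpt2", num_labels=3)
model.config.pad_token_id = tokenizer.pad_token_id
```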

Hi @Dhanachandra, thanks for starting this thread.

I would expect GPT models to perform worse than, for example, a RoBERTa model on token classification (aka NER) because of how GPT models are trained. From what I understand, GPT models are causal LMs, meaning they are trained via next-token prediction. RoBERTa, on the other hand, is a masked LM, meaning it is trained to predict masked tokens in a text. The difference is that GPT models don’t have “access” to the tokens on the right, whereas RoBERTa does. That makes GPT models great for text generation, but probably not as good for token classification.
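To make that difference concrete, here is a small illustrative sketch using the pipeline API (the example sentences are made up):

```python
# Illustrative sketch: RoBERTa is a masked LM, GPT2 is a causal LM.
from transformers import pipeline

# RoBERTa fills in a masked token using context on both sides of the gap.
fill = pipeline("fill-mask", model="roberta-base")
print(fill("The patient was diagnosed with <mask> disease.")[0]["sequence"])

# GPT2 only predicts what comes next, i.e. it sees the left context only.
generator = pipeline("text-generation", model="gpt2")
print(generator("The patient was diagnosed with", max_new_tokens=5)[0]["generated_text"])
```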

Is there a particular reason you need to use GPT model for your NER problem, or could you also use other models like Roberta?

Cheers
Heiko

Hi @marshmellow77, thanks for your quick response. Actually, my work involves finding the best language model suited for multiple tasks such as NER, relation extraction, document classification, and text summarization.

Hm, I’m a bit sceptical. These are quite different tasks and should probably be tackled by different models. Is there a particular reason you need one model to rule them all?

For example, encoder-only models (BERT et al) still dominate research and industry on NLU tasks such as text classification, named entity recognition, and question answering, while decoder-only models (GPT et al) are exceptionally good at predicting the next word in a sequence and are thus mostly used for text generation tasks.

If you really want one model to rule them all, I’d suggest looking at encoder-decoder models (e.g. T5, BART, BigBird).
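As a rough illustration of why these models are attractive for a “one model, many tasks” setup, here is a hedged sketch of T5’s text-to-text framing (the task prefix shown is one of T5’s built-in pretraining tasks; the clinical NER prefix mentioned in the comment is hypothetical):

```python
# Sketch of T5's text-to-text framing: one model, many tasks via task prefixes.
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

tokenizer = AutoTokenizer.from_pretrained("t5-small")
model = AutoModelForSeq2SeqLM.from_pretrained("t5-small")

# "translate English to German:" is a built-in T5 pretraining prefix; a
# fine-tuned clinical model could use a custom prefix such as
# "extract diseases: ..." (hypothetical) for a generative take on NER.
inputs = tokenizer("translate English to German: The patient has diabetes.",
                   return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```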

Really appreciate your suggestions.
The reason I am looking for a single model that can handle all the problems mentioned above is that I have to use the model in a different domain (i.e. the clinical domain), and I will further pre-train the model with domain-specific data. Since this pre-training process is very expensive, I am looking for a single model for all the problems. Also, I have already worked with BERT by pre-training it on a domain-specific dataset, but the results were not promising.

Will look into the encoder-decoder models that you are suggesting.