Hello
I would like to fine-tune TF XLM-RoBERTa for token classification on a custom dataset. It should be a simple task, I know, but I have run into some problems with it…
I start by defining the tokenizer:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('jplu/tf-xlm-roberta-large')
First, I convert the CoNLL-2003 dataset into a tf.data.Dataset. After that, the dataset has the following format:
({'input_ids': array([[    0,  3747,   456, 75161,     7, 30839, 11782,    47, 25299,
         47924,    18, 56101,    21,  6492,     6,     5,     2,     0,
             0,     0, ...,     0,     0]], dtype=int32),
  'attention_mask': array([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, ...,
        0, 0]], dtype=int32)},
 array([[0, 3, 0, 0, 0, 7, 0, 0, 0, 0, 0, 7, 0, 0, 0, 0, 0, 0, 0, ...,
        0, 0]], dtype=int32))
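Roughly, the conversion looks like the sketch below (simplified; the encode_example helper, the fixed sequence length, and the label handling for padding are illustrative of what I do, not the exact code):

import tensorflow as tf
from datasets import load_dataset

MAX_LEN = 182  # sequence length seen in the example above

def encode_example(tokens, tags):
    # Tokenize pre-split words and pad/truncate to a fixed length
    enc = tokenizer(tokens,
                    is_split_into_words=True,
                    padding='max_length',
                    truncation=True,
                    max_length=MAX_LEN)
    # Align word-level NER tags to sub-word tokens (requires a fast tokenizer);
    # special tokens and padding positions simply get label 0 here
    word_ids = enc.word_ids()
    labels = [0 if w is None else tags[w] for w in word_ids]
    return enc['input_ids'], enc['attention_mask'], labels

conll = load_dataset('conll2003')
features = [encode_example(ex['tokens'], ex['ner_tags'])
            for ex in conll['train']]
input_ids, attention_mask, labels = map(list, zip(*features))

tfdataset_train = tf.data.Dataset.from_tensor_slices(
    ({'input_ids': input_ids, 'attention_mask': attention_mask},
     labels)).batch(1)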
Secondly, I build the model on top of the pretrained 'jplu/tf-xlm-roberta-large'. I have done it in two slightly different ways. The first one is to use the default TFXLMRobertaForTokenClassification:
from transformers import TFXLMRobertaForTokenClassification

model = TFXLMRobertaForTokenClassification.from_pretrained('jplu/tf-xlm-roberta-large', num_labels=len(labels))
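I compile it with a standard setup, roughly like this (the exact optimizer settings and loss may differ slightly):

optimizer = tf.keras.optimizers.Adam(learning_rate=3e-5)
loss = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)
model.compile(optimizer=optimizer, loss=loss, metrics=['accuracy'])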
Then I train it:
model.fit(tfdataset_train)
14041/14041 [==============================] - 2700s 191ms/step - loss: 0.4062 - accuracy: 0.9713
Finally, I evaluate it:
benchmarks = model.evaluate(tfdataset_test, return_dict=True, batch_size=2)
3453/3453 [==============================] - 127s 36ms/step - loss: 0.4149 - accuracy: 0.9743
Looks like great accuracy, but in fact when I run the model on an example from the training data, it returns almost the same logits for every token! It predicts the non-entity class for every token!
res = model(next(tfdataset_train.batch(1).as_numpy_iterator())[0]).logits
for i in range(15):
    print(res[0][i])
tf.Tensor(
[ 2.3191597 -2.450158 -4.1209745 -5.032844 -1.676107 -8.229665
-5.121443 -1.2029874 -4.7935658], shape=(9,), dtype=float32)
tf.Tensor(
[ 2.3191764 -2.4501915 -4.120973 -5.032861 -1.6761211 -8.229721
-5.121465 -1.2030123 -4.793587 ], shape=(9,), dtype=float32)
tf.Tensor(
[ 2.3191643 -2.450162 -4.120951 -5.032834 -1.6761119 -8.229675
-5.121448 -1.2029958 -4.7935658], shape=(9,), dtype=float32)
tf.Tensor(
[ 2.3191674 -2.4501798 -4.1209555 -5.032833 -1.67613 -8.229715
-5.1214595 -1.2029996 -4.793585 ], shape=(9,), dtype=float32)
tf.Tensor(
[ 2.319163 -2.4501457 -4.120945 -5.0328226 -1.6761187 -8.229664
-5.1214366 -1.202993 -4.7935667], shape=(9,), dtype=float32)
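Taking the argmax over these logits confirms it (quick check):

pred_ids = tf.math.argmax(res, axis=-1)
print(pred_ids[0][:15])  # the same class id (the non-entity label) at every position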
So, what could cause such behavior, and how can I solve this problem?