Fine-tune TF-XLM-RoBERTa for token classification

Hello

I would like to fine-tune TF-XLM-RoBERTa for token classification on a custom dataset. Quite a simple task, I know, but I have some problems with it…

I start by defining the tokenizer:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('jplu/tf-xlm-roberta-large')

First, I convert the conll2003 dataset into a tf.data.Dataset. After that, I have the data in the following format:

({'input_ids': array([[    0,  3747,   456, 75161,     7, 30839, 11782,    47, 25299,
      47924,    18, 56101,    21,  6492,     6,     5,     2,     0,
        ...,     0,     0,     0]], dtype=int32),
  'attention_mask': array([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0,
          ..., 0, 0, 0, 0, 0, 0]], dtype=int32)},
 array([[0, 3, 0, 0, 0, 7, 0, 0, 0, 0, 0, 7, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
         ..., 0, 0, 0, 0, 0, 0]], dtype=int32))
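
For reference, a tf.data.Dataset with elements in this shape can be built from padded encodings along these lines (a minimal sketch; encodings and padded_labels are placeholder names for the padded tokenizer output and the per-token label ids):

import tensorflow as tf

# encodings: dict with padded 'input_ids' and 'attention_mask', shape (num_examples, max_len)
# padded_labels: per-token label ids padded to the same length, shape (num_examples, max_len)
tfdataset_train = tf.data.Dataset.from_tensor_slices(
    (dict(encodings), padded_labels)
).batch(1)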

Second, I build the model on top of the pretrained 'jplu/tf-xlm-roberta-large'. I have done this in two slightly different ways. The first one is to use the default TFXLMRobertaForTokenClassification:

from transformers import TFXLMRobertaForTokenClassification

model = TFXLMRobertaForTokenClassification.from_pretrained(
    'jplu/tf-xlm-roberta-large', num_labels=len(labels))
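
Before calling fit, the model also has to be compiled; a minimal sketch, where the optimizer, learning rate, and loss are assumptions rather than the exact settings used:

import tensorflow as tf

model.compile(
    optimizer=tf.keras.optimizers.Adam(learning_rate=3e-5),  # assumed learning rate
    loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
    metrics=['accuracy'],
)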

And train it.

model.fit(tfdataset_train)
14041/14041 [==============================] - 2700s 191ms/step - loss: 0.4062 - accuracy: 0.9713

Finally, I evaluate it:

benchmarks = model.evaluate(tfdataset_test, return_dict=True, batch_size=2)
3453/3453 [==============================] - 127s 36ms/step - loss: 0.4149 - accuracy: 0.9743

Looks like great accuracy, but in fact, when I run the model on an example from the training data, it returns almost the same logits for every token! It predicts the non-entity class for each token (which still gives a high token-level accuracy, since padding and non-entity tokens dominate)!

res = model(next(tfdataset_train.batch(1).as_numpy_iterator())[0]).logits
for i in range(15):
    print(res[0][i])

tf.Tensor(
[ 2.3191597 -2.450158 -4.1209745 -5.032844 -1.676107 -8.229665
-5.121443 -1.2029874 -4.7935658], shape=(9,), dtype=float32)
tf.Tensor(
[ 2.3191764 -2.4501915 -4.120973 -5.032861 -1.6761211 -8.229721
-5.121465 -1.2030123 -4.793587 ], shape=(9,), dtype=float32)
tf.Tensor(
[ 2.3191643 -2.450162 -4.120951 -5.032834 -1.6761119 -8.229675
-5.121448 -1.2029958 -4.7935658], shape=(9,), dtype=float32)
tf.Tensor(
[ 2.3191674 -2.4501798 -4.1209555 -5.032833 -1.67613 -8.229715
-5.1214595 -1.2029996 -4.793585 ], shape=(9,), dtype=float32)
tf.Tensor(
[ 2.319163 -2.4501457 -4.120945 -5.0328226 -1.6761187 -8.229664
-5.1214366 -1.202993 -4.7935667], shape=(9,), dtype=float32)
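
Taking the argmax over these logits makes the collapse explicit (a quick check):

import tensorflow as tf

pred_ids = tf.argmax(res, axis=-1)  # res is the logits tensor from above
print(pred_ids[0][:15])  # prints class 0 for every position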

So, what can cause such behavior? How can I solve this problem?

Hey @Constantin, I think you might be missing a few preprocessing steps for token classification (I'm assuming you're doing something like named entity recognition).

  1. If your input examples have already been split into words, add the is_split_into_words=True argument to the tokenizer.
  2. Align the labels with the tokens: see the tokenize_and_align_labels function in this tutorial (condensed below): https://colab.research.google.com/github/huggingface/notebooks/blob/master/examples/token_classification.ipynb#scrollTo=jmNN-iX1KUHd
  3. I'm not familiar with the TensorFlow API, but you might also need to use the DataCollatorForTokenClassification collator for this task.
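
For point 2, here is a condensed sketch of that alignment step, adapted from the linked notebook (tokenizer is the one defined above; words is one pre-split example and word_labels its per-word label ids):

def tokenize_and_align_labels(words, word_labels):
    tokenized = tokenizer(words, is_split_into_words=True, truncation=True)
    word_ids = tokenized.word_ids()  # maps each subword to its source word; None for special tokens
    aligned, previous = [], None
    for word_id in word_ids:
        if word_id is None:
            aligned.append(-100)  # special tokens and padding are ignored by the loss
        elif word_id != previous:
            aligned.append(word_labels[word_id])  # label only the first subword of each word
        else:
            aligned.append(-100)  # mask the remaining subwords of the same word
        previous = word_id
    tokenized["labels"] = aligned
    return tokenized

Without this step, label i gets paired with subword i rather than word i, so the targets are shifted and the model tends to collapse to predicting the majority class. For point 3, DataCollatorForTokenClassification(tokenizer, return_tensors="tf") can batch and pad features and labels together (the return_tensors argument assumes a recent transformers version).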