Hello
I would like to fine-tune TF XLM-RoBERTa for token classification on a custom dataset. It should be a simple task, I know, but I have run into some problems with it…
I start by defining the tokenizer:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('jplu/tf-xlm-roberta-large')
First, I convert the CoNLL-2003 dataset into a tf.data.Dataset. After that, the dataset has the following format:
({'input_ids': array([[    0,  3747,   456, 75161,     7, 30839, 11782,    47, 25299,
         47924,    18, 56101,    21,  6492,     6,     5,     2,     0,
             0,     0, ...,     0,     0]], dtype=int32),
  'attention_mask': array([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, ...,
        0, 0]], dtype=int32)},
 array([[0, 3, 0, 0, 0, 7, 0, 0, 0, 0, 0, 7, 0, 0, 0, 0, 0, 0, 0, ...,
        0, 0]], dtype=int32))
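Roughly, the conversion looks like the sketch below (simplified; the encode_example helper, the fixed sequence length, and the label handling for padding are illustrative of what I do, not the exact code):

import tensorflow as tf
from datasets import load_dataset

MAX_LEN = 182  # sequence length seen in the example above

def encode_example(tokens, tags):
    # Tokenize pre-split words and pad/truncate to a fixed length
    enc = tokenizer(tokens,
                    is_split_into_words=True,
                    padding='max_length',
                    truncation=True,
                    max_length=MAX_LEN)
    # Align word-level NER tags to sub-word tokens (requires a fast tokenizer);
    # special tokens and padding positions simply get label 0 here
    word_ids = enc.word_ids()
    labels = [0 if w is None else tags[w] for w in word_ids]
    return enc['input_ids'], enc['attention_mask'], labels

conll = load_dataset('conll2003')
features = [encode_example(ex['tokens'], ex['ner_tags'])
            for ex in conll['train']]
input_ids, attention_mask, labels = map(list, zip(*features))

tfdataset_train = tf.data.Dataset.from_tensor_slices(
    ({'input_ids': input_ids, 'attention_mask': attention_mask},
     labels)).batch(1)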
Secondly, I build the model on top of the pretrained 'jplu/tf-xlm-roberta-large'. I have done it in two slightly different ways. The first one is to use the default TFXLMRobertaForTokenClassification:
from transformers import TFXLMRobertaForTokenClassification

model = TFXLMRobertaForTokenClassification.from_pretrained('jplu/tf-xlm-roberta-large', num_labels=len(labels))
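I compile it with a standard setup, roughly like this (the exact optimizer settings and loss may differ slightly):

optimizer = tf.keras.optimizers.Adam(learning_rate=3e-5)
loss = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)
model.compile(optimizer=optimizer, loss=loss, metrics=['accuracy'])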
Then I train it:
model.fit(tfdataset_train)
14041/14041 [==============================] - 2700s 191ms/step - loss: 0.4062 - accuracy: 0.9713
Finally, I evaluate it:
benchmarks = model.evaluate(tfdataset_test, return_dict=True, batch_size=2)
3453/3453 [==============================] - 127s 36ms/step - loss: 0.4149 - accuracy: 0.9743
Looks like great accuracy, but in fact when I run the model on an example from the training data, it returns almost the same logits for every token! It predicts the non-entity class for every token!
res = model(next(tfdataset_train.batch(1).as_numpy_iterator())[0]).logits
for i in range(15):
    print(res[0][i])
tf.Tensor(
[ 2.3191597 -2.450158 -4.1209745 -5.032844 -1.676107 -8.229665
-5.121443 -1.2029874 -4.7935658], shape=(9,), dtype=float32)
tf.Tensor(
[ 2.3191764 -2.4501915 -4.120973 -5.032861 -1.6761211 -8.229721
-5.121465 -1.2030123 -4.793587 ], shape=(9,), dtype=float32)
tf.Tensor(
[ 2.3191643 -2.450162 -4.120951 -5.032834 -1.6761119 -8.229675
-5.121448 -1.2029958 -4.7935658], shape=(9,), dtype=float32)
tf.Tensor(
[ 2.3191674 -2.4501798 -4.1209555 -5.032833 -1.67613 -8.229715
-5.1214595 -1.2029996 -4.793585 ], shape=(9,), dtype=float32)
tf.Tensor(
[ 2.319163 -2.4501457 -4.120945 -5.0328226 -1.6761187 -8.229664
-5.1214366 -1.202993 -4.7935667], shape=(9,), dtype=float32)
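Taking the argmax over these logits confirms it (quick check):

pred_ids = tf.math.argmax(res, axis=-1)
print(pred_ids[0][:15])  # the same class id (the non-entity label) at every position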
So, what could cause such behavior, and how can I solve this problem?