Hi,
I am using a TF XLM-R base classifier checkpoint ("jplu/tf-xlm-roberta-base") and the native tf.keras fit() method for fine-tuning. On prediction (model.predict()) I get a logits array of length 166632, even though I only provide 786 data points (sentences) as input. I suspect 166632 is the number of input ids per example (212, the padded sequence length from the AutoTokenizer) multiplied by the dataset length, since 212 × 786 = 166632, but I cannot explain why that would happen. Can someone explain how to derive a prediction per sentence from this model.predict() output?
test_encodings = tokenizer(X_test, truncation=True, padding=True)  # pads every sentence to the longest in X_test (212 tokens)
test_dataset = tf.data.Dataset.from_tensor_slices((
    dict(test_encodings),
    y_test
))
test_dataset
<TensorSliceDataset shapes: ({input_ids: (212,), attention_mask: (212,)}, ()), types: ({input_ids: tf.int32, attention_mask: tf.int32}, tf.int64)>
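One thing I notice: from_tensor_slices gives an unbatched dataset, and I pass it to predict() as-is. Below is a minimal sketch of what I could try instead; the batch size of 16 is an arbitrary choice of mine, not something from my original code.

# Batch the dataset before predict(); I suspect Keras misreads the
# per-example dimensions of an unbatched dataset. Batch size 16 is arbitrary.
batched_test_dataset = test_dataset.batch(16)
out = model.predict(batched_test_dataset)
# With 786 sentences and 4 labels I would expect out.logits.shape == (786, 4).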
I build the train and validation datasets the same way for fine-tuning. When predicting:
out = model.predict(test_dataset)
len(out.logits)
166632
out
TFSequenceClassifierOutput(loss=None, logits=array([[-0.27663636,  0.68009704,  1.0416636 , -0.9192458 ],
       [-0.27665925,  0.68014   ,  1.0416217 , -0.91923165],
       [-0.27644584,  0.6797307 ,  1.0419688 , -0.91936153],
       ...,
       [-0.25672776,  0.64896476,  1.0766468 , -0.92797905],
       [-0.2567277 ,  0.64896476,  1.0766468 , -0.9279789 ],
       [-0.2567277 ,  0.64896476,  1.0766468 , -0.9279789 ]],
      dtype=float32), hidden_states=None, attentions=None)
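For context, this is what I am ultimately trying to do once the logits have the shape I expect. A sketch, assuming out.logits has shape (num_sentences, num_labels):

import numpy as np
import tensorflow as tf

# Assuming logits of shape (num_sentences, num_labels): the argmax over the
# label axis gives one predicted class id per sentence.
pred_label_ids = np.argmax(out.logits, axis=-1)

# Optionally, softmax the logits to get per-class probabilities per sentence.
probs = tf.nn.softmax(out.logits, axis=-1).numpy()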
Thanks