Getting outputs of model.predict() per sentence input

Hi,
I am using a TF XLM-R base classifier checkpoint ("jplu/tf-xlm-roberta-base") and the native TF Keras training method. On prediction (model.predict()) I get an output logits array of length 166632, even though I am providing an input of only 786 data points (sentences). I think 166632 is the product of the number of input IDs per example from tokenization with AutoTokenizer (212) and the input dataset length (786 × 212 = 166632), but I'm not sure how to explain that. Can someone explain how to derive a prediction result per sentence from this model.predict() output?

test_encodings = tokenizer(X_test, truncation=True, padding=True)

test_dataset = tf.data.Dataset.from_tensor_slices((
    dict(test_encodings),
    y_test
))
test_dataset

<TensorSliceDataset shapes: ({input_ids: (212,), attention_mask: (212,)}, ()), types: ({input_ids: tf.int32, attention_mask: tf.int32}, tf.int64)>
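
For completeness, a sketch of how the tokenizer and model were set up (the exact from_pretrained arguments here are my assumption and are not shown above; num_labels=4 matches the four logit columns in the output below):

from transformers import AutoTokenizer, TFAutoModelForSequenceClassification

# Assumed setup: same checkpoint for tokenizer and classifier head with 4 labels
tokenizer = AutoTokenizer.from_pretrained("jplu/tf-xlm-roberta-base")
model = TFAutoModelForSequenceClassification.from_pretrained("jplu/tf-xlm-roberta-base", num_labels=4)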

I form the train and validation datasets similarly for fine-tuning. When predicting:

out = model.predict(test_dataset)
len(out.logits)
166632
out
TFSequenceClassifierOutput(loss=None, logits=array([[-0.27663636,  0.68009704,  1.0416636 , -0.9192458 ],
       [-0.27665925,  0.68014   ,  1.0416217 , -0.91923165],
       [-0.27644584,  0.6797307 ,  1.0419688 , -0.91936153],
       ...,
       [-0.25672776,  0.64896476,  1.0766468 , -0.92797905],
       [-0.2567277 ,  0.64896476,  1.0766468 , -0.9279789 ],
       [-0.2567277 ,  0.64896476,  1.0766468 , -0.9279789 ]],
      dtype=float32), hidden_states=None, attentions=None)

Thanks

Hey @vinurad13, what shape does test_encodings["input_ids"] have?

Without seeing the details behind X_test, my guess is that you need to reshape your inputs so that input_ids has shape (batch_size, max_seq_length).

Hi, thanks for replying. This is the shape (786 is the test dataset size):

np.array(test_encodings['input_ids']).shape
(786, 212)
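
As a quick arithmetic check, the product of these two dimensions is exactly the logits length I got:

print(786 * 212)  # 166632 -- the length of out.logits above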

I obtain X_test and y_test from a CSV read and a train-test split, and tokenize them as above. How can I reshape the input_ids in the tokenization step? It also warns:

Asking to truncate to max_length but no maximum length is provided and the model has no predefined maximum length. Default to no truncation.
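
A sketch of one way to make the truncation explicit and silence this warning (the max_length=512 value is my assumption, based on XLM-R's usual position limit; it is not in my original call):

# Assumed fix: pass max_length explicitly since the checkpoint defines no model_max_length
test_encodings = tokenizer(X_test, truncation=True, padding=True, max_length=512)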

Furthermore, I used the Trainer API before trying the native Keras methods, with the same inputs and input shapes. trainer.evaluate() and trainer.predict() gave proper outputs whose lengths match the input test dataset.
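
For reference, a minimal sketch of that Trainer-based run (the TFTrainer/TFTrainingArguments setup here is an assumption; only the evaluate/predict calls are from my actual run):

from transformers import TFTrainer, TFTrainingArguments

training_args = TFTrainingArguments(output_dir="./results", per_device_eval_batch_size=16)
trainer = TFTrainer(model=model, args=training_args)
preds = trainer.predict(test_dataset)  # returns one row of logits per sentence
print(preds.predictions.shape)         # e.g. (786, 4)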

Hi,

I found that it gives the correct output length if I feed the input like below:

out = model.predict(test_dataset.shuffle(500).batch(16))

where the buffer_size and batch_size can be changed. It seems that the dataset object should always be batched like this before being fed to predict() (not passed directly).
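
To then get one predicted class per sentence, a short sketch (np.argmax over the four logit columns; note that shuffling isn't needed at inference time, and skipping it keeps the predictions aligned with y_test):

import numpy as np

out = model.predict(test_dataset.batch(16))   # batch, but don't shuffle, so row order matches y_test
pred_labels = np.argmax(out.logits, axis=-1)  # shape (786,): one class id per sentence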
