Hi,
I am using a TF XLM-R base classifier checkpoint ("jplu/tf-xlm-roberta-base") and the native tf.keras fit() method for fine-tuning. On prediction (model.predict()) I get a logits array of length 166632, even though I only provide 786 data points (sentences) as input. I suspect 166632 is the number of input ids per example (212, the padded sequence length from the AutoTokenizer) multiplied by the dataset length, since 212 × 786 = 166632, but I cannot explain why that would happen. Can someone explain how to derive a prediction per sentence from this model.predict() output?
test_encodings = tokenizer(X_test, truncation=True, padding=True)  # pads every sentence to the longest in X_test (212 tokens)
test_dataset = tf.data.Dataset.from_tensor_slices((
    dict(test_encodings),
    y_test
))
test_dataset
<TensorSliceDataset shapes: ({input_ids: (212,), attention_mask: (212,)}, ()), types: ({input_ids: tf.int32, attention_mask: tf.int32}, tf.int64)>
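One thing I notice: from_tensor_slices gives an unbatched dataset, and I pass it to predict() as-is. Below is a minimal sketch of what I could try instead; the batch size of 16 is an arbitrary choice of mine, not something from my original code.

# Batch the dataset before predict(); I suspect Keras misreads the
# per-example dimensions of an unbatched dataset. Batch size 16 is arbitrary.
batched_test_dataset = test_dataset.batch(16)
out = model.predict(batched_test_dataset)
# With 786 sentences and 4 labels I would expect out.logits.shape == (786, 4).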
I build the train and validation datasets the same way for fine-tuning. When predicting:
out = model.predict(test_dataset)
len(out.logits)
166632
out
TFSequenceClassifierOutput(loss=None, logits=array([[-0.27663636,  0.68009704,  1.0416636 , -0.9192458 ],
       [-0.27665925,  0.68014   ,  1.0416217 , -0.91923165],
       [-0.27644584,  0.6797307 ,  1.0419688 , -0.91936153],
       ...,
       [-0.25672776,  0.64896476,  1.0766468 , -0.92797905],
       [-0.2567277 ,  0.64896476,  1.0766468 , -0.9279789 ],
       [-0.2567277 ,  0.64896476,  1.0766468 , -0.9279789 ]],
      dtype=float32), hidden_states=None, attentions=None)
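For context, this is what I am ultimately trying to do once the logits have the shape I expect. A sketch, assuming out.logits has shape (num_sentences, num_labels):

import numpy as np
import tensorflow as tf

# Assuming logits of shape (num_sentences, num_labels): the argmax over the
# label axis gives one predicted class id per sentence.
pred_label_ids = np.argmax(out.logits, axis=-1)

# Optionally, softmax the logits to get per-class probabilities per sentence.
probs = tf.nn.softmax(out.logits, axis=-1).numpy()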
Thanks