Thanks!
But when I do

encoding = tokenizer([prompt, prompt, prompt], [choice0, choice1, choice2], return_tensors='tf', padding=True)

the encoding looks like the following:
{'input_ids': <tf.Tensor: shape=(3, 23), dtype=int32, numpy=
array([[ 101, 5138, 1998, 4638, 16143, 1997, 5653, 2013, 2312,
3872, 5653, 2545, 1010, 18092, 2015, 1010, 1998, 16728,
1012, 102, 2051, 2968, 102],
[ 101, 5138, 1998, 4638, 16143, 1997, 5653, 2013, 2312,
3872, 5653, 2545, 1010, 18092, 2015, 1010, 1998, 16728,
1012, 102, 3015, 102, 0],
[ 101, 5138, 1998, 4638, 16143, 1997, 5653, 2013, 2312,
3872, 5653, 2545, 1010, 18092, 2015, 1010, 1998, 16728,
1012, 102, 3752, 26683, 102]], dtype=int32)>, 'token_type_ids': <tf.Tensor: shape=(3, 23), dtype=int32, numpy=
array([[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1,
1],
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1,
0],
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1,
1]], dtype=int32)>, 'attention_mask': <tf.Tensor: shape=(3, 23), dtype=int32, numpy=
array([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1],
[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
0],
[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1]], dtype=int32)>}
which, as far as I understand, is encoded as 3 (prompt, choice) pairs of texts and not as one question with 3 choices. Namely, wouldn’t I want the encoding to look something like
[101, 5138, ..., 102, 2051, 2968, ..., 102, 3015, ..., 102, 3752, ..., 102]
In other words, if I want to fine-tune TFBertForMultipleChoice, don’t I need to encode the prompt and choices as prompt choice0 choice1 choice2?
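To make concrete the layout I have in mind, here is a toy sketch in plain Python (no transformers needed). The token ids are stand-ins I made up for illustration; only 101 ([CLS]) and 102 ([SEP]) correspond to BERT's actual special tokens:

```python
# Single-sequence layout I'm imagining:
# [CLS] prompt [SEP] choice0 [SEP] choice1 [SEP] choice2 [SEP]
CLS, SEP = 101, 102          # BERT's [CLS] and [SEP] token ids
prompt_ids = [5138, 1998, 4638]   # made-up stand-ins for the prompt tokens
choice_ids = [
    [2051, 2968],   # stand-in ids for choice0
    [3015],         # stand-in ids for choice1
    [3752, 26683],  # stand-in ids for choice2
]

# Concatenate everything into one sequence instead of 3 separate pairs
single_sequence = [CLS] + prompt_ids + [SEP]
for ids in choice_ids:
    single_sequence += ids + [SEP]

print(single_sequence)
```

That is, one flat input row containing the prompt followed by all three choices, rather than three separate rows.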
Thanks,
Ayala