ValueError: too many values to unpack (expected 2) when using BertTokenizer

Hi everyone,
I get an error when using BertTokenizer.
I do
encoding = tokenizer([[prompt, prompt, prompt], [choice0, choice1, choice2]], return_tensors='tf', padding=True)
and get
ValueError: too many values to unpack (expected 2).
When I do
encoding = tokenizer([[prompt, prompt], [choice0, choice1]], return_tensors='tf', padding=True)
it works. Any idea why? I want to fine-tune TFBertForMultipleChoice so that each question (prompt) has three choices instead of two as in the documentation (BERT — transformers 4.7.0 documentation).

Below is the complete code

import os
import numpy as np
import pandas as pd
import tensorflow as tf

from transformers import BertTokenizer, TFBertForMultipleChoice


tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = TFBertForMultipleChoice.from_pretrained('bert-base-uncased')

prompt = "Accept and check containers of mail from large volume mailers, couriers, and contractors."
choice0 = "Time Management"
choice1 = "Writing"
choice2 = "Reading Comprehension"

encoding = tokenizer([[prompt, prompt, prompt], [choice0, choice1, choice2]], return_tensors='tf', padding=True)
inputs = {k: tf.expand_dims(v, 0) for k, v in encoding.items()}
outputs = model(inputs)  # batch size is 1

logits = outputs.logits

Thanks!
Ayala

Oh, the example has one pair of [] that is not necessary, will fix. It should be:

encoding = tokenizer([prompt, prompt, prompt], [choice0, choice1, choice2], return_tensors='tf', padding=True)

Thanks!
But when I do

encoding = tokenizer([prompt, prompt, prompt], [choice0, choice1, choice2], return_tensors='tf', padding=True)

The encoding looks like the following:

{'input_ids': <tf.Tensor: shape=(3, 23), dtype=int32, numpy=
array([[  101,  5138,  1998,  4638, 16143,  1997,  5653,  2013,  2312,
         3872,  5653,  2545,  1010, 18092,  2015,  1010,  1998, 16728,
         1012,   102,  2051,  2968,   102],
       [  101,  5138,  1998,  4638, 16143,  1997,  5653,  2013,  2312,
         3872,  5653,  2545,  1010, 18092,  2015,  1010,  1998, 16728,
         1012,   102,  3015,   102,     0],
       [  101,  5138,  1998,  4638, 16143,  1997,  5653,  2013,  2312,
         3872,  5653,  2545,  1010, 18092,  2015,  1010,  1998, 16728,
         1012,   102,  3752, 26683,   102]], dtype=int32)>, 'token_type_ids': <tf.Tensor: shape=(3, 23), dtype=int32, numpy=
array([[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1,
        1],
       [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1,
        0],
       [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1,
        1]], dtype=int32)>, 'attention_mask': <tf.Tensor: shape=(3, 23), dtype=int32, numpy=
array([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
        1],
       [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
        0],
       [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
        1]], dtype=int32)>}

which, as far as I understand, is encoded as 3 pairs of texts and not as one question with 3 choices. Namely, wouldn’t I want the encoding to look something like
[101, 5138, ..., 102, 2051, 2968..., 102, 3015, ..., 102, 3752..., 102]
In other words, if I want to fine-tune TFBertForMultipleChoice, don’t I need to encode the prompt and choices as prompt choice0 choice1 choice2?

Thanks,
Ayala

No, you need tensors of shape (1, num_choices, seq_length), which is what the code sample will give you with the second line that does the expand_dims.
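
For reference, here is a minimal sketch of the same code with the shapes spelled out in comments (the choices list just reuses your choice0/choice1/choice2):

import tensorflow as tf
from transformers import BertTokenizer, TFBertForMultipleChoice

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = TFBertForMultipleChoice.from_pretrained('bert-base-uncased')

prompt = "Accept and check containers of mail from large volume mailers, couriers, and contractors."
choices = ["Time Management", "Writing", "Reading Comprehension"]

# Pair the prompt with each choice -> every tensor has shape (3, seq_length)
encoding = tokenizer([prompt] * len(choices), choices, return_tensors='tf', padding=True)

# Add a batch dimension -> every tensor has shape (1, 3, seq_length)
inputs = {k: tf.expand_dims(v, 0) for k, v in encoding.items()}

outputs = model(inputs)
print(outputs.logits.shape)  # (1, 3): one logit per choice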


Thank you very much, I was about to write the same!
I have one more question, hope that’s OK. How can I tokenize more than one prompt and its choices? Namely, I have a batch of prompts, and each one has its three choices. I can’t figure out how to tokenize it. Should I use

encoding = tokenizer([prompt, prompt, prompt], [choice0, choice1, choice2], return_tensors='tf', padding=True)
inputs = {k: tf.expand_dims(v, 0) for k, v in encoding.items()}

for each prompt and its choices, one at a time? Or is there a way to tokenize them in one batch?

Thanks again,
Ayala

You should look at the multiple choice examples (PyTorch only, but the processing is the same) or the notebook that goes with it.
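
In case it is useful, here is a rough sketch of the flatten-then-unflatten approach those examples use, adapted to TF tensors (the prompts, choices and num_choices=3 below are just placeholders):

import tensorflow as tf
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

prompts = ["first prompt ...", "second prompt ..."]    # batch of prompts
choices = [["a0", "a1", "a2"], ["b0", "b1", "b2"]]     # three choices per prompt
num_choices = 3

# Flatten: repeat each prompt once per choice, and flatten the choice lists
first_sentences = [p for p in prompts for _ in range(num_choices)]
second_sentences = [c for per_prompt in choices for c in per_prompt]

# Tokenize the flat lists -> tensors of shape (batch_size * num_choices, seq_length)
encoding = tokenizer(first_sentences, second_sentences, return_tensors='tf', padding=True)

# Un-flatten back to (batch_size, num_choices, seq_length) for TFBertForMultipleChoice
inputs = {k: tf.reshape(v, (len(prompts), num_choices, -1)) for k, v in encoding.items()}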


Great! Exactly what I was looking for.

Ayala