ValueError: too many values to unpack (expected 2) when using BertTokenizer

Hi everyone,
I get an error when using BertTokenizer.
I do
encoding = tokenizer([[prompt, prompt, prompt], [choice0, choice1, choice2]], return_tensors='tf', padding=True)
and get
ValueError: too many values to unpack (expected 2).
When I do
encoding = tokenizer([[prompt, prompt], [choice0, choice1]], return_tensors='tf', padding=True)
it works. Any idea why? I want to fine-tune TFBertForMultipleChoice so that each question (prompt) has three choices instead of two as in the documentation (BERT — transformers 4.7.0 documentation).

Below is the complete code

import os
import numpy as np
import pandas as pd
import tensorflow as tf

from transformers import BertTokenizer, TFBertForMultipleChoice


tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = TFBertForMultipleChoice.from_pretrained('bert-base-uncased')

prompt = "Accept and check containers of mail from large volume mailers, couriers, and contractors."
choice0 = "Time Management"
choice1 = "Writing"
choice2 = "Reading Comprehension"

encoding = tokenizer([[prompt, prompt, prompt], [choice0, choice1, choice2]], return_tensors='tf', padding=True)
inputs = {k: tf.expand_dims(v, 0) for k, v in encoding.items()}
outputs = model(inputs)  # batch size is 1

logits = outputs.logits

Thanks!
Ayala

Oh, the example has one pair of [] that is not necessary, will fix. It should be:

encoding = tokenizer([prompt, prompt, prompt], [choice0, choice1, choice2], return_tensors='tf', padding=True)

Thanks!
But when I do

encoding = tokenizer([prompt, prompt, prompt], [choice0, choice1, choice2], return_tensors='tf', padding=True)

The encoding looks like the following:

{'input_ids': <tf.Tensor: shape=(3, 23), dtype=int32, numpy=
array([[  101,  5138,  1998,  4638, 16143,  1997,  5653,  2013,  2312,
         3872,  5653,  2545,  1010, 18092,  2015,  1010,  1998, 16728,
         1012,   102,  2051,  2968,   102],
       [  101,  5138,  1998,  4638, 16143,  1997,  5653,  2013,  2312,
         3872,  5653,  2545,  1010, 18092,  2015,  1010,  1998, 16728,
         1012,   102,  3015,   102,     0],
       [  101,  5138,  1998,  4638, 16143,  1997,  5653,  2013,  2312,
         3872,  5653,  2545,  1010, 18092,  2015,  1010,  1998, 16728,
         1012,   102,  3752, 26683,   102]], dtype=int32)>, 'token_type_ids': <tf.Tensor: shape=(3, 23), dtype=int32, numpy=
array([[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1,
        1],
       [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1,
        0],
       [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1,
        1]], dtype=int32)>, 'attention_mask': <tf.Tensor: shape=(3, 23), dtype=int32, numpy=
array([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
        1],
       [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
        0],
       [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
        1]], dtype=int32)>}

which, as far as I understand, is encoded as 3 pairs of texts and not as one question with 3 choices. Namely, wouldn’t I want the encoding to look something like
[101, 5138, ..., 102, 2051, 2968..., 102, 3015, ..., 102, 3752..., 102]
In other words, if I want to fine-tune TFBertForMultipleChoice, don’t I need to encode the prompt and choices as prompt choice0 choice1 choice2?

Thanks,
Ayala

No, you need tensors of shape (1, num_choices, seq_length), which is what the code sample will give you with the second line that does the expand_dims.
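
For reference, here is a minimal sketch of the same code with the shapes spelled out in comments (the choices list just reuses your choice0/choice1/choice2):

import tensorflow as tf
from transformers import BertTokenizer, TFBertForMultipleChoice

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = TFBertForMultipleChoice.from_pretrained('bert-base-uncased')

prompt = "Accept and check containers of mail from large volume mailers, couriers, and contractors."
choices = ["Time Management", "Writing", "Reading Comprehension"]

# Pair the prompt with each choice -> every tensor has shape (3, seq_length)
encoding = tokenizer([prompt] * len(choices), choices, return_tensors='tf', padding=True)

# Add a batch dimension -> every tensor has shape (1, 3, seq_length)
inputs = {k: tf.expand_dims(v, 0) for k, v in encoding.items()}

outputs = model(inputs)
print(outputs.logits.shape)  # (1, 3): one logit per choice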


Thank you very much, I was about to write the same!
I have one more question, hope that’s OK. How can I tokenize more than one prompt and its choices? Namely, I have a batch of prompts, and each one has its three choices. I can’t figure out how to tokenize it. Should I use

encoding = tokenizer([prompt, prompt, prompt], [choice0, choice1, choice2], return_tensors='tf', padding=True)
inputs = {k: tf.expand_dims(v, 0) for k, v in encoding.items()}

for each prompt and its choices, one at a time? Or is there a way to tokenize them in one batch?

Thanks again,
Ayala

You should look at the multiple choice examples (PyTorch only, but the processing is the same) or the notebook that goes with it.
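
In case it is useful, here is a rough sketch of the flatten-then-unflatten approach those examples use, adapted to TF tensors (the prompts, choices and num_choices=3 below are just placeholders):

import tensorflow as tf
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

prompts = ["first prompt ...", "second prompt ..."]    # batch of prompts
choices = [["a0", "a1", "a2"], ["b0", "b1", "b2"]]     # three choices per prompt
num_choices = 3

# Flatten: repeat each prompt once per choice, and flatten the choice lists
first_sentences = [p for p in prompts for _ in range(num_choices)]
second_sentences = [c for per_prompt in choices for c in per_prompt]

# Tokenize the flat lists -> tensors of shape (batch_size * num_choices, seq_length)
encoding = tokenizer(first_sentences, second_sentences, return_tensors='tf', padding=True)

# Un-flatten back to (batch_size, num_choices, seq_length) for TFBertForMultipleChoice
inputs = {k: tf.reshape(v, (len(prompts), num_choices, -1)) for k, v in encoding.items()}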


Great! Exactly what I was looking for.

Ayala