Encoding a sentence pair with BERT causes ValueError: not enough values to unpack (expected 2, got 1)

Hi,
It seems that I am somehow feeding input of the wrong shape, but I don’t understand how to fix this.
I am following this code example: nlp-notebooks/Fine_tune_ALBERT_sentence_pair_classification.ipynb at master · NadirEM/nlp-notebooks · GitHub

Here is my code:

from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained('albert-base-v2')
bert = AutoModel.from_pretrained('albert-base-v2')

sent1 = 'I love chocolate cakes'
sent2 = 'I love sweets'

encoded_pair = tokenizer(sent1, sent2, 
                              padding='max_length',  # Pad to max_length
                              truncation=True,  # Truncate to max_length
                              max_length=512,  
                              return_tensors='pt') 

token_ids = encoded_pair['input_ids'].squeeze(0) 
attn_masks = encoded_pair['attention_mask'].squeeze(0) 
token_type_ids = encoded_pair['token_type_ids'].squeeze(0)

cont_reps, pooler_output = bert(token_ids, attn_masks, token_type_ids)

It doesn’t work even without the ‘squeeze()’ call.
Please, can someone help me? I have been stuck on this for a long time now. :frowning:

Here is the traceback:

ValueError                                Traceback (most recent call last)
Input In [261], in <cell line: 1>()
----> 1 cont_reps, pooler_output = alephbert(input_ids, attn_masks, token_type_ids)

File C:\Users\BUDBUDIO\Anaconda3\lib\site-packages\torch\nn\modules\module.py:889, in Module._call_impl(self, *input, **kwargs)
    887     result = self._slow_forward(*input, **kwargs)
    888 else:
--> 889     result = self.forward(*input, **kwargs)
    890 for hook in itertools.chain(
    891         _global_forward_hooks.values(),
    892         self._forward_hooks.values()):
    893     hook_result = hook(self, input, result)

File C:\Users\BUDBUDIO\Anaconda3\lib\site-packages\transformers\models\albert\modeling_albert.py:716, in AlbertModel.forward(self, input_ids, attention_mask, token_type_ids, position_ids, head_mask, inputs_embeds, output_attentions, output_hidden_states, return_dict)
    713 else:
    714     raise ValueError("You have to specify either input_ids or inputs_embeds")
--> 716 batch_size, seq_length = input_shape
    717 device = input_ids.device if input_ids is not None else inputs_embeds.device
    719 if attention_mask is None:

ValueError: not enough values to unpack (expected 2, got 1)

Solved!

Apparently ‘input_ids’, ‘attention_mask’, and ‘token_type_ids’ all need to be of shape (batch_size, sequence_length), so when I used

.unsqueeze(0)

instead of

.squeeze(0)

it worked.
In addition, the tokenizer call should include the parameter is_split_into_words=True
to avoid ambiguity with a batch of sequences.
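
For reference, here is a minimal sketch of the whole pipeline that avoids the problem (assuming a recent transformers version, where the model returns an output object with last_hidden_state and pooler_output attributes). Since return_tensors='pt' already adds the batch dimension, the tensors come out as (1, 512) and can be passed to the model without any squeeze()/unsqueeze():

import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained('albert-base-v2')
bert = AutoModel.from_pretrained('albert-base-v2')

encoded_pair = tokenizer('I love chocolate cakes', 'I love sweets',
                         padding='max_length',
                         truncation=True,
                         max_length=512,
                         return_tensors='pt')

# return_tensors='pt' already adds the batch dimension, so every tensor
# has shape (1, 512) == (batch_size, sequence_length).
token_ids = encoded_pair['input_ids']
attn_masks = encoded_pair['attention_mask']
token_type_ids = encoded_pair['token_type_ids']
print(token_ids.shape)  # torch.Size([1, 512])

with torch.no_grad():
    outputs = bert(input_ids=token_ids,
                   attention_mask=attn_masks,
                   token_type_ids=token_type_ids)

cont_reps = outputs.last_hidden_state  # (1, 512, hidden_size)
pooler_output = outputs.pooler_output  # (1, hidden_size)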

Hope this helps others stuck on the same thing.
