BlenderBot forward method crashing

I am trying to use BlenderbotForConditionalGeneration and I'm getting the following error:

---------------------------------------------------------------------------
IndexError                                Traceback (most recent call last)
<ipython-input-51-b8db07c0c647> in <module>()
      2     input_ids = encoding['input_ids'],
      3     attention_mask = encoding['attention_mask'],
----> 4     labels=labels)

/usr/local/lib/python3.7/dist-packages/torch/nn/functional.py in embedding(input, weight, padding_idx, max_norm, norm_type, scale_grad_by_freq, sparse)
   2181         # remove once script supports set_grad_enabled
   2182         _no_grad_embedding_renorm_(weight, input, max_norm, norm_type)
-> 2183     return torch.embedding(weight, input, padding_idx, scale_grad_by_freq, sparse)
   2184 
   2185 

IndexError: index out of range in self

Actual Code


from transformers import BlenderbotTokenizer, BlenderbotForConditionalGeneration

MODEL_NAME = "facebook/blenderbot-400M-distill"
tokenizer = BlenderbotTokenizer.from_pretrained(MODEL_NAME)

encoding = tokenizer(
    sample_question['question'],
    sample_question['context'],
    max_length=1024,
    padding='max_length',
    truncation="only_second",
    return_attention_mask=True,
    add_special_tokens=True,
    return_tensors="pt"
)

answer_encoding = tokenizer(
    sample_question['answer_text'],
    max_length=1024,
    padding='max_length',
    truncation=True,
    return_attention_mask=True,
    add_special_tokens=True,
    return_tensors="pt"
)
labels = answer_encoding["input_ids"]


model = BlenderbotForConditionalGeneration.from_pretrained(MODEL_NAME, return_dict=True)

output = model(
    input_ids=encoding['input_ids'],
    attention_mask=encoding['attention_mask'],
    labels=labels
)

As it crashes during the embedding lookup, did you check that the vocabulary file is the correct one?
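One quick way to sanity-check that is to compare the tokenizer's vocabulary size against the size of the model's input-embedding matrix, e.g.:

# If the tokenizer can emit ids >= the embedding size, the lookup will fail.
print(len(tokenizer))                               # tokenizer vocabulary size
print(model.get_input_embeddings().num_embeddings)  # rows in the input-embedding matrix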

Hi @thies!

Thanks for checking. I got it resolved over Discord. The issue was that the tokenizer's max_length was wrong: this model's position-embedding table only has model.config.max_position_embeddings entries (128 for this checkpoint), so padding the inputs out to 1024 produces position indices that are out of range, which is what triggers the IndexError in the embedding lookup. The best way to find the right value is to check model.config.max_position_embeddings.
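For example, you can load just the config (without the model weights) and read it off:

from transformers import AutoConfig

config = AutoConfig.from_pretrained("facebook/blenderbot-400M-distill")
print(config.max_position_embeddings)  # 128 for facebook/blenderbot-400M-distill

Then keep the tokenizer's max_length at or below that value: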

tokenizer(
    data_row['question'],
    data_row['context'],
    max_length=128,              # == model.config.max_position_embeddings
    padding='max_length',
    truncation='only_second',
)

If anyone is interested, the notebook link is here.
I hard-coded max_length to 128 after checking model.config.
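For anyone who doesn't want to open the notebook, here is a rough, self-contained sketch of the corrected call. The strings are placeholders standing in for sample_question['question'] / ['context'] / ['answer_text'], and the -100 masking of padded label positions is optional good practice rather than part of the original fix:

from transformers import BlenderbotTokenizer, BlenderbotForConditionalGeneration

MODEL_NAME = "facebook/blenderbot-400M-distill"
tokenizer = BlenderbotTokenizer.from_pretrained(MODEL_NAME)
model = BlenderbotForConditionalGeneration.from_pretrained(MODEL_NAME)

# Never pad/truncate beyond the model's position-embedding table.
max_len = model.config.max_position_embeddings  # 128 for this checkpoint

encoding = tokenizer(
    "What time does the shop open?",        # placeholder question
    "The shop opens at 9am every day.",     # placeholder context
    max_length=max_len,
    padding="max_length",
    truncation="only_second",
    return_tensors="pt",
)

labels = tokenizer(
    "It opens at 9am.",                     # placeholder answer
    max_length=max_len,
    padding="max_length",
    truncation=True,
    return_tensors="pt",
)["input_ids"]

# Optional: don't compute loss on padding tokens.
labels[labels == tokenizer.pad_token_id] = -100

output = model(
    input_ids=encoding["input_ids"],
    attention_mask=encoding["attention_mask"],
    labels=labels,
)
print(output.loss)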