BigBirdModel: Problem running the code provided in the documentation

Hey folks, QQ: Has anyone tried running the example code from the BigBird documentation and run into problems? I’m simply trying to embed some input with the pre-trained model for initial exploration, and I get this error: IndexError: index out of range in self

Has anyone come across this error before or seen a fix for it? Thanks.
Full stack trace below:

IndexError Traceback (most recent call last)
6 inputs = tokenizer("Hello, my dog is cute", return_tensors="pt")
----> 7 outputs = model(**inputs)
8 outputs

~/SageMaker/persisted_conda_envs/intercom_kevin/lib/python3.6/site-packages/torch/nn/modules/ in _call_impl(self, *input, **kwargs)
720 result = self._slow_forward(*input, **kwargs)
721 else:
--> 722 result = self.forward(*input, **kwargs)
723 for hook in itertools.chain(
724 _global_forward_hooks.values(),

~/SageMaker/persisted_conda_envs/intercom_kevin/lib/python3.6/site-packages/transformers/models/big_bird/ in forward(self, input_ids, attention_mask, token_type_ids, position_ids, head_mask, inputs_embeds, encoder_hidden_states, encoder_attention_mask, past_key_values, use_cache, output_attentions, output_hidden_states, return_dict)
2076 token_type_ids=token_type_ids,
2077 inputs_embeds=inputs_embeds,
-> 2078 past_key_values_length=past_key_values_length,
2079 )

~/SageMaker/persisted_conda_envs/intercom_kevin/lib/python3.6/site-packages/torch/nn/modules/ in _call_impl(self, *input, **kwargs)
720 result = self._slow_forward(*input, **kwargs)
721 else:
--> 722 result = self.forward(*input, **kwargs)
723 for hook in itertools.chain(
724 _global_forward_hooks.values(),

~/SageMaker/persisted_conda_envs/intercom_kevin/lib/python3.6/site-packages/transformers/models/big_bird/ in forward(self, input_ids, token_type_ids, position_ids, inputs_embeds, past_key_values_length)
284 if inputs_embeds is None:
--> 285 inputs_embeds = self.word_embeddings(input_ids)
287 if self.rescale_embeddings:

~/SageMaker/persisted_conda_envs/intercom_kevin/lib/python3.6/site-packages/torch/nn/modules/ in _call_impl(self, *input, **kwargs)
720 result = self._slow_forward(*input, **kwargs)
721 else:
--> 722 result = self.forward(*input, **kwargs)
723 for hook in itertools.chain(
724 _global_forward_hooks.values(),

~/SageMaker/persisted_conda_envs/intercom_kevin/lib/python3.6/site-packages/torch/nn/modules/ in forward(self, input)
124 return F.embedding(
125 input, self.weight, self.padding_idx, self.max_norm,
--> 126 self.norm_type, self.scale_grad_by_freq, self.sparse)
128 def extra_repr(self) -> str:

~/SageMaker/persisted_conda_envs/intercom_kevin/lib/python3.6/site-packages/torch/nn/ in embedding(input, weight, padding_idx, max_norm, norm_type, scale_grad_by_freq, sparse)
1812 # remove once script supports set_grad_enabled
1813 no_grad_embedding_renorm(weight, input, max_norm, norm_type)
-> 1814 return torch.embedding(weight, input, padding_idx, scale_grad_by_freq, sparse)

IndexError: index out of range in self

cc @vasudevgupta

hi @khmcnally,

I was running this:

m = BigBirdModel.from_pretrained("google/bigbird-roberta-base")
sample = tokenizer("Hello, my dog is cute", return_tensors="pt")

and it’s working for me. Are you using the model in some other configuration? If so, please share it and I will try to run it.

Thanks so much for the quick reply. I’m simply running this:

from transformers import BigBirdTokenizer, BigBirdModel

tokenizer = BigBirdTokenizer.from_pretrained('google/bigbird-roberta-base')
model = BigBirdModel.from_pretrained('google/bigbird-roberta-base')

inputs = tokenizer("Hello, my dog is cute", return_tensors="pt")
outputs = model(**inputs)

Perhaps it’s my dev configuration that’s the issue? I’m running my own conda environment with transformers v4.5.1, if that helps?

Edit: There’s also some logging provided that may be useful context:

Attention type 'block_sparse' is not possible if sequence_length: 8 <= num global tokens: 2 * config.block_size + min. num sliding tokens: 3 * config.block_size + config.num_random_blocks * config.block_size + additional buffer: config.num_random_blocks * config.block_size = 704 with config.block_size = 64, config.num_random_blocks = 3.Changing attention type to 'original_full'...

This issue is fixed in both the latest pip release and the master branch, so installing either one should resolve it.

The warning you mentioned is fine. Since your sequence is very short, 🤗 Transformers switches your model’s attention_type to "original_full". To use block sparse attention, you have to pass an input sequence longer than the threshold given in the warning (704 tokens with these settings).
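As a rough sketch of the arithmetic in that warning (not the library's actual code), the threshold can be reproduced from the two config values it prints. The function names below are hypothetical; only the formula and the numbers 64, 3, and 704 come from the log message.

```python
def min_block_sparse_length(block_size: int = 64, num_random_blocks: int = 3) -> int:
    """Threshold from the warning text; block_sparse needs a longer sequence than this."""
    global_tokens = 2 * block_size                   # num global tokens
    sliding_tokens = 3 * block_size                  # min. num sliding tokens
    random_tokens = num_random_blocks * block_size   # random blocks
    buffer_tokens = num_random_blocks * block_size   # additional buffer
    return global_tokens + sliding_tokens + random_tokens + buffer_tokens


def expected_attention_type(seq_len: int, block_size: int = 64, num_random_blocks: int = 3) -> str:
    """Hypothetical helper mirroring the 'sequence_length <= threshold' check in the warning."""
    threshold = min_block_sparse_length(block_size, num_random_blocks)
    return "original_full" if seq_len <= threshold else "block_sparse"


if __name__ == "__main__":
    print(min_block_sparse_length())      # threshold for block_size=64, num_random_blocks=3
    print(expected_attention_type(8))     # the 8-token example from this thread
```

With the values from the log, an 8-token input falls far below the threshold, which is why the model falls back to full attention.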

Great Vasudev, thanks so much for your help here! Pardon my ignorance, but do I update to latest version of transformers package or is it some other package I should update? Which version should I be using exactly? Is there a later version than 4.5.1?

# To install the latest version, run this:
pip3 install transformers==4.5.1

## or

# if you want to work with master branch
pip3 install git+

Thanks so much Vasudev, that worked perfectly!

Hi @vasudevgupta (appreciate I’m cheekily @-ing you!)

The above runs fine, but the shape of the output differs depending on the length of the text.

For example, if I use the string 'hello' and call .shape on the returned tensor, the dims are [1, 3, 768].
But if I use 'hello my dog is cute', the dims are [1, 7, 768].

Perhaps this is expected behaviour? I have experience using various language models in tensorflow and they tend to generate vectors with the same numbers of dimensions. Perhaps there is something really fundamental or obvious I’m missing here?

For context, I’m exploring matching conversational type questions to knowledge base articles, and I wanted to test this language model for suitability. So I’m trying to calculate cosine similarity between embeddings of questions and those of articles.

Any help would be much appreciated! My guess is I have to use one of the hidden layers as embeddings but I’m not sure.

Hi, this is happening because the 🤗 tokenizer adds the [CLS] and [SEP] tokens. Hence the input sequence length for 'hello' is 3, not 1.
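To make the shapes concrete: the middle dimension is the number of tokens (including [CLS] and [SEP]), and only the last dimension (768) is fixed. Below is a minimal sketch of two common ways to reduce a [seq_len, hidden] matrix to one fixed-size vector before computing cosine similarity. It assumes plain Python lists as stand-ins for one row of the model output (presumably outputs.last_hidden_state[0]) and a toy hidden size of 4 instead of 768. The helper names are hypothetical and not part of transformers, and whether either pooling works well for question-to-article matching would need to be checked empirically.

```python
import math


def first_token_vector(hidden_states):
    """Take the vector at position 0, i.e. the [CLS] position."""
    return hidden_states[0]


def mean_pool(hidden_states):
    """Average the token vectors so the result has length hidden_size."""
    hidden_size = len(hidden_states[0])
    num_tokens = len(hidden_states)
    return [sum(tok[i] for tok in hidden_states) / num_tokens for i in range(hidden_size)]


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Toy data: 3 "tokens" and 7 "tokens", each with hidden size 4 (768 in the real model).
short_text = [[1.0, 0.0, 2.0, 0.0], [0.0, 1.0, 0.0, 1.0], [2.0, 2.0, 1.0, 2.0]]
long_text = [[float((t + i) % 3) for i in range(4)] for t in range(7)]

# Both pooled vectors have the same length regardless of the token count.
pooled_short = mean_pool(short_text)
pooled_long = mean_pool(long_text)

if __name__ == "__main__":
    print(len(pooled_short), len(pooled_long))
    print(cosine_similarity(pooled_short, pooled_long))
    print(cosine_similarity(first_token_vector(short_text), first_token_vector(long_text)))
```

Either way, texts of different lengths end up as vectors of the same size, so they can be compared directly.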

Ah okay, so I just use the CLS token, grabbing it with something like this?

Piling onto this thread; I’m fine with the automatic change to full attention, but is there a way to suppress the warning? Also, I’m kind of assuming that “original_full” doesn’t mean that it’s fundamentally the same as a standard bidirectional model like BERT, but is that correct? (I guess I’m not totally clear on the difference between sparse attention and “block_sparse” attention.)