I am using the pre-trained google/bigbird-pegasus-large-arxiv model, but I get the following warning during the forward pass:
Attention type 'block_sparse' is not possible if sequence_length: 458 <= num global tokens: 2 * config.block_size + min. num sliding tokens: 3 * config.block_size + config.num_random_blocks * config.block_size + additional buffer: config.num_random_blocks * config.block_size = 704 with config.block_size = 64, config.num_random_blocks = 3. Changing attention type to 'original_full'...
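For context, the terms in that message add up to (5 + 2 * config.num_random_blocks) * config.block_size. A minimal sketch of that arithmetic (64 and 3 are the values reported in my run; 16 is just a hypothetical alternative, not a recommendation):

```python
# Minimum sequence length needed before block_sparse attention kicks in,
# built from the terms in the warning above:
#   2*block_size (global) + 3*block_size (sliding)
#   + num_random_blocks*block_size (random)
#   + num_random_blocks*block_size (additional buffer)
def min_len_for_block_sparse(block_size: int, num_random_blocks: int) -> int:
    return (5 + 2 * num_random_blocks) * block_size

print(min_len_for_block_sparse(64, 3))  # 704, matching the warning
print(min_len_for_block_sparse(16, 3))  # 176, hypothetical smaller block_size
```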
I understand the warning, and I am aware of the time and memory savings of block_sparse attention compared to original_full.
So, how should I go about selecting a suitable block_size and num_random_blocks when there is a lot of variation in the sequence length of my inputs?
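In case it helps frame the question, this is how I could override those values at load time. A minimal sketch, assuming from_pretrained accepts config-override kwargs; block_size=16 and num_random_blocks=2 are purely illustrative values I have not validated:

```python
from transformers import BigBirdPegasusForConditionalGeneration

# Illustrative values only: a smaller block_size and fewer random blocks
# would lower the block_sparse threshold to (5 + 2*2) * 16 = 144 tokens.
model = BigBirdPegasusForConditionalGeneration.from_pretrained(
    "google/bigbird-pegasus-large-arxiv",
    attention_type="block_sparse",  # keep sparse attention where possible
    block_size=16,
    num_random_blocks=2,
)
print(model.config.block_size, model.config.num_random_blocks)
```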