I am using a pre-trained BigBird model, but during the forward pass I receive the following warning:
```
Attention type 'block_sparse' is not possible if sequence_length: 458 <= num global tokens: 2 * config.block_size + min. num sliding tokens: 3 * config.block_size + config.num_random_blocks * config.block_size + additional buffer: config.num_random_blocks * config.block_size = 704 with config.block_size = 64, config.num_random_blocks = 3. Changing attention type to 'original_full'...
```
I understand the warning, and I am aware of the time and memory savings of using `block_sparse` attention.
So, how should I go about selecting a suitable `block_size` and `num_random_blocks` when I know that there is a lot of variation in the sequence lengths of my inputs?