When running inference with the vicuna-13b-16k model, I get the following error whenever the context length exceeds 4096 tokens:
```
RuntimeError: The size of tensor a (4096) must match the size of tensor b (4097) at non-singleton dimension 3
```
As I understand it, two parameters in the model config are meant to extend the context length (prompt plus model response) to 16384 tokens:
"max_position_embeddings": 4096,
"rope_scaling": {
"factor": 4.0,
The error occurs in the `self._update_causal_mask` function, on the following line:

```python
padding_mask = causal_mask[..., :mask_length].eq(0.0) * attention_mask[:, None, None, :].eq(0.0)
```
`causal_mask` has shape `[1, 1, 4096, 4096]`.
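The shape clash is easy to reproduce in isolation (a toy sketch using the shapes from my run; 4097 = 4096 prompt tokens plus the first generated one):

```python
import torch

# The mask buffer is sized by max_position_embeddings -> [1, 1, 4096, 4096].
causal_mask = torch.zeros(1, 1, 4096, 4096)
# The attention mask covers all 4097 tokens seen so far -> [1, 4097].
attention_mask = torch.ones(1, 4097)
mask_length = attention_mask.shape[-1]

# Slicing past the end of the last dim silently keeps it at 4096, so the
# multiply broadcasts [1, 1, 4096, 4096] against [1, 1, 1, 4097] and fails:
# RuntimeError: The size of tensor a (4096) must match the size of tensor b
# (4097) at non-singleton dimension 3
padding_mask = causal_mask[..., :mask_length].eq(0.0) * attention_mask[:, None, None, :].eq(0.0)
```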
The condition `if seq_length > self.causal_mask.shape[-1]:` is never met, because generation uses the KV cache: each decoding step processes a single token, so `inputs_embeds` (`input_tensor`) has shape `[1, 1, 5120]` and `seq_length` is 1.
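The only workaround I have come up with so far is to re-register the mask buffer at the full 16384-token size before generation, mirroring how `LlamaModel.__init__` builds it from `max_position_embeddings` (an untested sketch; I am not sure it has no side effects):

```python
import torch

# `model` is the already loaded vicuna model; 16384 = 4096 * rope_scaling factor.
target_len = 16384

full_mask = torch.full(
    (target_len, target_len), fill_value=True, dtype=torch.bool, device=model.device
)
# The part above the diagonal marks the future positions to mask out, the same
# construction the model uses for the original 4096-position buffer.
model.model.register_buffer(
    "causal_mask", torch.triu(full_mask, diagonal=1), persistent=False
)
```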
Does anyone have ideas on how to fix this error properly?