Below is my code:
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM
tokenizer = AutoTokenizer.from_pretrained("falcon-mamba-7b")
model = AutoModelForCausalLM.from_pretrained("/falcon-mamba-7b", device_map="auto", torch_dtype=torch.bfloat16)
input_text = [", ".join(["Iron Man"] * 7)]
input_ids = tokenizer(input_text, return_tensors="pt").input_ids.to("cuda")
a = model(input_ids, output_hidden_states=True).hidden_states  # full prompt in a single forward pass
cache = model(input_ids[:, :6], use_cache=True).cache_params  # prefill the first 6 tokens
b = model(input_ids[:, 6:], cache_params=cache, cache_position=torch.tensor([0, 1, 2, 3]), output_hidden_states=True).hidden_states  # continue from the cache
print((a[-1][0][-1] - b[-1][0][-1]).abs().max())  # max difference in the last token's final hidden state
For sequential prefilling like this, the hidden states produced by the two approaches should be identical for Mamba. However, my code does not reproduce that: the printed difference is not near zero. I'm wondering how to set "cache_position" properly; it seems to only accept a tensor of shape (4,), where 4 is the default conv_kernel size.
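For reference, here is the variant I originally expected to work. It assumes that "cache_position" holds the absolute positions of the tokens in the chunk being passed in; that is only my guess, I could not find this documented anywhere:

# continues from the snippet above (model and input_ids are already defined)
cache = model(input_ids[:, :6], use_cache=True).cache_params
# my assumption: cache_position = absolute positions of the new tokens
positions = torch.arange(6, input_ids.shape[1], device=input_ids.device)
b = model(input_ids[:, 6:], cache_params=cache, cache_position=positions, output_hidden_states=True).hidden_states

This is the attempt that led me to the shape-(4,) observation above, so I am clearly misunderstanding what cache_position is supposed to contain.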
There is no example code for features like this; can anyone help me?