Below is my code:
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM
tokenizer = AutoTokenizer.from_pretrained("falcon-mamba-7b")
model = AutoModelForCausalLM.from_pretrained("/falcon-mamba-7b", device_map="auto", torch_dtype=torch.bfloat16)
input_text = [", ".join(["Iron Man"] * 7)]
input_ids = tokenizer(input_text, return_tensors="pt").input_ids.to("cuda")
a = model(input_ids, output_hidden_states=True).hidden_states  # full prompt in a single forward pass
cache = model(input_ids[:, :6], use_cache=True).cache_params  # prefill the first 6 tokens
b = model(input_ids[:, 6:], cache_params=cache, cache_position=torch.tensor([0, 1, 2, 3]), output_hidden_states=True).hidden_states  # continue from the cache
print((a[-1][0][-1] - b[-1][0][-1]).abs().max())  # max difference in the last token's final hidden state
For sequential prefilling like this, the hidden states produced by the two approaches should be identical for Mamba. However, my code does not reproduce that: the printed difference is not near zero. I'm wondering how to set "cache_position" properly; it seems to only accept a tensor of shape (4,), where 4 is the default conv_kernel size.
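For reference, here is the variant I originally expected to work. It assumes that "cache_position" holds the absolute positions of the tokens in the chunk being passed in; that is only my guess, I could not find this documented anywhere:

# continues from the snippet above (model and input_ids are already defined)
cache = model(input_ids[:, :6], use_cache=True).cache_params
# my assumption: cache_position = absolute positions of the new tokens
positions = torch.arange(6, input_ids.shape[1], device=input_ids.device)
b = model(input_ids[:, 6:], cache_params=cache, cache_position=positions, output_hidden_states=True).hidden_states

This is the attempt that led me to the shape-(4,) observation above, so I am clearly misunderstanding what cache_position is supposed to contain.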
There is no example code for features like this; can anyone help me?