Llama3 Text-Generation Pipeline Appears to Be Saving a Memory/Cache

Hello,

I have been using Llama3 to run a set of prompt series in a few-shot fashion. I have noticed that when changing from prompt series 1 to prompt series 2, the model generates information that it could only have picked up from prompt series 1.

This leads me to believe that the pipeline or the model itself is caching information. The only solution I have found is to reload everything. Is there a more efficient way to ensure that the model generates with a clean context for every new series of prompts?

Cheers

Can you share a minimal reproducible example?

If you are using transformers model.generate(), the cache is not kept between calls: each call starts from a fresh cache unless you explicitly pass one in with past_key_values.
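Roughly, as a sketch with the plain model and tokenizer (hypothetical prompts, not your data), two back-to-back generate() calls share nothing unless you pass past_key_values yourself:

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Meta-Llama-3-8B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.float16, device_map="cuda"
)

# Each generate() call builds its own KV cache internally and discards it
# afterwards, so the second call cannot "see" anything from the first.
inputs_1 = tokenizer("Prompt for series 1", return_tensors="pt").to(model.device)
out_1 = model.generate(**inputs_1, max_new_tokens=32)   # fresh cache

inputs_2 = tokenizer("Prompt for series 2", return_tensors="pt").to(model.device)
out_2 = model.generate(**inputs_2, max_new_tokens=32)   # again a fresh cache, no carry-over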


model_id = "meta-llama/Meta-Llama-3-8B-Instruct"

pipeline_instance = pipeline(
    "text-generation",
    model=model_id,
    model_kwargs={        
        "torch_dtype": torch.float16,
        "quantization_config": {"load_in_4bit": True},
        "low_cpu_mem_usage": True
        },
    device_map="cuda",
    token='redacted'
)



terminators = [
    pipeline_instance.tokenizer.eos_token_id,
    pipeline_instance.tokenizer.convert_tokens_to_ids("<|eot_id|>")
]

# `messages` (system prompt + few-shot user/assistant turns) and `query` are
# built earlier in a loop over prompts; their content is withheld here.
messages.append({"role": "user", "content": query})
prompt = pipeline_instance.tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True
)
for message in messages:
    print(message)
outputs = pipeline_instance(
    prompt,
    max_new_tokens=256,
    eos_token_id=terminators,
    do_sample=True,
    temperature=0.6,
    top_p=0.9,
)

The data are sensitive, so I can't share the content of “messages”. It is just a system prompt followed by a few user/assistant turns (few-shot style).

I noticed this problem when I changed the system prompt: the generated text still contained key words that I had only mentioned in the previous system prompt. The model is Llama3.

Hmm, I cannot reproduce it with the following code. My wild guess is that messages from old turns were left in your history, which led the final “prompt” produced by apply_chat_template to contain the older system prompt.

messages = [
    [{"role": "user", "content": "You are a coach that always gives unhelpful advice. What should I eat for dinner?"}],
    [{"role": "user", "content": "What do I say to my boss if I am late for work?"}],
]

for message in messages:
    prompt = pipeline_instance.tokenizer.apply_chat_template(
        message, 
        tokenize=False, 
        add_generation_prompt=True
    )

    outputs = pipeline_instance(
        prompt,
        max_new_tokens=256,
        eos_token_id=terminators,
        do_sample=True,
        temperature=0.6,
        top_p=0.9,
    )
    print(outputs)

Trying again today, weirdly I cannot replicate the issue anymore. All I have done is shut down the AWS instance and the Jupyter server I was running it on.

Wait, never mind. It does still remember the system prompts when using {"role": "system", "content": "text"} between sessions.

It sounds like there is a variable you're using that isn't being reset between runs. I would duplicate your code for series 1 and series 2 and make sure none of the variables are shared except the ones you set explicitly, i.e. the model and probably the tokenizer.

If restarting the notebook fixes the issue, that leads me to conclude that a variable is not being reset between runs.
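Something along these lines (a hypothetical run_series helper, adapt to your few-shot setup) would guarantee that each series starts from its own fresh messages list:

# Hypothetical helper: every series (and every query) rebuilds `messages`
# from scratch, so nothing appended during series 1 can leak into series 2.
def run_series(system_prompt, few_shot_turns, queries):
    for query in queries:
        messages = [{"role": "system", "content": system_prompt}]
        messages += few_shot_turns   # list of {"role": ..., "content": ...} dicts
        messages.append({"role": "user", "content": query})

        prompt = pipeline_instance.tokenizer.apply_chat_template(
            messages, tokenize=False, add_generation_prompt=True
        )
        outputs = pipeline_instance(
            prompt,
            max_new_tokens=256,
            eos_token_id=terminators,
            do_sample=True,
            temperature=0.6,
            top_p=0.9,
        )
        print(outputs[0]["generated_text"])

Calling run_series once per prompt series keeps the histories fully independent.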

That was my thought as well, though every variable is reinitialised between runs.

I thought it might be possible that the model or the pipeline instance is keeping a cache of the system prompt. For now it's not an urgent problem; I am happy to just reset the instance for each new experiment run.
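In case it helps anyone, the reset between runs looks roughly like this (a sketch only, reusing the same arguments as the pipeline above):

import gc
import torch
from transformers import pipeline

# Drop the old pipeline, free the GPU memory it held, then rebuild it with
# the same arguments as before (model_id as defined earlier in the thread).
del pipeline_instance
gc.collect()
torch.cuda.empty_cache()

pipeline_instance = pipeline(
    "text-generation",
    model=model_id,
    model_kwargs={
        "torch_dtype": torch.float16,
        "quantization_config": {"load_in_4bit": True},
        "low_cpu_mem_usage": True,
    },
    device_map="cuda",
    token="redacted",
)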