Llama3 Text-Generation Pipeline Appears to Be Saving a Memory/Cache

Hello,

I have been using Llama3 to run a set of prompt series in a few-shot fashion. I have noticed that when changing from prompt series 1 to prompt series 2, the model generates information that it could only have picked up from prompt series 1.

This leads me to believe that the pipeline or the model itself is caching information. The only solution I have found is to reload everything. Is there a more efficient way to ensure that the model generates with a clean context for every new series of prompts?

Cheers

Can you share a minimal reproducible example?

If you are using transformers model.generate(), the cache is not kept between calls: each call starts from a fresh cache unless you explicitly pass one in with past_key_values.
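Roughly, as a sketch with the plain model and tokenizer (hypothetical prompts, not your data), two back-to-back generate() calls share nothing unless you pass past_key_values yourself:

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Meta-Llama-3-8B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.float16, device_map="cuda"
)

# Each generate() call builds its own KV cache internally and discards it
# afterwards, so the second call cannot "see" anything from the first.
inputs_1 = tokenizer("Prompt for series 1", return_tensors="pt").to(model.device)
out_1 = model.generate(**inputs_1, max_new_tokens=32)   # fresh cache

inputs_2 = tokenizer("Prompt for series 2", return_tensors="pt").to(model.device)
out_2 = model.generate(**inputs_2, max_new_tokens=32)   # again a fresh cache, no carry-over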


model_id = "meta-llama/Meta-Llama-3-8B-Instruct"

pipeline_instance = pipeline(
    "text-generation",
    model=model_id,
    model_kwargs={        
        "torch_dtype": torch.float16,
        "quantization_config": {"load_in_4bit": True},
        "low_cpu_mem_usage": True
        },
    device_map="cuda",
    token='redacted'
)



terminators = [
    pipeline_instance.tokenizer.eos_token_id,
    pipeline_instance.tokenizer.convert_tokens_to_ids("<|eot_id|>")
]

# `messages` (system prompt + few-shot user/assistant turns) and `query` are
# built earlier in a loop over prompts; their content is withheld here.
messages.append({"role": "user", "content": query})
prompt = pipeline_instance.tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True
)
for message in messages:
    print(message)
outputs = pipeline_instance(
    prompt,
    max_new_tokens=256,
    eos_token_id=terminators,
    do_sample=True,
    temperature=0.6,
    top_p=0.9,
)

The data are sensitive, so I can't share the content of “messages”. It is just a system prompt followed by a few user/assistant turns (few-shot style).

I noticed this problem when I changed the system prompt: the generated text still contained key words that I had only mentioned in the previous system prompt. The model is Llama3.

Hmm, I cannot reproduce it with the following code. My wild guess is that messages from old turns were left in your history, which led the final “prompt” produced by apply_chat_template to contain the older system prompt.

messages = [
    [{"role": "user", "content": "You are a coach that always gives unhelpful advice. What should I eat for dinner?"}],
    [{"role": "user", "content": "What do I say to my boss if I am late for work?"}],
]

for message in messages:
    prompt = pipeline_instance.tokenizer.apply_chat_template(
        message, 
        tokenize=False, 
        add_generation_prompt=True
    )

    outputs = pipeline_instance(
        prompt,
        max_new_tokens=256,
        eos_token_id=terminators,
        do_sample=True,
        temperature=0.6,
        top_p=0.9,
    )
    print(outputs)

Trying again today, weirdly I cannot replicate the issue anymore. All I have done is shut down the AWS instance and the Jupyter server I was running it on.

Wait, never mind. It does still remember the system prompts when using {"role": "system", "content": "text"} between sessions.

It sounds like there is a variable you're using that isn't being reset between runs. I would duplicate your code for series 1 and series 2 and make sure none of the variables are shared except the ones you set explicitly, i.e. the model and probably the tokenizer.

If restarting the notebook fixes the issue, that leads me to conclude that a variable is not being reset between runs.
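Something along these lines (a hypothetical run_series helper, adapt to your few-shot setup) would guarantee that each series starts from its own fresh messages list:

# Hypothetical helper: every series (and every query) rebuilds `messages`
# from scratch, so nothing appended during series 1 can leak into series 2.
def run_series(system_prompt, few_shot_turns, queries):
    for query in queries:
        messages = [{"role": "system", "content": system_prompt}]
        messages += few_shot_turns   # list of {"role": ..., "content": ...} dicts
        messages.append({"role": "user", "content": query})

        prompt = pipeline_instance.tokenizer.apply_chat_template(
            messages, tokenize=False, add_generation_prompt=True
        )
        outputs = pipeline_instance(
            prompt,
            max_new_tokens=256,
            eos_token_id=terminators,
            do_sample=True,
            temperature=0.6,
            top_p=0.9,
        )
        print(outputs[0]["generated_text"])

Calling run_series once per prompt series keeps the histories fully independent.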

That was my thought as well, though every variable is reinitialised between runs.

I thought it might be possible that the model or the pipeline instance is keeping a cache of the system prompt. For now it's not an urgent problem; I am happy to just reset the instance for each new experiment run.
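In case it helps anyone, the reset between runs looks roughly like this (a sketch only, reusing the same arguments as the pipeline above):

import gc
import torch
from transformers import pipeline

# Drop the old pipeline, free the GPU memory it held, then rebuild it with
# the same arguments as before (model_id as defined earlier in the thread).
del pipeline_instance
gc.collect()
torch.cuda.empty_cache()

pipeline_instance = pipeline(
    "text-generation",
    model=model_id,
    model_kwargs={
        "torch_dtype": torch.float16,
        "quantization_config": {"load_in_4bit": True},
        "low_cpu_mem_usage": True,
    },
    device_map="cuda",
    token="redacted",
)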