I have been using Llama 3 to run a set of prompt series in a few-shot fashion. I have noticed that, when switching from prompt series 1 to prompt series 2, the model generates information that it could only have picked up from prompt series 1.
This leads me to believe that the pipeline or the model itself is caching information. The only solution I have found is to reload everything. Is there a more efficient way to ensure that the model generates with a clean context for every new series of prompts?
If you are using transformers' model.generate(), the cache should not be saved between calls, so each call starts from a fresh cache unless you explicitly pass one in via past_key_values.
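For example, a minimal sketch of that default behaviour (the model id is illustrative; any causal LM behaves the same way):

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Meta-Llama-3-8B-Instruct"  # illustrative
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

# Call 1: generate() builds its KV cache internally and discards it when done.
inputs_1 = tokenizer("Prompt for series 1", return_tensors="pt").to(model.device)
_ = model.generate(**inputs_1, max_new_tokens=32)

# Call 2: nothing from call 1 is visible here; the cache starts empty again.
inputs_2 = tokenizer("Prompt for series 2", return_tensors="pt").to(model.device)
_ = model.generate(**inputs_2, max_new_tokens=32)

# State only carries over if you opt in, e.g. by passing past_key_values
# (together with the full token history) to a later generate() call.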
The data are sensitive so I can't share the content of “messages”, though it is just a system prompt followed by a few user/assistant turns (few-shot style).
I noticed this problem when I changed the system prompt, as the generated text still contained keywords that I had only mentioned in the previous system prompt. The model is Llama 3.
Hmm, I cannot reproduce it with the following code. My wild guess is that messages from old turns were left in your history, which caused the final “prompt” after apply_chat_template to contain the older system prompt.
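(For completeness, a minimal setup for pipeline_instance and terminators, along the lines of the standard Llama 3 pipeline example; the model id is illustrative:)

import torch
import transformers

model_id = "meta-llama/Meta-Llama-3-8B-Instruct"  # illustrative
pipeline_instance = transformers.pipeline(
    "text-generation",
    model=model_id,
    model_kwargs={"torch_dtype": torch.bfloat16},
    device_map="auto",
)
# Llama 3 uses <|eot_id|> as an end-of-turn marker in addition to the EOS token.
terminators = [
    pipeline_instance.tokenizer.eos_token_id,
    pipeline_instance.tokenizer.convert_tokens_to_ids("<|eot_id|>"),
]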
messages = [
    [{"role": "user", "content": "You are a coach that always gives unhelpful advice. What should I eat for dinner?"}],
    [{"role": "user", "content": "What do I say to my boss if I am late for work?"}],
]

# Each item in `messages` is an independent conversation; nothing is carried
# over from one iteration of the loop to the next.
for message in messages:
    prompt = pipeline_instance.tokenizer.apply_chat_template(
        message,
        tokenize=False,
        add_generation_prompt=True,
    )
    outputs = pipeline_instance(
        prompt,
        max_new_tokens=256,
        eos_token_id=terminators,
        do_sample=True,
        temperature=0.6,
        top_p=0.9,
    )
    print(outputs)
It sounds like there is a variable you're using that's not being reset between your runs. I would duplicate your code for series 1 and series 2 and make sure no variables are shared between them, except the ones you set explicitly such as the model and probably the tokenizer.
If restarting the notebook fixes the issue, that points to a variable not being reset between the runs.
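For instance, a hypothetical sketch of that failure mode (the variable and helper names are made up): if one notebook-global messages list is appended to across series instead of being rebuilt, apply_chat_template will quietly fold the old system prompt into every later prompt.

# Hypothetical sketch: a notebook-global list that keeps growing across runs.
series_1_turns = [{"role": "user", "content": "Few-shot example for series 1"}]
series_2_turns = [{"role": "user", "content": "Few-shot example for series 2"}]

messages = [{"role": "system", "content": "System prompt for series 1"}]
messages += series_1_turns
# ... run series 1 ...
messages += series_2_turns  # bug: series 1's system prompt and turns are still in the list

# Fix: rebuild the message list from scratch for each series.
def build_messages(system_prompt, turns):
    return [{"role": "system", "content": system_prompt}, *turns]

messages = build_messages("System prompt for series 2", series_2_turns)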
This was my thought as well, though every variable is reinitialised between runs.
I thought maybe it's possible that the model or pipeline class instance is keeping a cache of the system prompt. For now it's not an urgent problem; I am happy to just reset the instance for each new experiment run.
@RaushanTurganbay, @swtb I just experienced the same thing with the Llama 3.1 8B model. Do you think this is due to the KV caching used in the Llama models, or something else?
@Owos the models usually do not save the past KV cache unless explicitly asked to continue generation (for example by providing the whole past history or the previous cache). Can you try to reproduce the behavior on Colab and share it?
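(For reference, a minimal sketch of what "explicitly asked to continue" can look like, assuming a recent transformers release; the model id is illustrative and the details of cache reuse vary somewhat between versions:)

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Meta-Llama-3-8B-Instruct"  # illustrative
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

# First call: ask generate() to hand back the cache it built.
first = tokenizer("Turn one of a conversation.", return_tensors="pt").to(model.device)
out = model.generate(**first, max_new_tokens=32, return_dict_in_generate=True)

# Continuing is opt-in: pass the whole token history plus the previous cache.
next_turn = tokenizer(" And turn two.", add_special_tokens=False, return_tensors="pt").to(model.device)
full_ids = torch.cat([out.sequences, next_turn.input_ids], dim=-1)
continued = model.generate(
    full_ids,
    past_key_values=out.past_key_values,
    max_new_tokens=32,
)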
If you are experiencing the same issue as before and restarting the kernel in the notebook solves it, then it is most probably down to the way the code is written, e.g. it relies on some global variables.
I don't think it is a model thing; I think it is an issue with the pipeline. Say I have 100 examples to run in a loop: the first few examples usually take longer to run, and after that, iterating over the examples becomes super fast.