Idefics2 multi-turn inference

Hi, I’m running inference with the Idefics2 model. I’m following the example from the model card (HuggingFaceM4/idefics2-8b on Hugging Face), but there’s a part I don’t understand.

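import torch
from transformers import AutoProcessor, AutoModelForVision2Seq

DEVICE = "cuda:0" if torch.cuda.is_available() else "cpu"

# image1 and image2 are PIL images loaded beforehand (e.g. with PIL.Image.open)
processor = AutoProcessor.from_pretrained("HuggingFaceM4/idefics2-8b")
model = AutoModelForVision2Seq.from_pretrained(
    "HuggingFaceM4/idefics2-8b",
).to(DEVICE)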

# Create inputs
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image"},
            {"type": "text", "text": "What do we see in this image?"},
        ]
    },
    {
        "role": "assistant",
        "content": [
            {"type": "text", "text": "In this image, we can see the city of New York, and more specifically the Statue of Liberty."},
        ]
    },
    {
        "role": "user",
        "content": [
            {"type": "image"},
            {"type": "text", "text": "And how about this image?"},
        ]
    },
]
prompt = processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = processor(text=prompt, images=[image1, image2], return_tensors="pt")
inputs = {k: v.to(DEVICE) for k, v in inputs.items()}


# Generate
generated_ids = model.generate(**inputs, max_new_tokens=500)
generated_texts = processor.batch_decode(generated_ids, skip_special_tokens=True)

print(generated_texts)
# ['User: What do we see in this image? \nAssistant: In this image, we can see the city of New York, and more specifically the Statue of Liberty. \nUser: And how about this image? \nAssistant: In this image we can see buildings, trees, lights, water and sky.']

The code above includes the assistant’s response in the messages. Isn’t this predetermining what the assistant will say? I would expect the response to be generated by the model after it has analyzed the image, not supplied by me.
I have five images taken at the same location, and I want to pass all of them in the first turn. For example, I want to ask:

  1. What type of place do you think this is?

Based on the assistant’s response, I want to ask follow-up questions in a multi-turn conversation. My example might not be the best fit for multi-turn, but ultimately, I want to know how to run multi-turn inference.
How should I format the messages for this? Can anyone help?
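
To make the question concrete, here is a minimal sketch of how I imagine the conversation being built up: only the user turn is supplied at first, the model’s generated reply is appended as an assistant turn, and then the follow-up is added. The helper names (`user_turn`, `assistant_turn`, `count_images`), the variable `five_images`, and the follow-up question are my own, not part of transformers, and the reply string is a placeholder rather than real model output. I am assuming the processor expects one image per {"type": "image"} placeholder, in order, as in the snippet above.

```python
def user_turn(text, num_images=0):
    """One user message: `num_images` image placeholders followed by the text."""
    content = [{"type": "image"} for _ in range(num_images)]
    content.append({"type": "text", "text": text})
    return {"role": "user", "content": content}


def assistant_turn(text):
    """A reply the model generated earlier, fed back as conversation history."""
    return {"role": "assistant", "content": [{"type": "text", "text": text}]}


def count_images(messages):
    """Count image placeholders across all turns."""
    return sum(
        1
        for message in messages
        for part in message["content"]
        if part["type"] == "image"
    )


# Turn 1: all five images plus the first question, no assistant text yet.
messages = [user_turn("What type of place do you think this is?", num_images=5)]

# With the real model, this is where I would run (names as in the snippet above;
# `five_images` is a hypothetical list of five PIL images):
#   prompt = processor.apply_chat_template(messages, add_generation_prompt=True)
#   inputs = processor(text=prompt, images=five_images, return_tensors="pt")
#   generated_ids = model.generate(**inputs, max_new_tokens=500)
#   new_ids = generated_ids[:, inputs["input_ids"].shape[1]:]  # drop prompt tokens
#   first_reply = processor.batch_decode(new_ids, skip_special_tokens=True)[0]
first_reply = "<reply generated by the model in turn 1>"  # placeholder only

# Turn 2: append the generated reply as history, then ask the follow-up.
messages.append(assistant_turn(first_reply))
messages.append(user_turn("What makes you think so?"))

print([m["role"] for m in messages])
print(count_images(messages))
```

If this is right, the assistant text in the model card example is just earlier conversation history rather than something the model is forced to say, and each later call passes the whole list again together with all five images.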

Also, I have another question. Is there a prompt to prevent the model from using specific words? I’ve tried using phrases like ‘do not use these terms: aaa, bbb, ccc’, but the model keeps outputting those words. It seems like the model isn’t recognizing the negative instruction well, and I’m not sure how to make it adhere to this rule.
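
Since prompting alone is not working, one option I am considering is blocking the words at decoding time. As far as I understand, `generate` in transformers accepts a `bad_words_ids` argument (a list of token-id sequences that must not be produced). The sketch below builds that list. `build_bad_words_ids` is my own helper, the leading-space variant is an assumption about how the tokenizer splits words in the middle of a sentence, and I have not verified any of this against Idefics2.

```python
def build_bad_words_ids(tokenizer, words):
    """Return token-id sequences for each banned word.

    Each word is tokenized both bare and with a leading space, on the
    assumption that the tokenizer may encode the two forms differently.
    """
    variants = []
    for word in words:
        variants.extend([word, " " + word])
    encoded = tokenizer(variants, add_special_tokens=False).input_ids
    # Drop empty and duplicate sequences while keeping the original order.
    unique = []
    for ids in encoded:
        ids = list(ids)
        if ids and ids not in unique:
            unique.append(ids)
    return unique


# Intended use with the objects from the snippet above (not run here):
#   bad_words_ids = build_bad_words_ids(processor.tokenizer, ["aaa", "bbb", "ccc"])
#   generated_ids = model.generate(
#       **inputs, max_new_tokens=500, bad_words_ids=bad_words_ids
#   )
```

This would only block the exact token sequences, so capitalized or inflected forms of a word would need to be listed separately.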

Additionally, how should I provide a system prompt?
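
What I have in mind is something like the sketch below, which prepends a message with a "system" role in the same content format as the other turns. `with_system_prompt` is a hypothetical helper of mine, the prompt text is just an example, and I do not know whether the Idefics2 chat template treats a "system" role in any special way, which is part of what I am asking.

```python
def with_system_prompt(messages, system_text):
    """Return a new message list with a system message placed first."""
    system_message = {
        "role": "system",
        "content": [{"type": "text", "text": system_text}],
    }
    return [system_message] + list(messages)


conversation = [
    {
        "role": "user",
        "content": [
            {"type": "image"},
            {"type": "text", "text": "What do we see in this image?"},
        ],
    },
]
conversation = with_system_prompt(conversation, "Answer in one short sentence.")
print([m["role"] for m in conversation])
```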

To summarize,

  1. How to perform multi-turn inference and format the messages (chat template).
  2. How to prevent the model from using specific unwanted words in its responses.
  3. How to change or add a system prompt.