Idefics2 multi turn inference

beenblue · August 5, 2024, 9:11am

Hi, I’m running inference with the idefics2 model. I’m following the example from the model page linked here (HuggingFaceM4/idefics2-8b · Hugging Face), but there’s a part I don’t understand.

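import torch
from transformers import AutoProcessor, AutoModelForVision2Seq

# Use the GPU when one is available; otherwise fall back to the CPU.
DEVICE = "cuda:0" if torch.cuda.is_available() else "cpu"

# image1 and image2 are assumed to be PIL images loaded beforehand.
processor = AutoProcessor.from_pretrained("HuggingFaceM4/idefics2-8b")
model = AutoModelForVision2Seq.from_pretrained(
    "HuggingFaceM4/idefics2-8b",
).to(DEVICE)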

# Create inputs
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image"},
            {"type": "text", "text": "What do we see in this image?"},
        ]
    },
    {
        "role": "assistant",
        "content": [
            {"type": "text", "text": "In this image, we can see the city of New York, and more specifically the Statue of Liberty."},
        ]
    },
    {
        "role": "user",
        "content": [
            {"type": "image"},
            {"type": "text", "text": "And how about this image?"},
        ]
    },
]
prompt = processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = processor(text=prompt, images=[image1, image2], return_tensors="pt")
inputs = {k: v.to(DEVICE) for k, v in inputs.items()}


# Generate
generated_ids = model.generate(**inputs, max_new_tokens=500)
generated_texts = processor.batch_decode(generated_ids, skip_special_tokens=True)

print(generated_texts)
# ['User: What do we see in this image? \nAssistant: In this image, we can see the city of New York, and more specifically the Statue of Liberty. \nUser: And how about this image? \nAssistant: In this image we can see buildings, trees, lights, water and sky.']

The code above includes the assistant’s response in the message. Isn’t this pre-determining what the assistant will say? The assistant’s response should come after the model has analyzed the image. Shouldn’t the assistant generate the response automatically instead of me providing it?
I have five images taken at the same location, and I want to input all these images in the first turn. For example, I want to ask:

What type of place do you think this is?

Based on the assistant’s response, I want to ask follow-up questions in a multi-turn conversation. My example might not be the best fit for multi-turn, but ultimately, I want to know how to run multi-turn inference.
How should I format the messages for this? Can anyone help?
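For anyone reading along, here is a minimal sketch of how multi-turn inference could be structured. The assistant message in the snippet above is earlier conversation history, not a forced answer: the model only generates the text that follows the final `Assistant:` prompt. In your own conversation you would start with a single user turn, let the model generate, append its reply as an assistant turn, and then add the next user turn. The helper names `user_turn`, `assistant_turn`, and `ask` are hypothetical and not part of transformers; `ask` assumes that `generate` returns the prompt tokens followed by the new tokens, which matches the output printed above.

```python
def user_turn(text, num_images=0):
    """Build one user message: image placeholders first, then the question."""
    content = [{"type": "image"} for _ in range(num_images)]
    content.append({"type": "text", "text": text})
    return {"role": "user", "content": content}


def assistant_turn(text):
    """Wrap a reply the model already generated so it becomes chat history."""
    return {"role": "assistant", "content": [{"type": "text", "text": text}]}


def ask(model, processor, messages, images, device, max_new_tokens=500):
    """Generate one reply for the conversation so far and append it to messages.

    `images` must hold one image per {"type": "image"} placeholder found
    anywhere in `messages`, in order, on every call.
    """
    prompt = processor.apply_chat_template(messages, add_generation_prompt=True)
    inputs = processor(text=prompt, images=images, return_tensors="pt")
    inputs = {k: v.to(device) for k, v in inputs.items()}
    generated_ids = model.generate(**inputs, max_new_tokens=max_new_tokens)
    # Assumption: the output starts with the prompt tokens, so slice them off
    # and decode only the newly generated reply.
    new_ids = generated_ids[:, inputs["input_ids"].shape[1]:]
    reply = processor.batch_decode(new_ids, skip_special_tokens=True)[0].strip()
    messages.append(assistant_turn(reply))
    return reply


# Usage sketch (needs the model, processor, DEVICE and five loaded PIL images):
# images = [img1, img2, img3, img4, img5]
# messages = [user_turn("What type of place do you think this is?", num_images=5)]
# first_reply = ask(model, processor, messages, images, DEVICE)
# messages.append(user_turn("Your follow-up question goes here."))
# second_reply = ask(model, processor, messages, images, DEVICE)
```

The same five images are passed on every call because the rendered prompt still contains the five image placeholders from the first turn.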

Also, I have another question. Is there a prompt to prevent the model from using specific words? I’ve tried using phrases like ‘do not use these terms: aaa, bbb, ccc’, but the model keeps outputting those words. It seems like the model isn’t recognizing the negative instruction well, and I’m not sure how to make it adhere to this rule.
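One option that does not rely on prompting is the `bad_words_ids` argument of `generate`, which blocks specific token sequences during decoding. Below is a rough sketch; `build_bad_words_ids` is a hypothetical helper, not a transformers function, and the set of variants it tries (leading space, capitalized) is only a guess at how the words might be tokenized in context. It blocks exact token sequences, so other spellings or inflections of a word can still appear.

```python
def build_bad_words_ids(tokenizer, words):
    """Tokenize each banned word in a few surface forms for `bad_words_ids`."""
    variants = []
    for word in words:
        # Assumption: the same word can tokenize differently with a leading
        # space or a capital letter, so each form is listed separately.
        for form in (word, " " + word, word.capitalize(), " " + word.capitalize()):
            if form not in variants:
                variants.append(form)
    token_ids = tokenizer(variants, add_special_tokens=False)["input_ids"]
    # Keep each non-empty token sequence once.
    unique = []
    for ids in token_ids:
        ids = list(ids)
        if ids and ids not in unique:
            unique.append(ids)
    return unique


# Usage sketch (needs the processor, model and inputs from the snippet above):
# banned = build_bad_words_ids(processor.tokenizer, ["aaa", "bbb", "ccc"])
# generated_ids = model.generate(**inputs, max_new_tokens=500, bad_words_ids=banned)
```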

Additionally, how should I provide a system prompt?
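The example above only shows user and assistant turns, and I am not certain the model was trained with a dedicated system role. A conservative sketch, under that assumption, is to fold the instructions into the text of the first user turn rather than adding a separate system message. `with_instructions` is a hypothetical helper written for this illustration.

```python
import copy


def with_instructions(messages, instructions):
    """Return a copy of messages with instructions prepended to the first user text."""
    updated = copy.deepcopy(messages)  # leave the caller's list untouched
    for message in updated:
        if message["role"] != "user":
            continue
        for part in message["content"]:
            if part["type"] == "text":
                part["text"] = instructions + "\n\n" + part["text"]
                return updated
        # The first user turn holds only images: add the instructions as text.
        message["content"].append({"type": "text", "text": instructions})
        return updated
    raise ValueError("messages contains no user turn")


# Usage sketch:
# messages = with_instructions(messages, "Answer in one short sentence.")
# prompt = processor.apply_chat_template(messages, add_generation_prompt=True)
```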

To summarize:

1. How to perform multi-turn inference and format the messages (chat template).
2. How to prevent the model from using specific unwanted words in its responses.
3. How to change or add a system prompt.
