I used apply_chat_template for a text-only model:
tokens = tokenizer.apply_chat_template(messages, add_generation_prompt=True)
tokens_in = len(tokens)
But the code fails when a message has a content list:
[
    {'role': 'system', 'content': [{'type': 'text', 'text': 'You are assistant'}]},
    {'role': 'user', 'content': [{'type': 'text', 'text': '1'}]},
]
I need to count input tokens for images and text.
Hi! What model are you using for the chat template?
Not all vision LLMs currently support chat templates. The message format in your example should work for Idefics models. If you want to add a chat template for another VLM, you can write a Jinja template yourself and pass it as below, where chat_template is your own Jinja template:
tokens = tokenizer.apply_chat_template(messages, chat_template=chat_template, add_generation_prompt=True)
For example, here is the template used for Idefics, but make sure to use the chat format your model was trained on for better performance.
Also see the docs on writing your own template.
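As a minimal sketch (this is just an illustration, not the actual Idefics template): the template below assumes messages use the content-list format from the first post, renders the text parts, and emits a hypothetical <image> placeholder for image parts.

chat_template = (
    "{% for message in messages %}"
    "{{ message['role'] | capitalize }}: "
    "{% for item in message['content'] %}"
    "{% if item['type'] == 'text' %}{{ item['text'] }}"
    "{% elif item['type'] == 'image' or item['type'] == 'image_url' %}<image>"
    "{% endif %}"
    "{% endfor %}\n"
    "{% endfor %}"
    "{% if add_generation_prompt %}Assistant: {% endif %}"
)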
I accidentally used the wrong tokenizer; I'm using https://huggingface.co/xtuner/llava-llama-3-8b-v1_1-transformers/tree/main
The template is working, but now I have another issue. For a 280x420 image and a small prompt I get 96000 tokens (of course that is more than the context length, but generation works). I need to price requests according to the number of tokens, but I don't understand how the image is processed; I saw that OpenAI has per-tile pricing for images.
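(My assumption, for what it's worth: if the vision tower is CLIP ViT-L/14 at 336px, as in LLaVA-1.5-style models, every image is resized to a fixed 336x336 before being split into patches, so the visual token count per image should be constant and tile-based pricing like OpenAI's would not apply:)

# Sketch, assuming a CLIP ViT-L/14-336 vision tower (LLaVA-1.5 style):
# the image is resized to 336x336 and split into 14x14 patches,
# independent of the original resolution (e.g. 280x420).
patch_size = 14
image_size = 336
visual_tokens_per_image = (image_size // patch_size) ** 2  # 24 * 24 = 576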
Hello @maria-ai ,
How did you manage to count the tokens you give to your model?
Thank you
I fixed it. Now I count it with:
import copy

import transformers


def num_tokens_from_messages_local(messages, tokenizer_model):
    visual_tokens = 0
    messages_to_count = copy.deepcopy(messages)
    for m in messages_to_count:
        if isinstance(m['content'], list):
            # Flatten the content list into plain text so apply_chat_template
            # accepts it, and count image tokens separately.
            set_text = ''
            for m_content in m['content']:
                if m_content['type'] == 'text':
                    set_text += m_content['text']
                elif m_content['type'] == 'image_url':
                    visual_tokens += 576  # FIXME: magic number (visual tokens per image)
            m['content'] = set_text
    tokenizer = transformers.AutoTokenizer.from_pretrained(tokenizer_model)
    tokens = tokenizer.apply_chat_template(messages_to_count, add_generation_prompt=True)
    tokens_len = len(tokens)
    # print("TOKENS_LEN", tokens_len)
    return tokens_len + visual_tokens
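For example (a hypothetical call, with a made-up image URL):

messages = [
    {'role': 'system', 'content': [{'type': 'text', 'text': 'You are assistant'}]},
    {'role': 'user', 'content': [
        {'type': 'text', 'text': '1'},
        {'type': 'image_url', 'image_url': {'url': 'https://example.com/cat.png'}},
    ]},
]
total = num_tokens_from_messages_local(messages, 'xtuner/llava-llama-3-8b-v1_1-transformers')
# total = chat-template text tokens + 576 visual tokens for the one image

The 576 assumes a fixed-size image encoding as above; if the model you price against processes images differently (e.g. multiple tiles per image), that constant needs to change.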