What loss type should be used to train vision-llm with auto regression style?

  • The standard choice is the cross-entropy loss.
    It computes the negative log likelihood of the next token (or visual/text embedding) given all of the previous tokens in the sequence; see the first sketch after this list.

  • You can also combine losses across the two modalities.
    For vision inputs, images are typically split into patches and encoded into embeddings (e.g., with a Vision Transformer (ViT)) or discretized into visual tokens, while text is tokenized with a standard tokenizer such as BPE or WordPiece. If both modalities end up as discrete tokens, a single cross-entropy over the whole sequence is enough; if the visual side stays continuous, you can combine a text cross-entropy with an auxiliary loss on the visual embeddings, as in the second sketch after this list.
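
The first sketch illustrates the plain next-token cross-entropy from the first point. The tensor shapes and vocabulary size are made-up placeholders, not taken from any particular model.

```python
import torch
import torch.nn.functional as F

# Placeholder tensors (hypothetical batch of 2, sequence length 16, vocab size 32000):
# logits would come from the model, token_ids is the interleaved vision/text token sequence.
logits = torch.randn(2, 16, 32000)
token_ids = torch.randint(0, 32000, (2, 16))

# Shift so that the prediction at position t is scored against the token at t+1.
shift_logits = logits[:, :-1, :]
shift_labels = token_ids[:, 1:]

# Cross entropy = negative log likelihood of the next token given the prefix.
loss = F.cross_entropy(
    shift_logits.reshape(-1, shift_logits.size(-1)),
    shift_labels.reshape(-1),
)
```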

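The second sketch shows one way to read the "combined" objective from the second point, assuming the visual side stays continuous: a weighted sum of a text cross-entropy and a regression loss on predicted visual embeddings. The tensors and the weight `alpha` are hypothetical placeholders; this is only meant to show the shape of such a combination.

```python
import torch
import torch.nn.functional as F

alpha = 0.5  # hypothetical weight balancing the two terms

# Placeholder tensors standing in for real model outputs and targets.
text_logits = torch.randn(2, 15, 32000)        # next-token logits at text positions
text_targets = torch.randint(0, 32000, (2, 15))
pred_vis_emb = torch.randn(2, 8, 768)          # predicted visual embeddings
target_vis_emb = torch.randn(2, 8, 768)        # embeddings produced by the vision encoder

# Cross entropy over the discrete text tokens.
text_loss = F.cross_entropy(
    text_logits.reshape(-1, text_logits.size(-1)),
    text_targets.reshape(-1),
)

# Regression loss (here MSE) over the continuous visual embeddings.
vision_loss = F.mse_loss(pred_vis_emb, target_vis_emb)

# Weighted sum of the two objectives.
loss = text_loss + alpha * vision_loss
```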
If you’re using Hugging Face’s transformers library for a Vision-LLM:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

# Load model and tokenizer ("model_name" is a placeholder checkpoint)
model = AutoModelForCausalLM.from_pretrained("model_name")
tokenizer = AutoTokenizer.from_pretrained("model_name")

# Prepare inputs (vision + text tokens, already tokenized into one sequence)
input_ids = ...  # Combine vision and text tokens into a (batch, seq_len) tensor
labels = input_ids.clone()  # Labels equal the inputs; the model shifts them internally for next-token prediction

# Forward pass computes the autoregressive causal LM (cross-entropy) loss
outputs = model(input_ids=input_ids, labels=labels)
loss = outputs.loss
```
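
One practical detail: positions in `labels` set to `-100` are ignored by the loss in Hugging Face causal LM models, so if you want the cross entropy computed only over the text part of the sequence you can mask the vision-token positions that way.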