What loss type should be used to train a vision-LLM in an autoregressive style?
The standard choice is the cross-entropy loss. It computes the negative log-likelihood of the next token (visual or text) given all the previous tokens in the sequence.
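Here's a minimal sketch of that next-token cross-entropy in plain PyTorch (the batch size, sequence length, and vocabulary size are made-up values):

```python
import torch
import torch.nn.functional as F

# Toy shapes: batch of 2 sequences, 10 positions, 32k-token vocabulary (assumed values)
logits = torch.randn(2, 10, 32000)             # model outputs: one distribution per position
token_ids = torch.randint(0, 32000, (2, 10))   # ground-truth sequence (vision + text token ids)

# Shift so that position t predicts token t+1
shift_logits = logits[:, :-1, :]
shift_targets = token_ids[:, 1:]

# Cross entropy = negative log-likelihood of the next token
loss = F.cross_entropy(
    shift_logits.reshape(-1, shift_logits.size(-1)),
    shift_targets.reshape(-1),
)
```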
You can also combine per-modality losses. Images are typically tokenized into patches (e.g., with a Vision Transformer (ViT)) or encoded into embeddings, while text is tokenized with a standard tokenizer such as BPE or WordPiece, so you can compute a cross-entropy term for each stream and weight or sum them, as in the sketch below.
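A sketch of that combination, assuming a boolean `is_image` mask marking which positions hold image tokens (the mask name and the per-modality weights are illustrative, and both modalities are assumed present in the batch):

```python
import torch
import torch.nn.functional as F

def combined_loss(logits, targets, is_image, image_weight=1.0, text_weight=1.0):
    # Per-position next-token NLL, shifted so position t predicts token t+1
    per_token = F.cross_entropy(
        logits[:, :-1].reshape(-1, logits.size(-1)),
        targets[:, 1:].reshape(-1),
        reduction="none",
    ).view(targets.size(0), -1)

    # Modality of each *predicted* token, then one mean loss per stream
    mask = is_image[:, 1:]
    image_loss = per_token[mask].mean()
    text_loss = per_token[~mask].mean()
    return image_weight * image_loss + text_weight * text_loss
```

Separate weights let you rebalance the two streams when one modality dominates the sequence length; with both weights at 1.0 this reduces to (approximately) the single cross-entropy over the whole sequence.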
If you're using Hugging Face's transformers library for a Vision-LLM:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

# Load model and tokenizer
model = AutoModelForCausalLM.from_pretrained("model_name")
tokenizer = AutoTokenizer.from_pretrained("model_name")

# Prepare inputs (vision + text tokens, already tokenized)
input_ids = ...  # combine vision and text token ids into one sequence
labels = input_ids.clone()  # same as the inputs; the model shifts them internally for next-token prediction

# Forward pass
outputs = model(input_ids=input_ids, labels=labels)
loss = outputs.loss  # autoregressive causal-LM (cross-entropy) loss
```
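One note: positions you don't want to supervise (padding, or the image prefix if you only train on the text response) can be set to -100 in `labels`; the causal-LM cross-entropy in transformers ignores that index.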