What loss type should be used to train a vision-LLM in an autoregressive style?
The standard choice is the cross-entropy loss. It computes the negative log-likelihood of the next token (visual or text) given all the previous tokens in the sequence.
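Here's a minimal sketch of that next-token cross-entropy in plain PyTorch (the batch size, sequence length, and vocabulary size are made-up values):

```python
import torch
import torch.nn.functional as F

# Toy shapes: batch of 2 sequences, 10 positions, 32k-token vocabulary (assumed values)
logits = torch.randn(2, 10, 32000)             # model outputs: one distribution per position
token_ids = torch.randint(0, 32000, (2, 10))   # ground-truth sequence (vision + text token ids)

# Shift so that position t predicts token t+1
shift_logits = logits[:, :-1, :]
shift_targets = token_ids[:, 1:]

# Cross entropy = negative log-likelihood of the next token
loss = F.cross_entropy(
    shift_logits.reshape(-1, shift_logits.size(-1)),
    shift_targets.reshape(-1),
)
```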
You can also combine per-modality losses. Images are typically tokenized into patches (e.g., with a Vision Transformer (ViT)) or encoded into embeddings, while text is tokenized with a standard tokenizer such as BPE or WordPiece, so you can compute a cross-entropy term for each stream and weight or sum them, as in the sketch below.
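A sketch of that combination, assuming a boolean `is_image` mask marking which positions hold image tokens (the mask name and the per-modality weights are illustrative, and both modalities are assumed present in the batch):

```python
import torch
import torch.nn.functional as F

def combined_loss(logits, targets, is_image, image_weight=1.0, text_weight=1.0):
    # Per-position next-token NLL, shifted so position t predicts token t+1
    per_token = F.cross_entropy(
        logits[:, :-1].reshape(-1, logits.size(-1)),
        targets[:, 1:].reshape(-1),
        reduction="none",
    ).view(targets.size(0), -1)

    # Modality of each *predicted* token, then one mean loss per stream
    mask = is_image[:, 1:]
    image_loss = per_token[mask].mean()
    text_loss = per_token[~mask].mean()
    return image_weight * image_loss + text_weight * text_loss
```

Separate weights let you rebalance the two streams when one modality dominates the sequence length; with both weights at 1.0 this reduces to (approximately) the single cross-entropy over the whole sequence.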
If you're using Hugging Face's transformers library for a Vision-LLM:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

# Load model and tokenizer
model = AutoModelForCausalLM.from_pretrained("model_name")
tokenizer = AutoTokenizer.from_pretrained("model_name")

# Prepare inputs (vision + text tokens, already tokenized)
input_ids = ...  # combine vision and text token ids into one sequence
labels = input_ids.clone()  # same as the inputs; the model shifts them internally for next-token prediction

# Forward pass
outputs = model(input_ids=input_ids, labels=labels)
loss = outputs.loss  # autoregressive causal-LM (cross-entropy) loss
```
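One note: positions you don't want to supervise (padding, or the image prefix if you only train on the text response) can be set to -100 in `labels`; the causal-LM cross-entropy in transformers ignores that index.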