Issues when fine-tuning Llama-3.2-11B-Vision

My goal is to fine-tune meta-llama/Llama-3.2-11B-Vision so that it can recognize specific brand names in images. I have a large training set (~30,000 examples) with both positive and negative examples.

I successfully fine-tuned the model using LoRA on an A100 GPU with 4-bit quantization. For inference I load the base model and then apply the fine-tuned adapter weights with PeftModel, roughly as in the sketch below.
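(The adapter path and quantization settings here are placeholders rather than my exact values; this is just the standard transformers + peft loading pattern I'm using.)

```python
import torch
from transformers import AutoProcessor, BitsAndBytesConfig, MllamaForConditionalGeneration
from peft import PeftModel

base_id = "meta-llama/Llama-3.2-11B-Vision"
adapter_path = "path/to/my-lora-adapter"  # placeholder for the fine-tuned adapter

# 4-bit quantization config (illustrative values)
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

processor = AutoProcessor.from_pretrained(base_id)
tokenizer = processor.tokenizer

# Load the quantized base model, then attach the LoRA adapter on top
model = MllamaForConditionalGeneration.from_pretrained(
    base_id,
    quantization_config=bnb_config,
    device_map="auto",
)
model = PeftModel.from_pretrained(model, adapter_path)
model.eval()
```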

I’ve run into two problems:

  1. When I use the model for inference, I get the correct answer, but with extra tokens and weird repetition. For example, it might output something like:

What brand is represented in this image? Respond with only the official brand name, or 'no brand' if none is present. <OCR/> SOME BRAND NAME. - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -

SOME BRAND NAME is the correct answer, but the model emits that weird <OCR/> tag before it and a row of dashes afterwards.

  2. I tried setting repetition_penalty=1.2 in the generate() call, but then I get a CUDA crash:
RuntimeError                              Traceback (most recent call last)
<ipython-input-9-89460788cfae> in <cell line: 0>()
      3 print("Max input ID:", inputs["input_ids"].max().item())
      4 print("Tokenizer vocab size:", tokenizer.vocab_size)
----> 5 generated_ids = model.generate(
      6     **inputs,
      7     max_new_tokens=100,

... 5 frames ...

/usr/local/lib/python3.11/dist-packages/transformers/generation/logits_process.py in __call__(self, input_ids, scores)
    338
    339     # if score < 0 then repetition penalty has to be multiplied to reduce the token probabilities
--> 340     score = torch.where(score < 0, score * self.penalty, score / self.penalty)
    341
    342     scores_processed = scores.scatter(1, input_ids, score)

RuntimeError: CUDA error: device-side assert triggered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
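(Side note: re-running with the environment variable the error message mentions should make the failing kernel report synchronously; it has to be set before CUDA is initialized, e.g. at the top of the notebook.)

```python
import os

# Set before torch touches the GPU so CUDA errors are reported at the call
# that actually triggered them (as the error message itself suggests).
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"
```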

Does anyone have any idea what might be causing the weird tokens and repetition in the output? And why does CUDA crash when I try to use repetition_penalty?


One other data point. For some images I get output like:

THE BEST BRAND. <OCR/> THE BEST BRAND. BRAND. BRAND. BRAND. BRAND. BRAND. BRAND. BRAND. BRAND. BRAND. BRAND. BRAND. BRAND. BRAND. BRAND. BRAND. BRAND. BRAND. BRAND. BRAND.

Once again, the output ("THE BEST BRAND") is correct, but the <OCR/> tag is unexpected, as is the repetition after the answer.


I should also mention that my research suggests the CUDA crash is caused by the model using token IDs beyond what the tokenizer knows about:

inputs["input_ids"].max().item() = 128256
tokenizer.vocab_size = 128000

However, I'm not sure why this causes the failure or how to fix it.
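For reference, this is roughly the check that produced those numbers (a sketch; inputs is the processor output for one image + prompt, and I'm assuming the text vocab size lives under config.text_config):

```python
# Compare the token IDs fed to generate() against what the tokenizer and model report.
max_id = inputs["input_ids"].max().item()

print("Max input ID:           ", max_id)                     # 128256 in my case
print("tokenizer.vocab_size:   ", tokenizer.vocab_size)       # 128000 (base vocab, excludes added special tokens)
print("len(tokenizer):         ", len(tokenizer))             # base vocab + added special tokens
print("text_config.vocab_size: ", model.config.text_config.vocab_size)  # assumed attribute path

# The repetition penalty processor does scores.scatter(1, input_ids, score)
# (line 342 in the traceback above). If any input ID is >= scores.shape[-1],
# that scatter indexes out of range, which on CUDA surfaces as a
# device-side assert rather than a readable IndexError.
```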


Errors related to repetition_penalty seem to be a known constraint of LLaVA or Llama Vision models.

And perhaps:

you can use return_full_text=False
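If you're calling model.generate() directly rather than a pipeline, the rough equivalent of return_full_text=False is to decode only the tokens after the prompt (a sketch, reusing the model, processor, and inputs names from above):

```python
generated_ids = model.generate(**inputs, max_new_tokens=100)

# Drop the echoed prompt and keep only the newly generated tokens;
# this is the manual equivalent of return_full_text=False.
prompt_len = inputs["input_ids"].shape[1]
answer = processor.batch_decode(
    generated_ids[:, prompt_len:], skip_special_tokens=True
)[0]
print(answer)
```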