My goal is to fine-tune meta-llama/Llama-3.2-11B-Vision so that it can recognize specific brand names in images. I have a large training set of ~30,000 examples, with both positives and negatives.
I successfully fine-tuned the model using LoRA on an A100 GPU with 4-bit quantization. For inference, I load the base model and then apply the fine-tuned adapter weights with PeftModel.
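For reference, this is roughly how I load everything (the adapter path is a placeholder and some kwargs may differ slightly from my actual notebook):

```python
import torch
from transformers import AutoProcessor, BitsAndBytesConfig, MllamaForConditionalGeneration
from peft import PeftModel

# 4-bit quantization config
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

# Load the base vision-language model
base_model = MllamaForConditionalGeneration.from_pretrained(
    "meta-llama/Llama-3.2-11B-Vision",
    quantization_config=bnb_config,
    device_map="auto",
)

# Apply the LoRA adapter produced by fine-tuning (path is a placeholder)
model = PeftModel.from_pretrained(base_model, "path/to/my-lora-adapter")

processor = AutoProcessor.from_pretrained("meta-llama/Llama-3.2-11B-Vision")
tokenizer = processor.tokenizer
```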
I've run into two problems:
When I run inference, the model produces the correct answer, but wrapped in stray tokens and repetition. For example, it might output something like:
What brand is represented in this image? Respond with only the official brand name, or 'no brand' if none is present. <OCR/> SOME BRAND NAME. - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
SOME BRAND NAME is the correct answer, but the model also emits a strange <OCR/> tag before it and a long run of dashes afterwards.
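For context, here's roughly how I run inference (the image path and prompt construction are simplified from my notebook):

```python
from PIL import Image

image = Image.open("example.jpg")  # placeholder image path
prompt = (
    "<|image|><|begin_of_text|>What brand is represented in this image? "
    "Respond with only the official brand name, or 'no brand' if none is present."
)

# Prepare image + text inputs and move them to the model's device
inputs = processor(images=image, text=prompt, return_tensors="pt").to(model.device)

generated_ids = model.generate(**inputs, max_new_tokens=100)
print(processor.decode(generated_ids[0], skip_special_tokens=True))
```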
I tried setting repetition_penalty=1.2 in the generate() call, but then I get a CUDA crash.
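The failing cell is essentially the same generate() call with repetition_penalty added; the two prints are sanity checks comparing the max input id to the tokenizer vocab size:

```python
# Same inputs as before; the prints compare the max input id to the vocab size
print("Max input ID:", inputs["input_ids"].max().item())
print("Tokenizer vocab size:", tokenizer.vocab_size)

generated_ids = model.generate(
    **inputs,
    max_new_tokens=100,
    repetition_penalty=1.2,
)
```

Here is the traceback: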
RuntimeError                              Traceback (most recent call last)
<ipython-input-9-89460788cfae> in <cell line: 0>()
      3 print("Max input ID:", inputs["input_ids"].max().item())
      4 print("Tokenizer vocab size:", tokenizer.vocab_size)
----> 5 generated_ids = model.generate(
      6     **inputs,
      7     max_new_tokens=100,

5 frames

/usr/local/lib/python3.11/dist-packages/transformers/generation/logits_process.py in __call__(self, input_ids, scores)
    338
    339     # if score < 0 then repetition penalty has to be multiplied to reduce the token probabilities
--> 340     score = torch.where(score < 0, score * self.penalty, score / self.penalty)
    341
    342     scores_processed = scores.scatter(1, input_ids, score)

RuntimeError: CUDA error: device-side assert triggered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
Does anyone have any idea what might be causing the stray tokens and repetition in the output? And why does CUDA crash when I try to use repetition_penalty?