My goal is to fine-tune meta-llama/Llama-3.2-11B-Vision so that it can recognize specific brand names in images. I have a large training set of ~30,000 examples, with both positives and negatives.
I successfully fine-tuned the model using LoRA on an A100 GPU with 4-bit quantization. For inference, I load the base model and then apply the fine-tuned adapter weights with PeftModel.
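For reference, this is roughly how I load everything (the adapter path is a placeholder and some kwargs may differ slightly from my actual notebook):

```python
import torch
from transformers import AutoProcessor, BitsAndBytesConfig, MllamaForConditionalGeneration
from peft import PeftModel

# 4-bit quantization config
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

# Load the base vision-language model
base_model = MllamaForConditionalGeneration.from_pretrained(
    "meta-llama/Llama-3.2-11B-Vision",
    quantization_config=bnb_config,
    device_map="auto",
)

# Apply the LoRA adapter produced by fine-tuning (path is a placeholder)
model = PeftModel.from_pretrained(base_model, "path/to/my-lora-adapter")

processor = AutoProcessor.from_pretrained("meta-llama/Llama-3.2-11B-Vision")
tokenizer = processor.tokenizer
```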
I've run into two problems:
When I run inference, the model produces the correct answer, but wrapped in stray tokens and repetition. For example, it might output something like:
What brand is represented in this image? Respond with only the official brand name, or 'no brand' if none is present. <OCR/> SOME BRAND NAME. - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
SOME BRAND NAME is the correct answer, but the model also emits a strange <OCR/> tag before it and a long run of dashes afterwards.
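For context, here's roughly how I run inference (the image path and prompt construction are simplified from my notebook):

```python
from PIL import Image

image = Image.open("example.jpg")  # placeholder image path
prompt = (
    "<|image|><|begin_of_text|>What brand is represented in this image? "
    "Respond with only the official brand name, or 'no brand' if none is present."
)

# Prepare image + text inputs and move them to the model's device
inputs = processor(images=image, text=prompt, return_tensors="pt").to(model.device)

generated_ids = model.generate(**inputs, max_new_tokens=100)
print(processor.decode(generated_ids[0], skip_special_tokens=True))
```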
I tried setting repetition_penalty=1.2 in the generate() call, but then I get a CUDA crash.
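The failing cell is essentially the same generate() call with repetition_penalty added; the two prints are sanity checks comparing the max input id to the tokenizer vocab size:

```python
# Same inputs as before; the prints compare the max input id to the vocab size
print("Max input ID:", inputs["input_ids"].max().item())
print("Tokenizer vocab size:", tokenizer.vocab_size)

generated_ids = model.generate(
    **inputs,
    max_new_tokens=100,
    repetition_penalty=1.2,
)
```

Here is the traceback: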
RuntimeError                              Traceback (most recent call last)
<ipython-input-9-89460788cfae> in <cell line: 0>()
      3 print("Max input ID:", inputs["input_ids"].max().item())
      4 print("Tokenizer vocab size:", tokenizer.vocab_size)
----> 5 generated_ids = model.generate(
      6     **inputs,
      7     max_new_tokens=100,

5 frames

/usr/local/lib/python3.11/dist-packages/transformers/generation/logits_process.py in __call__(self, input_ids, scores)
    338
    339     # if score < 0 then repetition penalty has to be multiplied to reduce the token probabilities
--> 340     score = torch.where(score < 0, score * self.penalty, score / self.penalty)
    341
    342     scores_processed = scores.scatter(1, input_ids, score)

RuntimeError: CUDA error: device-side assert triggered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
Does anyone have any idea what might be causing the stray tokens and repetition in the output? And why does CUDA crash when I try to use repetition_penalty?