Hello,
I’m new to ML and this is probably a basic problem. I’m trying to fine-tune the Donut base model on my own documents, but I’m getting an error.
The error is:
```
TypeError: DonutSwinModel.forward() got an unexpected keyword argument 'input_ids'
```
I’m generating a dataset from my document images and an annotations.jsonl file with entries like:
```
{"label": "{\"load_id\": \"1234\", \"carrier_name\": \"Bison\"}", "image": "TOUR_LOGISTICS_0.png"}
```
My dataset has:
```
{
    "pixel_values": batch["pixel_values"],
    "decoder_input_ids": batch["decoder_input_ids"],
    "labels": batch["labels"]
}
```
Doesn’t the Trainer know which fields to use for the encoder and which for the decoder?
It may be a version-dependent bug in Transformers…
opened 08:07PM - 06 Feb 25 UTC
bug
### System Info
Reopening: https://github.com/huggingface/transformers/issues/35838
### Who can help?
_No response_
### Information
- [ ] The official example scripts
- [ ] My own modified scripts
### Tasks
- [ ] An officially supported task in the `examples` folder (such as GLUE/SQuAD, ...)
- [ ] My own task or dataset (give details below)
### Reproduction
see https://github.com/huggingface/transformers/issues/35838
### Expected behavior
see https://github.com/huggingface/transformers/issues/35838
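Either way, it is worth double-checking which version is actually loaded in the environment the training runs in:
```
import transformers
print(transformers.__version__)
```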
I downgraded to transformers==4.45.2 and it didn’t help.
Hmm… Maybe the wrong collate_fn is assigned…
Or maybe the model settings are wrong.
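If it helps, a collator for Donut (a VisionEncoderDecoderModel) usually only has to return pixel_values and labels. A minimal sketch, assuming processor is the DonutProcessor and each sample has an "image" and a "text" field (names and max_length are placeholders):
```
import torch

# Minimal sketch, not a drop-in fix: assumes `processor` is a DonutProcessor and
# each sample has "image" (a PIL image) and "text" (the target token sequence).
def collate_fn(samples):
    pixel_values = torch.stack(
        [processor(s["image"], return_tensors="pt").pixel_values.squeeze(0) for s in samples]
    )
    labels = processor.tokenizer(
        [s["text"] for s in samples],
        max_length=512,          # placeholder value
        padding="max_length",
        truncation=True,
        return_tensors="pt",
    ).input_ids
    labels[labels == processor.tokenizer.pad_token_id] = -100  # ignore padding in the loss
    return {"pixel_values": pixel_values, "labels": labels}
```
With a batch like this the model builds decoder_input_ids from labels on its own, so nothing called input_ids ever reaches the DonutSwin encoder.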
https://stackoverflow.com/questions/77182311/question-about-data-collator-throwing-a-key-error-in-hugging-face
opened 09:58PM - 27 Apr 23 UTC
closed 03:03PM - 07 Jul 23 UTC
Hello,
Thank you again for the fantastic work on this library and all the examples you are including!!
Big up @younesbelkada for all the support as well...
I have been trying to play around with BLIP2 and PEFT using the example notebook (https://colab.research.google.com/drive/16XbIysCzgpAld7Kd9-xz-23VPWmqdWmW?usp=sharing#scrollTo=6cCVhsmJxxjH) and a few things came up and I was hoping to get your help:
1. When trying to finetune with "salesforce/blip2-flan-t5-xl", I got a ton of issues:
```
from peft import LoraConfig

config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    bias="none",
    target_modules=["q_proj", "k_proj"],
)
```
The q_proj and k_proj layers don't exist and so I used "q","v" or tried to use just the default values and it made the loss converge to 0 extremely quickly. However, the model was really just outputting gibberish so I'm likely not using the right target_modules... How are you supposed to tweak this parameter? In general too, is there a heuristic for these such as T5 -> q,v , OPT -> q_proj,k_proj and is that different for the regular model vs BLIP2?
- I tried using a bigger OPT (i.e. "ybelkada/blip2-opt-2.7b-fp16-sharded") and that just made the loss train with "nan" all the time regardless of what I tried.
2. Something seemed really odd in the training loop, specifically: `outputs = model(input_ids=input_ids, pixel_values=pixel_values, labels=input_ids)`
- From my understanding, this would imply that we are already passing the label into the model that we want to predict as an input?
- I also tried to modify the notebook to go beyond just image captioning and try to train a VQA model by modifying the following:
```
import torch
from torch.utils.data import Dataset

class ImageCaptioningDataset(Dataset):
    def __init__(self, dataset, processor):
        self.dataset = dataset
        self.processor = processor

    def __len__(self):
        return len(self.dataset)

    def __getitem__(self, idx):
        item = self.dataset[idx]
        encoding = self.processor(images=item["image"], text=item["prompt"], padding="max_length", return_tensors="pt")
        # remove batch dimension
        encoding = {k: v.squeeze() for k, v in encoding.items()}
        encoding["text"] = item["text"]
        return encoding

def collate_fn(batch):
    # pad the input_ids and attention_mask
    processed_batch = {}
    for key in batch[0].keys():
        if key in ["pixel_values", "input_ids"]:
            processed_batch[key] = torch.stack([example[key] for example in batch])
        elif key == "text":
            text_inputs = processor.tokenizer(
                [example["text"] for example in batch], padding=True, return_tensors="pt"
            )
            processed_batch["input_ids_label"] = text_inputs["input_ids"]
            processed_batch["attention_mask_label"] = text_inputs["attention_mask"]
    return processed_batch

# inside the training loop:
input_ids = batch.pop("input_ids").to(device)
input_ids_label = batch.pop("input_ids_label").to(device)
pixel_values = batch.pop("pixel_values").to(device, torch.float16)

outputs = model(input_ids=input_ids,
                pixel_values=pixel_values,
                labels=input_ids_label)
```
But then it didn't really seem to converge as well as the regular image captioning despite always having the same prompt throughout my dataset... Anything I could be doing wrong?
Thanks in advance!
Doesn’t the Trainer know which fields to use for the encoder and which for the decoder?
By default it is manual and fixed: the Trainer doesn’t route fields itself, it just hands the collated batch to the model’s forward(), so the key names have to match that signature.
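Roughly what happens with each batch (a simplified illustration, not the Trainer’s actual source; pixel_values / labels stand for whatever the collator produced):
```
batch = {
    "pixel_values": pixel_values,  # goes to the DonutSwin image encoder
    "labels": labels,              # used by the text decoder to compute the loss
}
# Trainer essentially unpacks the collated batch into the model call:
outputs = model(**batch)
# so every key has to be an argument that VisionEncoderDecoderModel.forward() accepts
# (pixel_values, labels, decoder_input_ids, decoder_attention_mask, ...).
# Leftover keyword arguments are forwarded to the image encoder, which is how a stray
# "input_ids" key ends up raising the TypeError from the first post.
```
Note also that with the default remove_unused_columns=True, and when the training data is a datasets.Dataset, Trainer drops columns whose names don’t match forward()’s signature before the collator even sees them.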
LeyaLi
June 12, 2025, 8:05am
I ran into the same issue and was able to resolve it. It turns out to be related to the Donut model’s `MAX_TOKEN_LEN` setting. My code runs successfully when `MAX_TOKEN_LEN` is set to 128 or lower, but the bug reappears as soon as it exceeds 128.
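In case it helps anyone else, a rough sanity check for this kind of limit (a sketch with placeholder names; `MAX_TOKEN_LEN` is a user-side constant, not a library setting):
```
MAX_TOKEN_LEN = 128  # user-defined constant used when tokenizing the target sequences

labels = processor.tokenizer(
    target_text,
    max_length=MAX_TOKEN_LEN,
    padding="max_length",
    truncation=True,
    return_tensors="pt",
).input_ids

# The decoder cannot handle sequences longer than its position embeddings,
# so the label length has to stay within that limit.
assert MAX_TOKEN_LEN <= model.config.decoder.max_position_embeddings
```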