Hello,
I’m new to ML and this is probably a basic problem. I’m trying to fine-tune the Donut base model on my own documents, but I’m getting an error.
The error is:
```
TypeError: DonutSwinModel.forward() got an unexpected keyword argument 'input_ids'
```
I’m generating a dataset from my document images and an annotations.jsonl file with entries like:
```
{"label": "{\"load_id\": \"1234\", \"carrier_name\": \"Bison\"}", "image": "TOUR_LOGISTICS_0.png"}
```
My dataset has:
```
{
    "pixel_values": batch["pixel_values"],
    "decoder_input_ids": batch["decoder_input_ids"],
    "labels": batch["labels"]
}
```
Doesn’t the Trainer know which fields to use for the encoder and which for the decoder?
It may be a version-dependent bug in Transformers…
opened 08:07PM - 06 Feb 25 UTC
bug
### System Info
Reopening: https://github.com/huggingface/transformers/issues/35838
### Who can help?
_No response_
### Information
- [ ] The official example scripts
- [ ] My own modified scripts
### Tasks
- [ ] An officially supported task in the `examples` folder (such as GLUE/SQuAD, ...)
- [ ] My own task or dataset (give details below)
### Reproduction
see https://github.com/huggingface/transformers/issues/35838
### Expected behavior
see https://github.com/huggingface/transformers/issues/35838
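Either way, it is worth double-checking which version is actually loaded in the environment the training runs in:
```
import transformers
print(transformers.__version__)
```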
I downgraded to transformers==4.45.2 and it didn’t help.
Hmm… Maybe the wrong collate_fn is assigned…
Or maybe the model settings are wrong.
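If it helps, a collator for Donut (a VisionEncoderDecoderModel) usually only has to return pixel_values and labels. A minimal sketch, assuming processor is the DonutProcessor and each sample has an "image" and a "text" field (names and max_length are placeholders):
```
import torch

# Minimal sketch, not a drop-in fix: assumes `processor` is a DonutProcessor and
# each sample has "image" (a PIL image) and "text" (the target token sequence).
def collate_fn(samples):
    pixel_values = torch.stack(
        [processor(s["image"], return_tensors="pt").pixel_values.squeeze(0) for s in samples]
    )
    labels = processor.tokenizer(
        [s["text"] for s in samples],
        max_length=512,          # placeholder value
        padding="max_length",
        truncation=True,
        return_tensors="pt",
    ).input_ids
    labels[labels == processor.tokenizer.pad_token_id] = -100  # ignore padding in the loss
    return {"pixel_values": pixel_values, "labels": labels}
```
With a batch like this the model builds decoder_input_ids from labels on its own, so nothing called input_ids ever reaches the DonutSwin encoder.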
https://stackoverflow.com/questions/77182311/question-about-data-collator-throwing-a-key-error-in-hugging-face
opened 09:58PM - 27 Apr 23 UTC
closed 03:03PM - 07 Jul 23 UTC
Hello,
Thank you again for the fantastic work on this library and all the examples you are including!!
Big up @younesbelkada for all the support as well...
I have been trying to play around with BLIP2 and PEFT using the example notebook (https://colab.research.google.com/drive/16XbIysCzgpAld7Kd9-xz-23VPWmqdWmW?usp=sharing#scrollTo=6cCVhsmJxxjH) and a few things came up and I was hoping to get your help:
1. When trying to finetune with "salesforce/blip2-flan-t5-xl", I got a ton of issues:
```
from peft import LoraConfig

config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    bias="none",
    target_modules=["q_proj", "k_proj"],
)
```
The q_proj and k_proj layers don't exist and so I used "q","v" or tried to use just the default values and it made the loss converge to 0 extremely quickly. However, the model was really just outputting gibberish so I'm likely not using the right target_modules... How are you supposed to tweak this parameter? In general too, is there a heuristic for these such as T5 -> q,v , OPT -> q_proj,k_proj and is that different for the regular model vs BLIP2?
- I tried using a bigger OPT (i.e. "ybelkada/blip2-opt-2.7b-fp16-sharded") and that just made the loss train with "nan" all the time regardless of what I tried.
2. Something seemed really odd in the training loop, specifically: `outputs = model(input_ids=input_ids, pixel_values=pixel_values, labels=input_ids)`
- From my understanding, this would imply that we are already passing the label into the model that we want to predict as an input?
- I also tried to modify the notebook to go beyond just image captioning and try to train a VQA model by modifying the following:
```
import torch
from torch.utils.data import Dataset

class ImageCaptioningDataset(Dataset):
    def __init__(self, dataset, processor):
        self.dataset = dataset
        self.processor = processor

    def __len__(self):
        return len(self.dataset)

    def __getitem__(self, idx):
        item = self.dataset[idx]
        encoding = self.processor(images=item["image"], text=item["prompt"], padding="max_length", return_tensors="pt")
        # remove batch dimension
        encoding = {k: v.squeeze() for k, v in encoding.items()}
        encoding["text"] = item["text"]
        return encoding

def collate_fn(batch):
    # pad the input_ids and attention_mask
    processed_batch = {}
    for key in batch[0].keys():
        if key in ["pixel_values", "input_ids"]:
            processed_batch[key] = torch.stack([example[key] for example in batch])
        elif key == "text":
            text_inputs = processor.tokenizer(
                [example["text"] for example in batch], padding=True, return_tensors="pt"
            )
            processed_batch["input_ids_label"] = text_inputs["input_ids"]
            processed_batch["attention_mask_label"] = text_inputs["attention_mask"]
    return processed_batch

# inside the training loop:
input_ids = batch.pop("input_ids").to(device)
input_ids_label = batch.pop("input_ids_label").to(device)
pixel_values = batch.pop("pixel_values").to(device, torch.float16)

outputs = model(input_ids=input_ids,
                pixel_values=pixel_values,
                labels=input_ids_label)
```
But then it didn't really seem to converge as well as the regular image captioning despite always having the same prompt throughout my dataset... Anything I could be doing wrong?
Thanks in advance!
Doesn’t the Trainer know which fields to use for the encoder and which for the decoder?
By default it is manual and fixed: the Trainer doesn’t route fields itself, it just hands the collated batch to the model’s forward(), so the key names have to match that signature.
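Roughly what happens with each batch (a simplified illustration, not the Trainer’s actual source; pixel_values / labels stand for whatever the collator produced):
```
batch = {
    "pixel_values": pixel_values,  # goes to the DonutSwin image encoder
    "labels": labels,              # used by the text decoder to compute the loss
}
# Trainer essentially unpacks the collated batch into the model call:
outputs = model(**batch)
# so every key has to be an argument that VisionEncoderDecoderModel.forward() accepts
# (pixel_values, labels, decoder_input_ids, decoder_attention_mask, ...).
# Leftover keyword arguments are forwarded to the image encoder, which is how a stray
# "input_ids" key ends up raising the TypeError from the first post.
```
Note also that with the default remove_unused_columns=True, and when the training data is a datasets.Dataset, Trainer drops columns whose names don’t match forward()’s signature before the collator even sees them.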
LeyaLi
June 12, 2025, 8:05am
I ran into the same issue and was able to resolve it. It turns out to be related to the Donut model’s `MAX_TOKEN_LEN` setting. My code runs successfully when `MAX_TOKEN_LEN` is set to 128 or lower, but the bug reappears as soon as it exceeds 128.
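In case it helps anyone else, a rough sanity check for this kind of limit (a sketch with placeholder names; `MAX_TOKEN_LEN` is a user-side constant, not a library setting):
```
MAX_TOKEN_LEN = 128  # user-defined constant used when tokenizing the target sequences

labels = processor.tokenizer(
    target_text,
    max_length=MAX_TOKEN_LEN,
    padding="max_length",
    truncation=True,
    return_tensors="pt",
).input_ids

# The decoder cannot handle sequences longer than its position embeddings,
# so the label length has to stay within that limit.
assert MAX_TOKEN_LEN <= model.config.decoder.max_position_embeddings
```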