Hello,
I’m new to ML and this is probably a basic problem. I’m trying to fine-tune the Donut base model on my own documents, but I’m getting errors.
The error is:
TypeError: DonutSwinModel.forward() got an unexpected keyword argument 'input_ids'
I’m generating a dataset from document images and an annotations.jsonl file with entries like the following:
{"label": "{\"load_id\": \"1234\", \"carrier_name\": \"Bison\"}", "image": "TOUR_LOGISTICS_0.png"}
My dataset has:
{
    "pixel_values": batch["pixel_values"],
    "decoder_input_ids": batch["decoder_input_ids"],
    "labels": batch["labels"]
}
Doesn’t the Trainer know which fields to use for the encoder and the decoder?
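For reference, my data preparation looks roughly like this (a simplified sketch, not my exact code; `DonutJsonlDataset` is just a placeholder name and I’m using the standard `naver-clova-ix/donut-base` processor):

```python
import json
from PIL import Image
from torch.utils.data import Dataset
from transformers import DonutProcessor

processor = DonutProcessor.from_pretrained("naver-clova-ix/donut-base")

class DonutJsonlDataset(Dataset):
    """Builds (pixel_values, decoder_input_ids, labels) from images + annotations.jsonl."""

    def __init__(self, jsonl_path, image_dir, max_length=512):
        self.records = [json.loads(line) for line in open(jsonl_path)]
        self.image_dir = image_dir
        self.max_length = max_length

    def __len__(self):
        return len(self.records)

    def __getitem__(self, idx):
        rec = self.records[idx]
        image = Image.open(f"{self.image_dir}/{rec['image']}").convert("RGB")
        pixel_values = processor(image, return_tensors="pt").pixel_values.squeeze(0)

        # Tokenize the JSON label string for the decoder
        tokens = processor.tokenizer(
            rec["label"],
            max_length=self.max_length,
            padding="max_length",
            truncation=True,
            return_tensors="pt",
        )
        decoder_input_ids = tokens.input_ids.squeeze(0)

        labels = decoder_input_ids.clone()
        labels[labels == processor.tokenizer.pad_token_id] = -100  # padding is ignored by the loss

        return {
            "pixel_values": pixel_values,
            "decoder_input_ids": decoder_input_ids,
            "labels": labels,
        }
```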
It may be a version-dependent bug in Transformers…
https://github.com/huggingface/transformers/issues/35838
I downgraded to transformers==4.45.2 and it didn’t help.
Hmm… Maybe the wrong collate_fn is assigned, or maybe the model settings are wrong.
https://stackoverflow.com/questions/77182311/question-about-data-collator-throwing-a-key-error-in-hugging-face
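For Donut (a VisionEncoderDecoderModel), the collator usually only needs to return the keys that the model’s forward() accepts. A minimal sketch, assuming your dataset already returns tensors for pixel_values and labels:

```python
import torch

# Only pass keys that VisionEncoderDecoderModel.forward() accepts, so that
# nothing like `input_ids` leaks through to the Swin encoder.
def collate_fn(batch):
    return {
        "pixel_values": torch.stack([ex["pixel_values"] for ex in batch]),
        "labels": torch.stack([ex["labels"] for ex in batch]),
        # decoder_input_ids are optional: when labels are given, the model
        # builds them by shifting the labels to the right.
    }
```

Then pass it to the Trainer as data_collator=collate_fn. If the batch still contains an extra input_ids key, it can end up being forwarded to the Swin encoder, which is exactly the TypeError you are seeing.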
(quoted issue thread, opened Apr 27, 2023, closed Jul 7, 2023)
Hello,
Thank you again for the fantastic work on this library and all the examples you are including!!
Big up @younesbelkada for all the support as well...
I have been trying to play around with BLIP2 and PEFT using the example notebook (https://colab.research.google.com/drive/16XbIysCzgpAld7Kd9-xz-23VPWmqdWmW?usp=sharing#scrollTo=6cCVhsmJxxjH), and a few things came up that I was hoping to get your help with:
1. When trying to finetune with "salesforce/blip2-flan-t5-xl", I got a ton of issues:
```
from peft import LoraConfig

config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    bias="none",
    target_modules=["q_proj", "k_proj"],
)
```
The q_proj and k_proj layers don't exist in this model, so I used "q","v" (or just the default values), and that made the loss converge to 0 extremely quickly. However, the model was really just outputting gibberish, so I'm likely not using the right target_modules... How are you supposed to choose this parameter? In general, is there a heuristic for these, such as T5 -> q,v and OPT -> q_proj,k_proj, and is it different for the regular model vs BLIP2?
- I tried using a bigger OPT (i.e. "ybelkada/blip2-opt-2.7b-fp16-sharded" or "ybelkada/blip2-opt-2.7b-fp16-sharded") and that just made the loss train with "nan" all the time regardless of what I tried.
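One way to see which target_modules a given checkpoint actually has (an illustrative sketch, not from the original post; the name filter is just an example) is to list the model's named modules and pick out the attention projections:

```python
from transformers import Blip2ForConditionalGeneration

model = Blip2ForConditionalGeneration.from_pretrained("Salesforce/blip2-flan-t5-xl")

# Flan-T5-based checkpoints expose attention projections named "q"/"k"/"v"/"o",
# while OPT-based ones use "q_proj"/"k_proj"/"v_proj"/"out_proj".
for name, _ in model.named_modules():
    if name.split(".")[-1] in {"q", "k", "v", "q_proj", "k_proj", "v_proj"}:
        print(name)
```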
2. Something seemed really odd in the training loop, specifically: `outputs = model(input_ids=input_ids, pixel_values=pixel_values, labels=input_ids)`
- From my understanding, this would imply that we are already passing the label we want to predict into the model as an input?
- I also tried to go beyond just image captioning and train a VQA model by modifying the notebook as follows:
```
import torch
from torch.utils.data import Dataset


class ImageCaptioningDataset(Dataset):
    def __init__(self, dataset, processor):
        self.dataset = dataset
        self.processor = processor

    def __len__(self):
        return len(self.dataset)

    def __getitem__(self, idx):
        item = self.dataset[idx]
        encoding = self.processor(
            images=item["image"], text=item["prompt"],
            padding="max_length", return_tensors="pt"
        )
        # remove batch dimension
        encoding = {k: v.squeeze() for k, v in encoding.items()}
        encoding["text"] = item["text"]
        return encoding


def collate_fn(batch):
    # pad the input_ids and attention_mask
    processed_batch = {}
    for key in batch[0].keys():
        if key in ["pixel_values", "input_ids"]:
            processed_batch[key] = torch.stack([example[key] for example in batch])
        elif key == "text":
            text_inputs = processor.tokenizer(
                [example["text"] for example in batch], padding=True, return_tensors="pt"
            )
            processed_batch["input_ids_label"] = text_inputs["input_ids"]
            processed_batch["attention_mask_label"] = text_inputs["attention_mask"]
    return processed_batch


# inside the training loop
input_ids = batch.pop("input_ids").to(device)
input_ids_label = batch.pop("input_ids_label").to(device)
pixel_values = batch.pop("pixel_values").to(device, torch.float16)

outputs = model(input_ids=input_ids,
                pixel_values=pixel_values,
                labels=input_ids_label)
```
But then it didn't really seem to converge as well as the regular image captioning despite always having the same prompt throughout my dataset... Anything I could be doing wrong?
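As an aside (not from the original post): passing labels=input_ids in the captioning loop is, as far as I understand, the usual teacher-forcing setup, since the model shifts the labels internally before computing the loss. For the VQA variant, one thing that often helps is masking the padding tokens in the label ids so they are ignored by the loss. A rough sketch, reusing the variables from the snippet above:

```python
import torch

# Mask padding in the label ids; -100 is the ignore_index used by the
# cross-entropy loss inside the model, so padded positions don't count.
input_ids = batch.pop("input_ids").to(device)
input_ids_label = batch.pop("input_ids_label").to(device)
pixel_values = batch.pop("pixel_values").to(device, torch.float16)

labels = input_ids_label.clone()
labels[labels == processor.tokenizer.pad_token_id] = -100

outputs = model(input_ids=input_ids,
                pixel_values=pixel_values,
                labels=labels)
```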
Thanks in advance!
> Doesn’t the Trainer know which fields to use for the encoder and the decoder?
By default, it seems to be manual or fixed: the mapping comes from the model’s forward() argument names rather than being inferred by the Trainer, so the batch keys have to match those names.
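For example, you can check which keyword arguments the model’s forward() declares (a quick sketch). As far as I can tell, keys that don’t match are passed through **kwargs and forwarded to the Swin encoder, which is where the input_ids TypeError comes from:

```python
import inspect
from transformers import VisionEncoderDecoderModel

model = VisionEncoderDecoderModel.from_pretrained("naver-clova-ix/donut-base")

# Prints pixel_values, decoder_input_ids, labels, ... but no encoder-side input_ids,
# so the batch produced by the collator has to use exactly these names.
print(list(inspect.signature(model.forward).parameters))
```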