LayoutLMv3 processor error

I am currently using LayoutLMv3 fine-tuned on the FUNSD dataset.

When running the model on new images, I noticed a problem with the processor.

encoding = processor(resized_image, words, boxes=boxes, return_offsets_mapping=True, return_tensors="pt")

The elements of boxes must be less than 1000, so I resized both the boxes and the image so that the maximum dimension is 1000.
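
Concretely, something like this (a rough sketch; image is a PIL image and word_boxes are the original pixel-coordinate boxes):

# Rough sketch of the resizing described above: scale the image and the
# boxes together so the longest side is at most 1000 pixels.
width, height = image.size
scale = 1000 / max(width, height)
resized_image = image.resize((int(width * scale), int(height * scale)))
boxes = [[int(coord * scale) for coord in box] for box in word_boxes]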

Is that the correct way of doing things?

Then I encountered an error during

with torch.no_grad():
    outputs = model(**encoding)

"IndexError: index out of range in self"

Can anybody explain why this error occurs?

And lastly, are all the images resized to (224, 224)?
Here are the encodings.
for k, v in encoding.items():
    print(k, v.shape)

Outputs:
input_ids torch.Size([1, 795])
attention_mask torch.Size([1, 795])
offset_mapping torch.Size([1, 795, 2])
bbox torch.Size([1, 795, 4])
pixel_values torch.Size([1, 3, 224, 224])


Apparently there is a precedent for this. It seems the dataset and the model are incompatible, so you will probably need to normalize the dataset manually.
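
If it helps, for the LayoutLM family "normalizing" usually means rescaling each box into the 0-1000 coordinate system the model expects, independent of the image resolution. A minimal sketch, assuming PIL images and pixel-coordinate boxes (normalize_box and raw_boxes are placeholder names):

# Scale pixel-coordinate boxes to the 0-1000 range that LayoutLM-family
# models expect, regardless of the original image resolution.
def normalize_box(box, width, height):
    return [
        int(1000 * box[0] / width),
        int(1000 * box[1] / height),
        int(1000 * box[2] / width),
        int(1000 * box[3] / height),
    ]

width, height = image.size  # PIL image
boxes = [normalize_box(b, width, height) for b in raw_boxes]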


For me, normalization was not the problem.
As mentioned in the posts above, one of the recurring problems was that the bounding boxes were too small.

That was not the problem for me:

Code to check that each bbox is wider and taller than one pixel:

for bbox in bounding_boxes:
    assert bbox[2] - bbox[0] > 1
    assert bbox[3] - bbox[1] > 1

The problem was that the model's embedding layer was receiving indices it could not accept. This generally happens when the data sample is longer than 512 tokens; you have to set the truncation parameter to True so that the length never exceeds 512. Mine was 700.

encoding = processor(original_image, words, boxes=boxes, return_offsets_mapping=True, max_length=512, padding="max_length", truncation=True, return_tensors="pt")
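
In case it is useful, here is a sketch of sanity checks covering the two failure modes discussed in this thread (over-long sequences and out-of-range boxes); processor, model, and encoding refer to the snippets above:

import torch

# Sequence length must fit the model (512 for LayoutLMv3).
max_len = processor.tokenizer.model_max_length
assert encoding["input_ids"].shape[1] <= max_len, "sequence too long: pass truncation=True"

# Boxes must already be in the 0-1000 coordinate system.
assert 0 <= int(encoding["bbox"].min()) and int(encoding["bbox"].max()) <= 1000, "bbox values must lie in 0-1000"

# The model's forward() does not accept offset_mapping, so drop it first.
offset_mapping = encoding.pop("offset_mapping")

with torch.no_grad():
    outputs = model(**encoding)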

But I still have not figured out why the images are resized to (224, 224).

Thanks John6666


I’m glad you were able to resolve some of this.

But I still have not figured out why the images are resized to (224, 224).

Other models, such as SigLIP, also resize to around that size (though not exactly the same), so the current model is presumably designed for that level of resolution.
However, there is a lot I don't understand, such as why the resolution has to be reduced that much, and whether stretching, padding, or cropping gives better results.
Well, I don't mainly deal with VLMs or LLMs, so I'll assume that as long as things work, it's fine.
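
For what it's worth, the target size comes from the processor's image-processing config (LayoutLMv3's visual branch is a ViT-style patch embedding pretrained at 224x224). A quick way to inspect it, assuming a recent transformers version; older releases expose the same object as feature_extractor instead of image_processor:

from transformers import AutoProcessor

# Example with the base checkpoint; substitute the fine-tuned model actually used.
processor = AutoProcessor.from_pretrained("microsoft/layoutlmv3-base", apply_ocr=False)

# `size` is the resize target used for pixel_values.
print(processor.image_processor.size)        # e.g. {'height': 224, 'width': 224}
print(processor.tokenizer.model_max_length)  # 512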
