Preparing data for Donut training results in error "ArrowInvalid: offset overflow while concatenating arrays"

Hi there,

I am following this notebook to train the Donut model on custom data: 5,500 invoice images (1200x1432 px), each with a corresponding JSON file. Everything goes well until this line:

# need at least 32-64GB of RAM to run this
processed_dataset = proc_dataset.map(transform_and_tokenize, remove_columns=["image", "text"])

Which, after running for 5-6 minutes, results in the following error:

…
File /opt/conda/lib/python3.11/site-packages/pyarrow/table.pxi:4512, in pyarrow.lib.Table.combine_chunks()

File /opt/conda/lib/python3.11/site-packages/pyarrow/error.pxi:155, in pyarrow.lib.pyarrow_internal_check_status()

File /opt/conda/lib/python3.11/site-packages/pyarrow/error.pxi:92, in pyarrow.lib.check_status()

ArrowInvalid: offset overflow while concatenating arrays, consider casting input from list<item: list<item: list<item: float>>> to list<item: list<item: large_list<item: float>>> first.

If I train the model on the dataset used in the notebook I mentioned (the SROIE dataset), everything goes fine. The notebook resizes the images to 720x960 px, but I need a slightly larger width and height, which is why I use 1200x1432 px instead. Apart from that, and the number of images, there really isn't any difference I can spot. I have 128 GB of RAM, so that shouldn't be the issue. The error itself has only 2-3 hits on Google, and I'm too much of a beginner in Python to make sense of them.

What can be the issue here?


I found a similar symptom, but it seems to be an unresolved bug in PyArrow…
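
For context (this is my reading of the error, not something I've verified against your exact data): plain Arrow list arrays store their element offsets as 32-bit integers, so a single concatenated chunk can hold only about 2^31 float values, while large_list uses 64-bit offsets. Your pixel_values at 1200x1432 are roughly 3 * 1200 * 1432 ≈ 5.2 million floats each, so somewhere around 400 examples already hit that ceiling, whereas at 720x960 (≈2.1 million floats each) about 1000 examples still fit. Here is a minimal PyArrow sketch of the cast the error message is proposing, on a toy list<double> column (your real column is triply nested, and I don't know a clean way to apply this cast from inside datasets' map):

import pyarrow as pa

# toy list<double> column; plain list arrays keep their offsets as int32
arr = pa.array([[0.1, 0.2, 0.3], [0.4, 0.5]])
print(arr.type)  # list<item: double>

# large_list stores offsets as int64, so it avoids the 2^31-element limit;
# this is the kind of cast the error message is suggesting
big = arr.cast(pa.large_list(pa.float64()))
print(big.type)  # large_list<item: double>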

Thanks for helping!
One thing that finally worked is using batched mapping, though it comes with a lot of SSD overhead:

proc_dataset.map(transform_and_tokenize, batched=True, batch_size=16)

But I guess it’s better than nothing.
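
For anyone who finds this later: my guess (not something I have measured) is that batching helps because each table the map call hands to the Arrow writer is much smaller, so the 32-bit list offsets never overflow. map also takes a writer_batch_size argument (default 1000) that caps how many processed examples end up in each cache write; a combined call would look something like this, though I have not tested whether the extra writer_batch_size actually makes a difference here:

# untested: writer_batch_size caps how many processed examples go into each
# Arrow cache write (default is 1000), which should also keep chunks small
processed_dataset = proc_dataset.map(
    transform_and_tokenize,
    remove_columns=["image", "text"],
    batched=True,
    batch_size=16,
    writer_batch_size=100,
)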
