I am following this notebook to train a Donut model on custom data: 5,500 invoice images, 1200x1432px each, with corresponding JSON annotation files. Everything goes well until this line:
# need at least 32-64GB of RAM to run this
processed_dataset = proc_dataset.map(transform_and_tokenize, remove_columns=["image", "text"])
After running for 5-6 minutes, it fails with the following error:
…
File /opt/conda/lib/python3.11/site-packages/pyarrow/table.pxi:4512, in pyarrow.lib.Table.combine_chunks()
File /opt/conda/lib/python3.11/site-packages/pyarrow/error.pxi:155, in pyarrow.lib.pyarrow_internal_check_status()
File /opt/conda/lib/python3.11/site-packages/pyarrow/error.pxi:92, in pyarrow.lib.check_status()
ArrowInvalid: offset overflow while concatenating arrays, consider casting input from list<item: list<item: list<item: float>>> to list<item: list<item: large_list<item: float>>> first.
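For context, the only code change I make relative to the notebook is the target input size. It looks roughly like this (a sketch; I'm writing the attribute names from memory, so they may not match the notebook exactly):

# Sketch of my only change vs. the notebook: a larger target size.
# Attribute names are from memory and may differ slightly.
from transformers import DonutProcessor

processor = DonutProcessor.from_pretrained("naver-clova-ix/donut-base")
processor.feature_extractor.size = [1200, 1432]  # (width, height); the notebook uses [720, 960]
processor.feature_extractor.do_align_long_axis = False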
If I train the model on the dataset used in the notebook (the SROIE dataset), everything goes fine. The notebook reduces the images to 720x960px, but I need a slightly larger width and height, which is why I use 1200x1432px instead. Apart from this and the number of images, there really isn't any difference I can spot. I have 128 GB of RAM, so memory shouldn't be the issue. The error itself has only 2-3 hits on Google, which I'm too much of a Python beginner to make sense of.
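If I read the error right, it comes from Arrow's plain list type using 32-bit offsets: at 1200x1432px an image tensor holds about 1200 × 1432 × 3 ≈ 5.2M float values, so the default writer batch of 1000 examples would contain over 5 billion values, past the 2^31 offset limit, whereas 720x960x3 ≈ 2.07M × 1000 stays just under it. Is that the right reading, and would lowering writer_batch_size in map() be a sensible fix? A sketch of what I mean (writer_batch_size is an existing parameter of datasets.Dataset.map; the value 300 is an arbitrary guess on my part):

# Same map() call, but asking Datasets to flush smaller Arrow batches to
# its cache. writer_batch_size defaults to 1000; 300 is an arbitrary guess
# chosen so one batch stays well below Arrow's 32-bit offset limit.
processed_dataset = proc_dataset.map(
    transform_and_tokenize,
    remove_columns=["image", "text"],
    writer_batch_size=300,
)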