Preparing data for Donut training results in error "ArrowInvalid: offset overflow while concatenating arrays"

Hi there,

I am following this notebook to train the Donut model on custom data: 5,500 invoice images (1200x1432 px), each with a corresponding JSON file. Everything goes well until this line:

# need at least 32-64GB of RAM to run this
processed_dataset = proc_dataset.map(transform_and_tokenize, remove_columns=["image", "text"])

Which, after running for 5-6 minutes, results in the following error:

…
File /opt/conda/lib/python3.11/site-packages/pyarrow/table.pxi:4512, in pyarrow.lib.Table.combine_chunks()

File /opt/conda/lib/python3.11/site-packages/pyarrow/error.pxi:155, in pyarrow.lib.pyarrow_internal_check_status()

File /opt/conda/lib/python3.11/site-packages/pyarrow/error.pxi:92, in pyarrow.lib.check_status()

ArrowInvalid: offset overflow while concatenating arrays, consider casting input from list<item: list<item: list<item: float>>> to list<item: list<item: large_list<item: float>>> first.

If I train the model on the dataset used in the notebook I mentioned (the SROIE dataset), everything goes fine. The notebook resizes the images to 720x960 px, but I need a slightly larger width and height, which is why I use 1200x1432 px instead. Apart from that, and the number of images, there really isn't any difference I can spot. I have 128 GB of RAM, so that shouldn't be the issue. The error itself has only 2-3 hits on Google, and I'm too much of a beginner in Python to make sense of them.

What can be the issue here?


I found a similar symptom, but it seems to be an unresolved bug in PyArrow…
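
For context (this is my reading of the error, not something I've verified against your exact data): plain Arrow list arrays store their element offsets as 32-bit integers, so a single concatenated chunk can hold only about 2^31 float values, while large_list uses 64-bit offsets. Your pixel_values at 1200x1432 are roughly 3 * 1200 * 1432 ≈ 5.2 million floats each, so somewhere around 400 examples already hit that ceiling, whereas at 720x960 (≈2.1 million floats each) about 1000 examples still fit. Here is a minimal PyArrow sketch of the cast the error message is proposing, on a toy list<double> column (your real column is triply nested, and I don't know a clean way to apply this cast from inside datasets' map):

import pyarrow as pa

# toy list<double> column; plain list arrays keep their offsets as int32
arr = pa.array([[0.1, 0.2, 0.3], [0.4, 0.5]])
print(arr.type)  # list<item: double>

# large_list stores offsets as int64, so it avoids the 2^31-element limit;
# this is the kind of cast the error message is suggesting
big = arr.cast(pa.large_list(pa.float64()))
print(big.type)  # large_list<item: double>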

Thanks for helping!
One thing that finally worked is using batched mapping, though it comes with a lot of SSD overhead:

proc_dataset.map(transform_and_tokenize, batched=True, batch_size=16)

But I guess it’s better than nothing.
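
For anyone who finds this later: my guess (not something I have measured) is that batching helps because each table the map call hands to the Arrow writer is much smaller, so the 32-bit list offsets never overflow. map also takes a writer_batch_size argument (default 1000) that caps how many processed examples end up in each cache write; a combined call would look something like this, though I have not tested whether the extra writer_batch_size actually makes a difference here:

# untested: writer_batch_size caps how many processed examples go into each
# Arrow cache write (default is 1000), which should also keep chunks small
processed_dataset = proc_dataset.map(
    transform_and_tokenize,
    remove_columns=["image", "text"],
    batched=True,
    batch_size=16,
    writer_batch_size=100,
)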
