Hi everyone,
I'm running into an issue when saving a Hugging Face dataset containing images of PDF documents (~200GB total). Here's the code I'm using:
from datasets import Dataset, Features, Value, Sequence, Image
import pickle

features = Features({
    "source": Value("string"),
    "split": Value("string"),
    "doc_id": Value("string"),
    "doc_images": Sequence(Image()),  # lists of PIL.Image objects
    "doc_ocr": Sequence(Value("string")),
    "questions": Value("string"),
})

with open("examples.pkl", "rb") as f:
    examples = pickle.load(f)
print(len(examples))  # 48151 examples

dataset = Dataset.from_list(examples, features=features)
print(dataset[0]["doc_images"])  # images load fine
dataset.save_to_disk("dataset_after_map")
I get the following error during save_to_disk:
......
storage = pa.StructArray.from_arrays([bytes_array, path_array], ["bytes", "path"], mask=bytes_array.is_null())
File "pyarrow/array.pxi", line 4021, in pyarrow.lib.StructArray.from_arrays
File "pyarrow/array.pxi", line 4501, in pyarrow.lib.c_mask_inverted_from_obj
TypeError: Mask must be a pyarrow.Array of type boolean
Details:
- The dataset stores images as PIL.Image objects.
- Images load and display correctly in memory (see the sanity check below).
- Total dataset size: ~200GB.
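To rule out a single bad entry, this is the kind of sanity check I'm planning to run next (a sketch only, assuming every example dict has a doc_images list as in the features above):

from PIL import Image as PILImage

for i, ex in enumerate(examples):
    for j, img in enumerate(ex["doc_images"]):
        # A single None or non-PIL value can break the Arrow image encoding.
        if img is None or not isinstance(img, PILImage.Image):
            print(f"bad image at example {i}, position {j}: {type(img)}")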
What I've Tried:
- Verified that examples.pkl loads correctly.
- Spot-checked that the images are valid (via print(dataset[0]["doc_images"])).
- Looked into the PyArrow error but couldn't figure out a fix.
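One workaround I'm considering (untested) is building and saving the dataset in smaller chunks, which would at least narrow the failure down to a specific slice of examples. A rough sketch, assuming a dataset_chunks output directory is free to use:

chunk_size = 1000
for start in range(0, len(examples), chunk_size):
    # Encode and save each chunk separately so a failure points to
    # a specific slice of examples rather than the whole 200GB.
    chunk = Dataset.from_list(examples[start:start + chunk_size], features=features)
    chunk.save_to_disk(f"dataset_chunks/chunk_{start // chunk_size:05d}")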
Questions:
- Is this related to the dataset size, or to an issue with the Image feature or PyArrow?
- Are there workarounds for saving a dataset this large (e.g., something like the from_generator sketch below)?
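For context on that last question, this is the kind of thing I had in mind, so Arrow writes examples incrementally instead of encoding all 48k at once (a sketch only; I haven't verified it avoids the error):

from datasets import Dataset

def gen():
    # Yield one example dict at a time from the already-loaded list.
    for ex in examples:
        yield ex

dataset = Dataset.from_generator(gen, features=features)
dataset.save_to_disk("dataset_after_map")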
Thanks for your help!