Error While Saving Dataset with PyArrow

Hi everyone,

I'm running into an issue when saving a Hugging Face dataset containing images of PDF documents (total ~200GB). Here's the code I'm using:


from datasets import Dataset, Features, Value, Sequence, Image
import pickle

features = Features({
    "source": Value("string"),
    "split": Value("string"),
    "doc_id": Value("string"),
    "doc_images": Sequence(Image()),
    "doc_ocr": Sequence(Value("string")),
    "questions": Value("string")
})

with open('examples.pkl', 'rb') as f:
    examples = pickle.load(f)

print(len(examples))  # 48151 examples
dataset = Dataset.from_list(examples, features=features)
print(dataset[0]["doc_images"])  # Images load fine
dataset.save_to_disk("dataset_after_map")

I get the following error during save_to_disk:

......
storage = pa.StructArray.from_arrays([bytes_array, path_array], ["bytes", "path"], mask=bytes_array.is_null())
  File "pyarrow/array.pxi", line 4021, in pyarrow.lib.StructArray.from_arrays
  File "pyarrow/array.pxi", line 4501, in pyarrow.lib.c_mask_inverted_from_obj
TypeError: Mask must be a pyarrow.Array of type boolean

Details:

Dataset contains images as PIL.Image objects.
Images load and display correctly in memory.
Total size of dataset: ~200GB.

What I've Tried:

  • Verified that examples.pkl is loading properly.
  • Images seem valid (confirmed via print(dataset[0]["doc_images"])).
  • Looked into PyArrow errors but can't figure out a fix.
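
Since the print above only inspects the first example, a broader check over all examples might help rule out bad entries. Below is a minimal sketch, not part of my original script: the helper name find_suspect_examples and the stand-in sample data are hypothetical, and I don't know whether empty or None image entries are actually what triggers the error.

```python
def find_suspect_examples(examples, image_key="doc_images"):
    """Return (index, reason) pairs for examples whose image list looks off.

    Hypothetical helper: it only flags missing, empty, or None entries and
    does not decode or otherwise validate the images themselves.
    """
    suspects = []
    for i, example in enumerate(examples):
        images = example.get(image_key)
        if images is None:
            suspects.append((i, "missing image list"))
        elif len(images) == 0:
            suspects.append((i, "empty image list"))
        elif any(img is None for img in images):
            suspects.append((i, "None entry in image list"))
    return suspects


# Tiny stand-in data; in the real script this would be the list
# loaded from examples.pkl, with PIL images instead of strings.
sample = [
    {"doc_images": ["page-1", "page-2"]},
    {"doc_images": []},
    {"doc_images": ["page-1", None]},
    {"doc_id": "no-images"},
]
print(find_suspect_examples(sample))
```

If this returns an empty list for the real data, the image lists themselves are probably not the problem.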

Questions:

  • Is this related to the dataset size, or an issue with Image or PyArrow?
  • Any workarounds to save such a large dataset?

Thanks for your help!
