Hi everyone,
I'm running into an issue when saving a Hugging Face dataset containing images of PDF documents (~200GB total). Here's the code I'm using:
from datasets import Dataset, Features, Value, Sequence, Image
import pickle

features = Features({
    "source": Value("string"),
    "split": Value("string"),
    "doc_id": Value("string"),
    "doc_images": Sequence(Image()),  # lists of PIL.Image objects
    "doc_ocr": Sequence(Value("string")),
    "questions": Value("string"),
})

with open("examples.pkl", "rb") as f:
    examples = pickle.load(f)
print(len(examples))  # 48151 examples

dataset = Dataset.from_list(examples, features=features)
print(dataset[0]["doc_images"])  # images load fine
dataset.save_to_disk("dataset_after_map")
I get the following error during save_to_disk:
......
storage = pa.StructArray.from_arrays([bytes_array, path_array], ["bytes", "path"], mask=bytes_array.is_null())
File "pyarrow/array.pxi", line 4021, in pyarrow.lib.StructArray.from_arrays
File "pyarrow/array.pxi", line 4501, in pyarrow.lib.c_mask_inverted_from_obj
TypeError: Mask must be a pyarrow.Array of type boolean
Details:
- The dataset stores images as PIL.Image objects.
- Images load and display correctly in memory (see the sanity check below).
- Total dataset size: ~200GB.
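To rule out a single bad entry, this is the kind of sanity check I'm planning to run next (a sketch only, assuming every example dict has a doc_images list as in the features above):

from PIL import Image as PILImage

for i, ex in enumerate(examples):
    for j, img in enumerate(ex["doc_images"]):
        # A single None or non-PIL value can break the Arrow image encoding.
        if img is None or not isinstance(img, PILImage.Image):
            print(f"bad image at example {i}, position {j}: {type(img)}")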
What I've Tried:
- Verified that examples.pkl loads correctly.
- Spot-checked that the images are valid (via print(dataset[0]["doc_images"])).
- Looked into the PyArrow error but couldn't figure out a fix.
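One workaround I'm considering (untested) is building and saving the dataset in smaller chunks, which would at least narrow the failure down to a specific slice of examples. A rough sketch, assuming a dataset_chunks output directory is free to use:

chunk_size = 1000
for start in range(0, len(examples), chunk_size):
    # Encode and save each chunk separately so a failure points to
    # a specific slice of examples rather than the whole 200GB.
    chunk = Dataset.from_list(examples[start:start + chunk_size], features=features)
    chunk.save_to_disk(f"dataset_chunks/chunk_{start // chunk_size:05d}")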
Questions:
- Is this related to the dataset size, or to an issue with the Image feature or PyArrow?
- Are there workarounds for saving a dataset this large (e.g., something like the from_generator sketch below)?
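For context on that last question, this is the kind of thing I had in mind, so Arrow writes examples incrementally instead of encoding all 48k at once (a sketch only; I haven't verified it avoids the error):

from datasets import Dataset

def gen():
    # Yield one example dict at a time from the already-loaded list.
    for ex in examples:
        yield ex

dataset = Dataset.from_generator(gen, features=features)
dataset.save_to_disk("dataset_after_map")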
Thanks for your help!