Hi, I’m trying to create an image dataset with metadata, but I’m getting the error below, which I think is because my metadata.jsonl is large (714M). I’ve tried breaking it up a couple of different ways (metadata_0.jsonl and metadata_1.jsonl, and metadata-00000-of-00002.jsonl and metadata-00001-of-00002.jsonl), but then the dataset loaded without the metadata:
>>> from datasets import load_dataset
>>> ds = load_dataset("imagefolder", data_dir="to-upload", split="train")
>>> ds[0]
{'image': <PIL.PngImagePlugin.PngImageFile image mode=RGBA size=1920x1080 at 0x79DA542B14E0>}
How do I work around this? If I break the whole dataset in two, then upload each half like this:
for data_dir in ("upload-1", "upload-2"):
    ds = load_dataset("imagefolder", data_dir=data_dir, split="train")
    ds.push_to_hub("gbenson/webui-dom-snapshots")
will this concatenate the two uploads into one dataset, or do something else?
Thanks,
Gary
>>> from datasets import load_dataset
>>> dataset = load_dataset("imagefolder", data_dir="to-upload") #, split="train")
Resolving data files: 100%|███████████████████████████████| 4537/4537 [00:00<00:00, 114285.46it/s]
Downloading data: 100%|████████████████████████████████| 4537/4537 [00:00<00:00, 410305.47files/s]
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/gary/projects/webui-dom-snapshots/.venv/lib/python3.10/site-packages/datasets/load.py", line 2609, in load_dataset
    builder_instance.download_and_prepare(
  File "/home/gary/projects/webui-dom-snapshots/.venv/lib/python3.10/site-packages/datasets/builder.py", line 1027, in download_and_prepare
    self._download_and_prepare(
  File "/home/gary/projects/webui-dom-snapshots/.venv/lib/python3.10/site-packages/datasets/builder.py", line 1789, in _download_and_prepare
    super()._download_and_prepare(
  File "/home/gary/projects/webui-dom-snapshots/.venv/lib/python3.10/site-packages/datasets/builder.py", line 1100, in _download_and_prepare
    split_generators = self._split_generators(dl_manager, **split_generators_kwargs)
  File "/home/gary/projects/webui-dom-snapshots/.venv/lib/python3.10/site-packages/datasets/packaged_modules/folder_based_builder/folder_based_builder.py", line 175, in _split_generators
    pa_metadata_table = self._read_metadata(downloaded_metadata_file, metadata_ext=metadata_ext)
  File "/home/gary/projects/webui-dom-snapshots/.venv/lib/python3.10/site-packages/datasets/packaged_modules/folder_based_builder/folder_based_builder.py", line 245, in _read_metadata
    return paj.read_json(f)
  File "pyarrow/_json.pyx", line 308, in pyarrow._json.read_json
  File "pyarrow/error.pxi", line 154, in pyarrow.lib.pyarrow_internal_check_status
  File "pyarrow/error.pxi", line 91, in pyarrow.lib.check_status
pyarrow.lib.ArrowInvalid: straddling object straddles two block boundaries (try to increase block size?)
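The message mentions block size, so the problem may be the size of individual records rather than the total file size. As far as I can tell, pyarrow's JSON reader processes input in blocks (pyarrow.json.ReadOptions has a block_size parameter) and a single JSON object has to fit within that. Below is a minimal sketch for checking the longest line in a JSONL file using only the standard library. The 1 MiB threshold is my assumption about the default, and longest_record and demo are just illustrative names, not part of any library.

```python
import json
import os
import tempfile

# Assumption: pyarrow's JSON reader defaults to blocks of about 1 MiB.
ASSUMED_BLOCK_SIZE = 1 << 20


def longest_record(path):
    """Return (line_number, size_in_bytes) of the longest line in a JSONL file."""
    longest = (0, 0)
    with open(path, "rb") as f:
        for lineno, line in enumerate(f, start=1):
            if len(line) > longest[1]:
                longest = (lineno, len(line))
    return longest


def demo():
    """Write a tiny JSONL file with one oversized record and measure it."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "metadata.jsonl")
        with open(path, "w", encoding="utf-8") as f:
            f.write(json.dumps({"file_name": "a.png", "dom": "x"}) + "\n")
            # Hypothetical record with a very large field, e.g. a DOM snapshot.
            f.write(json.dumps({"file_name": "b.png", "dom": "x" * 2_000_000}) + "\n")
        return longest_record(path)


if __name__ == "__main__":
    lineno, size = demo()
    if size > ASSUMED_BLOCK_SIZE:
        print(f"line {lineno} is {size} bytes, larger than the assumed block size")
```

If the longest record in the real metadata.jsonl is bigger than the block size, splitting the file into smaller files would not help by itself, because each record would still be too large for one block.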