How to split large metadata.jsonl for ImageFolder?

Hi, I’m trying to create an image dataset with metadata, but I’m getting the error below, which I think is because my metadata.jsonl is large (714M). I’ve tried breaking it up a couple of different ways (metadata_0.jsonl and metadata_1.jsonl, and metadata-00000-of-00002.jsonl and metadata-00001-of-00002.jsonl), but then the dataset loaded without the metadata:

>>> from datasets import load_dataset
>>> ds = load_dataset("imagefolder", data_dir="to-upload", split="train")
>>> ds[0]
{'image': <PIL.PngImagePlugin.PngImageFile image mode=RGBA size=1920x1080 at 0x79DA542B14E0>}

How do I work around this? If I break the whole dataset in two, then upload each half like:

for data_dir in ("upload-1", "upload-2"):
    ds = load_dataset("imagefolder", data_dir=data_dir, split="train")
    ds.push_to_hub("gbenson/webui-dom-snapshots")

will this concatenate the two uploads into one dataset, or something else?

Thanks,
Gary

>>> from datasets import load_dataset
>>> dataset = load_dataset("imagefolder", data_dir="to-upload") #, split="train")
Resolving data files: 100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 4537/4537 [00:00<00:00, 114285.46it/s]
Downloading data: 100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 4537/4537 [00:00<00:00, 410305.47files/s]
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/gary/projects/webui-dom-snapshots/.venv/lib/python3.10/site-packages/datasets/load.py", line 2609, in load_dataset
    builder_instance.download_and_prepare(
  File "/home/gary/projects/webui-dom-snapshots/.venv/lib/python3.10/site-packages/datasets/builder.py", line 1027, in download_and_prepare
    self._download_and_prepare(
  File "/home/gary/projects/webui-dom-snapshots/.venv/lib/python3.10/site-packages/datasets/builder.py", line 1789, in _download_and_prepare
    super()._download_and_prepare(
  File "/home/gary/projects/webui-dom-snapshots/.venv/lib/python3.10/site-packages/datasets/builder.py", line 1100, in _download_and_prepare
    split_generators = self._split_generators(dl_manager, **split_generators_kwargs)
  File "/home/gary/projects/webui-dom-snapshots/.venv/lib/python3.10/site-packages/datasets/packaged_modules/folder_based_builder/folder_based_builder.py", line 175, in _split_generators
    pa_metadata_table = self._read_metadata(downloaded_metadata_file, metadata_ext=metadata_ext)
  File "/home/gary/projects/webui-dom-snapshots/.venv/lib/python3.10/site-packages/datasets/packaged_modules/folder_based_builder/folder_based_builder.py", line 245, in _read_metadata
    return paj.read_json(f)
  File "pyarrow/_json.pyx", line 308, in pyarrow._json.read_json
  File "pyarrow/error.pxi", line 154, in pyarrow.lib.pyarrow_internal_check_status
  File "pyarrow/error.pxi", line 91, in pyarrow.lib.check_status
pyarrow.lib.ArrowInvalid: straddling object straddles two block boundaries (try to increase block size?)

Look away now, I got it working like this:

import os

import pyarrow.json as paj

from datasets import load_dataset

# Keep a reference to the real reader so the wrapper can delegate to it.
_paj_read_json = paj.read_json


def _read_json(*args, **kwargs):
    # datasets calls paj.read_json(f) with an open file object and no
    # options, so only intercept that exact call pattern.
    if len(args) == 1 and not kwargs:
        # Use a block size at least as large as the whole file, so no
        # JSON object can straddle more than two block boundaries.
        file_size = os.fstat(args[0].fileno()).st_size
        kwargs["read_options"] = paj.ReadOptions(block_size=file_size)
    return _paj_read_json(*args, **kwargs)


# Monkeypatch the reader datasets uses internally.
paj.read_json = _read_json


def main():
    ds = load_dataset("imagefolder", data_dir="to-upload")
    ds.push_to_hub("gbenson/webui-dom-snapshots")


if __name__ == "__main__":
    main()

Fix submitted upstream: