LoadDataSet pyarrow.lib.ArrowCapacityError

I use

data_set = load_dataset(self.data_file_path, cache_dir=cache_dir, split="train")

When loading the dataset (approximately 84 GB), I get the following error:

pyarrow.lib.ArrowCapacityError: array cannot contain more than 2147483646 bytes, have 10761561509,

I tried the fix suggested in other posts:

set(data_set["hash"])

This still did not solve the problem. Is there any way to fix it? Thank you!

My version information is as follows:

  • datasets version: 3.2.0
  • Platform: Linux-4.19.91-014.15-kangaroo.alios7.x86_64-x86_64-with-glibc2.35
  • Python version: 3.11.10
  • huggingface_hub version: 0.26.5
  • PyArrow version: 17.0.0
  • Pandas version: 2.2.3
  • fsspec version: 2024.2.0

Apparently this is a PyArrow limitation, and although parts of it have been addressed, it still seems unresolved. @lhoestq
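
As far as I understand, the 2147483646-byte figure is the capacity of PyArrow's default string/binary arrays, which use 32-bit offsets, so a single column chunk cannot hold more than roughly 2 GiB. A quick back-of-the-envelope check using only the two numbers from your error message:

limit = 2_147_483_646    # ~2 GiB, the capacity reported in the error
needed = 10_761_561_509  # bytes that one column chunk apparently requires
print(needed / limit)    # ~5.0, i.e. about five times over the limit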

Yes, I have seen similar posts with the same issue:

Minhash Deduplication - #11 by conceptofmind

But I tried that method and it did not solve the error.
Is there any other way to solve this problem?
Thank you!


How about trying .shard()?
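
For example, a minimal sketch (the shard count here is arbitrary):

shard_0 = data_set.shard(num_shards=4, index=0)  # first of 4 contiguous slices
shard_1 = data_set.shard(num_shards=4, index=1)  # second slice, and so on

Each call returns a smaller Dataset that you can process independently.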

Thank you for your reply.
Doesn't .shard() only partition the dataset after the load_dataset() object has already been created?

But this error occurs during load_dataset() itself.


It may also be another limitation of PyArrow. If you set num_shards to around 20, maybe it will work… I hope it does.
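
Roughly what I have in mind, as a sketch (the processing step is a placeholder):

num_shards = 20
for i in range(num_shards):
    part = data_set.shard(num_shards=num_shards, index=i)  # work on one slice at a time
    # ... run your processing on this smaller piece ...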

I have already set num_shards to 100, but the same error still occurs:

data_set = load_dataset(self.data_file_path, cache_dir=cache_dir, split="train")
data_set = data_set.shard(num_shards=100, index=0)

It seems that the error already occurs while executing load_dataset(), before .shard() is ever reached.
