LoadDataSet pyarrow.lib.ArrowCapacityError

I use

data_set = load_dataset(self.data_file_path, cache_dir=cache_dir, split="train")

When loading the dataset (approximately 84 GB), I get the following error:

pyarrow.lib.ArrowCapacityError: array cannot contain more than 2147483646 bytes, have 10761561509,

I tried the fix suggested in other posts:

set(data_set["hash"])

This still did not solve the problem. Is there any way to fix it? Thank you!

My version information is as follows:

  • datasets version: 3.2.0
  • Platform: Linux-4.19.91-014.15-kangaroo.alios7.x86_64-x86_64-with-glibc2.35
  • Python version: 3.11.10
  • huggingface_hub version: 0.26.5
  • PyArrow version: 17.0.0
  • Pandas version: 2.2.3
  • fsspec version: 2024.2.0

Apparently this is a PyArrow limitation, and although parts of it have been addressed, it still seems unresolved. @lhoestq
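
As far as I understand, the 2147483646-byte figure is the capacity of PyArrow's default string/binary arrays, which use 32-bit offsets, so a single column chunk cannot hold more than roughly 2 GiB. A quick back-of-the-envelope check using only the two numbers from your error message:

limit = 2_147_483_646    # ~2 GiB, the capacity reported in the error
needed = 10_761_561_509  # bytes that one column chunk apparently requires
print(needed / limit)    # ~5.0, i.e. about five times over the limit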

Yes, I have seen similar posts with the same issue:

Minhash Deduplication - #11 by conceptofmind

But I tried that method and it did not solve the error.
Is there any other way to solve this problem?
Thank you!


How about trying .shard()?
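
For example, a minimal sketch (the shard count here is arbitrary):

shard_0 = data_set.shard(num_shards=4, index=0)  # first of 4 contiguous slices
shard_1 = data_set.shard(num_shards=4, index=1)  # second slice, and so on

Each call returns a smaller Dataset that you can process independently.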

Thank you for your reply.
Doesn't .shard() only partition the dataset after the load_dataset() object has already been created?

But this error occurs during load_dataset() itself.


It may also be another limitation of PyArrow. If you set num_shards to around 20, maybe it will work… I hope it does.
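
Roughly what I have in mind, as a sketch (the processing step is a placeholder):

num_shards = 20
for i in range(num_shards):
    part = data_set.shard(num_shards=num_shards, index=i)  # work on one slice at a time
    # ... run your processing on this smaller piece ...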

I have already set num_shards to 100, but the same error still occurs:

data_set = load_dataset(self.data_file_path, cache_dir=cache_dir, split="train")
data_set = data_set.shard(num_shards=100, index=0)

It seems that the error already occurs while executing load_dataset(), before .shard() is ever reached.
