I use
data_set = load_dataset(self.data_file_path, cache_dir=cache_dir, split="train")
I get the following error when loading the dataset (approximately 84 GB):
pyarrow.lib.ArrowCapacityError: array cannot contain more than 2147483646 bytes, have 10761561509
I tried the workaround suggested in other posts:
set(data_set["hash"])
but I still haven't solved the problem. Is there any way you can help me solve it? Thank you!
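For reference, the relevant part of my code looks roughly like this (the path and cache directory below are placeholders for my own values):

from datasets import load_dataset

# placeholder values; in my code these come from self.data_file_path and my configured cache_dir
data_file_path = "/data/my_84gb_dataset"
cache_dir = "/data/hf_cache"

# this call raises pyarrow.lib.ArrowCapacityError
data_set = load_dataset(data_file_path, cache_dir=cache_dir, split="train")

# deduplication workaround suggested in other posts, which did not resolve the error
unique_hashes = set(data_set["hash"])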
My version information is as follows:
- datasets version: 3.2.0
- Platform: Linux-4.19.91-014.15-kangaroo.alios7.x86_64-x86_64-with-glibc2.35
- Python version: 3.11.10
- huggingface_hub version: 0.26.5
- PyArrow version: 17.0.0
- Pandas version: 2.2.3
- fsspec version: 2024.2.0
Apparently, this is an issue with PyArrow; although parts of it have been addressed, it still seems to be unresolved. @lhoestq
Yes, I have seen similar posts with the same issue:
Minhash Deduplication - #11 by conceptofmind
But I tried this method and it didn't solve the error.
May I ask if there is any way you can help me solve this problem?
Thank you!
How about trying .shard()?
Thank you for your reply.
Doesn't .shard() only partition the dataset object after load_dataset() has already created it?
But this error occurs during load_dataset() itself.
It may also be another limitation of PyArrow. If you set num_shards to around 20, maybe it will work… I hope it does.
I have already set num_shards to 100, but the same error still occurs:
data_set = load_dataset(self.data_file_path, cache_dir=cache_dir, split="train")
data_set = data_set.shard(num_shards=100, index=0)
It seems that the error already occurs while executing load_dataset().
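For completeness, one thing that might sidestep this, since the error seems to come from building the Arrow table during loading rather than from anything done afterwards, is streaming mode. This is only an untested sketch with a placeholder path:

from datasets import load_dataset

# streaming=True returns an IterableDataset and skips writing the dataset to an
# Arrow cache file, which appears to be where the 2 GB per-array limit is hit;
# "/data/my_84gb_dataset" is a placeholder path
data_set = load_dataset("/data/my_84gb_dataset", split="train", streaming=True)

# examples are read lazily instead of being materialized up front
first_example = next(iter(data_set))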