I load the dataset with:
data_set = load_dataset(self.data_file_path, cache_dir=cache_dir, split="train")
Loading the dataset (approximately 84 GB) reports the following error:
pyarrow.lib.ArrowCapacityError: array cannot contain more than 2147483646 bytes, have 10761561509,
I tried the setup suggested in other posts:
set(data_set["hash"])
I still haven’t solved the problem above. Is there any way to fix it? Thank you!
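For reference, here is a minimal sketch of a batched alternative to `set(data_set["hash"])`. It rests on an unconfirmed assumption: that the error comes from materializing the whole `hash` column as a single Arrow array (the message reports 10761561509 bytes against a limit of 2147483646). The helper name `unique_values` and the batch size are made up for illustration, and the demo uses stand-in batches rather than the real dataset.

```python
from typing import Dict, Hashable, Iterable, List, Set


def unique_values(
    batches: Iterable[Dict[str, List[Hashable]]], column: str
) -> Set[Hashable]:
    """Collect the distinct values of `column` one batch at a time.

    Hypothetical helper: only one batch is held as a Python list at any
    moment, instead of the full column.
    """
    seen: Set[Hashable] = set()
    for batch in batches:
        seen.update(batch[column])
    return seen


# Stand-in batches with the same shape that `Dataset.iter(batch_size=...)`
# yields: a dict mapping each column name to a list of values.
demo_batches = [
    {"hash": ["a", "b"]},
    {"hash": ["b", "c"]},
]
demo_unique = unique_values(demo_batches, "hash")

# With the real dataset the call would presumably look like this
# (assumption, not tested at the 84 GB scale):
# hashes = unique_values(
#     data_set.select_columns(["hash"]).iter(batch_size=100_000), "hash"
# )
```

The resulting set still has to fit in memory, so this only helps if the number of distinct hashes is manageable.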
My version information is as follows:
- datasets version: 3.2.0
- Platform: Linux-4.19.91-014.15-kangaroo.alios7.x86_64-x86_64-with-glibc2.35
- Python version: 3.11.10
- huggingface_hub version: 0.26.5
- PyArrow version: 17.0.0
- Pandas version: 2.2.3
- fsspec version: 2024.2.0