load_dataset raises pyarrow.lib.ArrowCapacityError

I load the dataset with

data_set = load_dataset(self.data_file_path, cache_dir=cache_dir, split="train")

When loading the dataset (approximately 84 GB), I get this error:

pyarrow.lib.ArrowCapacityError: array cannot contain more than 2147483646 bytes, have 10761561509,

I tried following the suggestions in other posts:

set(data_set["hash"])

I still haven't solved the problem. Is there a way to fix it? Thank you!

My version information is as follows:

  • datasets version: 3.2.0
  • Platform: Linux-4.19.91-014.15-kangaroo.alios7.x86_64-x86_64-with-glibc2.35
  • Python version: 3.11.10
  • huggingface_hub version: 0.26.5
  • PyArrow version: 17.0.0
  • Pandas version: 2.2.3
  • fsspec version: 2024.2.0
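
For reference, a batched variant of the set(data_set["hash"]) step could be sketched as below. This is only a sketch, assuming the error comes from materializing the whole "hash" column as one Arrow array (10761561509 bytes here, well above the 2147483646-byte limit in the message). The helper name unique_column_values and the batch size are hypothetical, and this would not help if the error is already raised inside load_dataset itself.

```python
def unique_column_values(dataset, column, batch_size=10_000):
    """Collect the distinct values of `column` by reading the dataset in batches.

    Assumes `dataset` exposes `iter(batch_size=...)` as `datasets.Dataset` does,
    yielding one dict per batch that maps column names to lists of values.
    """
    seen = set()
    for batch in dataset.iter(batch_size=batch_size):
        # Only `batch_size` rows are converted to Python objects at a time,
        # so no single array has to hold the entire column.
        seen.update(batch[column])
    return seen
```

It would be called as unique_column_values(data_set, "hash") in place of set(data_set["hash"]).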