I’m encountering an error while trying to load and process the GAIR/MathPile dataset using the Hugging Face datasets library. The error appears to happen during type casting in pyarrow inside a multiprocessing worker. Below is the code I’m using:
from datasets import Dataset, load_dataset
import os

def get_hf_dataset_gair(path: str = '~/data/GAIR/MathPile/train/') -> Dataset:
    path = os.path.expanduser(path)  # expand '~' to the home directory
    dataset = load_dataset(path, split='train', num_proc=os.cpu_count())
    print(dataset[0])  # Preview a single example from the dataset
    # Keep only the 'text' column
    all_columns = dataset.column_names
    all_columns.remove('text')
    dataset = dataset.remove_columns(all_columns)
    # Shuffle and select 10k examples (select expects an iterable of indices)
    dataset = dataset.shuffle(seed=42)
    dataset = dataset.select(range(10_000))
    return dataset

# Build the dataset
get_hf_dataset_gair()
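For debugging, here is a minimal single-process sketch (the helper name is just illustrative; same path assumption as above) that I can run so the underlying pyarrow error surfaces directly instead of being wrapped in multiprocess.pool.RemoteTraceback:
import os
from datasets import load_dataset

def load_gair_single_proc(path: str = '~/data/GAIR/MathPile/train/'):
    path = os.path.expanduser(path)
    # Default (single-process) preparation: the original casting error is
    # raised in the main process with a direct traceback.
    return load_dataset(path, split='train')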
These are the shell commands I used to download and decompress the dataset:
source $AFS/.bashrc
conda activate beyond_scale_2
mkdir -p ~/data/GAIR/MathPile
huggingface-cli download --resume-download --repo-type dataset GAIR/MathPile --local-dir ~/data/GAIR/MathPile --local-dir-use-symlinks False
cd ~/data/GAIR/MathPile/
find . -type f -name "*.gz" -exec gzip -d {} \;
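As a quick sanity check of the download and gzip -d step, I can run something like this sketch (it assumes the decompressed shards end in .jsonl, which I haven’t confirmed for every subset) to count file extensions under train/:
import glob
import os
from collections import Counter

root = os.path.expanduser('~/data/GAIR/MathPile/train/')
paths = glob.glob(os.path.join(root, '**', '*'), recursive=True)
# Count files per extension; leftover '.gz' entries would mean decompression
# did not finish for some shards.
print(Counter(os.path.splitext(p)[1] for p in paths if os.path.isfile(p)))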
I get the following error when running the code:
multiprocess.pool.RemoteTraceback:
"""
Traceback (most recent call last):
File "/lfs/hyperturing1/0/brando9/miniconda/envs/beyond_scale_2/lib/python3.11/site-packages/datasets/builder.py", line 1869, in _prepare_split_single
writer.write_table(table)
...
TypeError: Couldn't cast array of type string to null
Here’s the full stack trace:
Traceback (most recent call last):
File "/lfs/hyperturing1/0/brando9/miniconda/envs/beyond_scale_2/lib/python3.11/site-packages/datasets/builder.py", line 1869, in _prepare_split_single
writer.write_table(table)
File "/lfs/hyperturing1/0/brando9/miniconda/envs/beyond_scale_2/lib/python3.11/site-packages/datasets/arrow_writer.py", line 580, in write_table
pa_table = table_cast(pa_table, self._schema)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/lfs/hyperturing1/0/brando9/miniconda/envs/beyond_scale_2/lib/python3.11/site-packages/datasets/table.py", line 2283, in table_cast
return cast_table_to_schema(table, schema)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
...
TypeError: Couldn't cast array of type string to null
The failure looks like a type-casting problem between pyarrow types. I suspect it has to do with the dataset schema (perhaps a column that is inferred as null in some shards but contains strings in others), but I’m not sure how to resolve it; a per-file diagnostic I’m planning to try is sketched after the environment details below. I’ve verified that the dataset is correctly downloaded, and I’m using the following environment:
- Hugging Face datasets version: 2.x.x
- Python: 3.11
- OS: Linux (running on a server)
- Multiprocessing: num_proc=os.cpu_count(), i.e. all available CPUs
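Here is the per-file diagnostic I’m planning to try (untested sketch; it assumes the decompressed shards are .jsonl files): load each shard on its own and print the inferred features, to spot any file where a column is inferred as null rather than string:
import glob
import os
from datasets import load_dataset

root = os.path.expanduser('~/data/GAIR/MathPile/train/')
for shard in sorted(glob.glob(os.path.join(root, '**', '*.jsonl'), recursive=True)):
    try:
        ds = load_dataset('json', data_files=shard, split='train')
        print(shard, ds.features)  # look for columns typed as 'null'
    except Exception as err:
        # Print per-shard failures directly instead of one pooled traceback.
        print(shard, 'FAILED:', err)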
Has anyone encountered this issue before, or does anyone have suggestions on how to fix it?
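One workaround I’m considering, but haven’t tested, is to sidestep schema inference entirely and build the dataset from a generator that keeps only the text field (sketch below; it assumes the decompressed shards are .jsonl files and that every record has a 'text' key):
import glob
import json
import os
from datasets import Dataset

def gen_text(root: str):
    for shard in sorted(glob.glob(os.path.join(root, '**', '*.jsonl'), recursive=True)):
        with open(shard, encoding='utf-8') as f:
            for line in f:
                # Yield only the 'text' field so optional fields that are null
                # in some shards cannot produce conflicting arrow schemas.
                yield {'text': json.loads(line)['text']}

root = os.path.expanduser('~/data/GAIR/MathPile/train/')
dataset = Dataset.from_generator(gen_text, gen_kwargs={'root': root})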
- ref: GAIR/MathPile · Still errors with GAIR loading dataset
- ref: multiprocess.pool.RemoteTraceback and TypeError: Couldn't cast array of type string to null when loading Hugging Face dataset (Stack Overflow)