multiprocess.pool.RemoteTraceback and TypeError: Couldn't cast array of type string to null when loading Hugging Face dataset

I’m encountering an error while trying to load and process the GAIR/MathPile dataset with the Hugging Face datasets library. The error occurs during type casting in pyarrow inside a multiprocessing worker. Below is the code I’m using:

from datasets import Dataset, load_dataset
import os

def get_hf_dataset_gair(path: str = '~/data/GAIR/MathPile/train/') -> Dataset:
    path = os.path.expanduser(path)
    dataset = load_dataset(path, split='train', num_proc=os.cpu_count())
    print(dataset[0])  # Preview a single example from the dataset
    
    # Remove unnecessary columns
    all_columns = dataset.column_names
    all_columns.remove('text')
    dataset = dataset.remove_columns(all_columns)
    
    # Shuffle and select 10k examples
    dataset = dataset.shuffle(seed=42)
    dataset = dataset.select(range(10_000))  # select() expects an iterable of indices
    return dataset

# Build the dataset (this call raises the error below)
get_hf_dataset_gair()
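
Note that the traceback below fires inside load_dataset itself (in _prepare_split_single, while the Arrow tables are being written), so none of the column handling above ever runs. A minimal reproduction is just the loading call:

from datasets import load_dataset
import os

# Minimal reproduction: the error is raised during dataset preparation,
# before remove_columns/shuffle/select are ever reached.
path = os.path.expanduser('~/data/GAIR/MathPile/train/')
dataset = load_dataset(path, split='train', num_proc=os.cpu_count())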

Here is how I downloaded GAIR/MathPile:

source $AFS/.bashrc
conda activate beyond_scale_2

mkdir -p ~/data/GAIR/MathPile
huggingface-cli download --resume-download --repo-type dataset GAIR/MathPile --local-dir ~/data/GAIR/MathPile --local-dir-use-symlinks False

cd ~/data/GAIR/MathPile/
find . -type f -name "*.gz" -exec gzip -d {} \;
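
To help diagnose it, a quick sanity check can print the field names and value types of the first record in each extracted file (this assumes the extracted files are JSON Lines with a .jsonl extension; adjust the glob if yours differ):

import glob
import json
import os

# Heuristic check: print the fields and Python types of the first record
# in each file. A column that shows NoneType in some files but str in
# others is a likely culprit, since pyarrow infers the "null" type for a
# column that is entirely null within one file.
path = os.path.expanduser('~/data/GAIR/MathPile/train/')
for fname in sorted(glob.glob(os.path.join(path, '**', '*.jsonl'), recursive=True)):
    with open(fname) as f:
        first = json.loads(f.readline())
    print(fname, {k: type(v).__name__ for k, v in first.items()})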

I get the following error when running the code:

multiprocess.pool.RemoteTraceback:
"""
Traceback (most recent call last):
  File "/lfs/hyperturing1/0/brando9/miniconda/envs/beyond_scale_2/lib/python3.11/site-packages/datasets/builder.py", line 1869, in _prepare_split_single
    writer.write_table(table)
  ...
TypeError: Couldn't cast array of type string to null
"""

Here’s the full stack trace:

Traceback (most recent call last):
  File "/lfs/hyperturing1/0/brando9/miniconda/envs/beyond_scale_2/lib/python3.11/site-packages/datasets/builder.py", line 1869, in _prepare_split_single
    writer.write_table(table)
  File "/lfs/hyperturing1/0/brando9/miniconda/envs/beyond_scale_2/lib/python3.11/site-packages/datasets/arrow_writer.py", line 580, in write_table
    pa_table = table_cast(pa_table, self._schema)
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/lfs/hyperturing1/0/brando9/miniconda/envs/beyond_scale_2/lib/python3.11/site-packages/datasets/table.py", line 2283, in table_cast
    return cast_table_to_schema(table, schema)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  ...
TypeError: Couldn't cast array of type string to null

It seems to be a type-casting conflict between pyarrow types. My guess is that some column is entirely null in one shard, so pyarrow infers the null type for it, while the same column holds strings in another shard, and the two schemas can’t be reconciled; I’m not sure how to resolve this, though. I’ve verified that the dataset is correctly downloaded, and I’m using the following environment:

Hugging Face datasets version: 2.x.x
Python 3.11
OS: Linux (running on a server)
Multiprocessing is set to use all available CPUs (num_proc=os.cpu_count())
Has anyone encountered this issue before, or does anyone have suggestions on how to fix it?
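
One workaround I’m considering, though I haven’t verified it yet, is to pin the schema explicitly so pyarrow never has to reconcile a null-typed column with a string-typed one. This is only a sketch: the Features mapping lists just the text column, and the real shards may contain more fields, which would all need entries here. Using num_proc=1 should also give a traceback that is easier to read than the RemoteTraceback above.

from datasets import Features, Value, load_dataset
import os

# Sketch: declare the expected schema up front instead of letting pyarrow
# infer it per file. The column set is an assumption; any extra fields in
# the shards would need to be declared here too.
path = os.path.expanduser('~/data/GAIR/MathPile/train/')
features = Features({'text': Value('string')})
dataset = load_dataset('json', data_dir=path, features=features,
                       split='train', num_proc=1)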

