multiprocess.pool.RemoteTraceback and TypeError: Couldn't cast array of type string to null when loading Hugging Face dataset

I’m encountering an error while trying to load and process the GAIR/MathPile dataset with the Hugging Face datasets library. The error occurs during type casting in pyarrow inside a multiprocessing worker. Below is the code I’m using:

from datasets import Dataset, load_dataset
import os

def get_hf_dataset_gair(path: str = '~/data/GAIR/MathPile/train/') -> Dataset:
    path = os.path.expanduser(path)
    dataset = load_dataset(path, split='train', num_proc=os.cpu_count())
    print(dataset[0])  # Preview a single example from the dataset
    
    # Remove unnecessary columns
    all_columns = dataset.column_names
    all_columns.remove('text')
    dataset = dataset.remove_columns(all_columns)
    
    # Shuffle and select 10k examples
    dataset = dataset.shuffle(seed=42)
    dataset = dataset.select(range(10_000))  # select() expects an iterable of indices
    return dataset

# Build the dataset (this call raises the error below)
get_hf_dataset_gair()
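
Note that the traceback below fires inside load_dataset itself (in _prepare_split_single, while the Arrow tables are being written), so none of the column handling above ever runs. A minimal reproduction is just the loading call:

from datasets import load_dataset
import os

# Minimal reproduction: the error is raised during dataset preparation,
# before remove_columns/shuffle/select are ever reached.
path = os.path.expanduser('~/data/GAIR/MathPile/train/')
dataset = load_dataset(path, split='train', num_proc=os.cpu_count())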

Here is how I downloaded GAIR/MathPile:

source $AFS/.bashrc
conda activate beyond_scale_2

mkdir -p ~/data/GAIR/MathPile
huggingface-cli download --resume-download --repo-type dataset GAIR/MathPile --local-dir ~/data/GAIR/MathPile --local-dir-use-symlinks False

cd ~/data/GAIR/MathPile/
find . -type f -name "*.gz" -exec gzip -d {} \;
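
To help diagnose it, a quick sanity check can print the field names and value types of the first record in each extracted file (this assumes the extracted files are JSON Lines with a .jsonl extension; adjust the glob if yours differ):

import glob
import json
import os

# Heuristic check: print the fields and Python types of the first record
# in each file. A column that shows NoneType in some files but str in
# others is a likely culprit, since pyarrow infers the "null" type for a
# column that is entirely null within one file.
path = os.path.expanduser('~/data/GAIR/MathPile/train/')
for fname in sorted(glob.glob(os.path.join(path, '**', '*.jsonl'), recursive=True)):
    with open(fname) as f:
        first = json.loads(f.readline())
    print(fname, {k: type(v).__name__ for k, v in first.items()})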

I get the following error when running the code:

multiprocess.pool.RemoteTraceback:
"""
Traceback (most recent call last):
  File "/lfs/hyperturing1/0/brando9/miniconda/envs/beyond_scale_2/lib/python3.11/site-packages/datasets/builder.py", line 1869, in _prepare_split_single
    writer.write_table(table)
  ...
TypeError: Couldn't cast array of type string to null
"""

Here’s the full stack trace:

Traceback (most recent call last):
  File "/lfs/hyperturing1/0/brando9/miniconda/envs/beyond_scale_2/lib/python3.11/site-packages/datasets/builder.py", line 1869, in _prepare_split_single
    writer.write_table(table)
  File "/lfs/hyperturing1/0/brando9/miniconda/envs/beyond_scale_2/lib/python3.11/site-packages/datasets/arrow_writer.py", line 580, in write_table
    pa_table = table_cast(pa_table, self._schema)
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/lfs/hyperturing1/0/brando9/miniconda/envs/beyond_scale_2/lib/python3.11/site-packages/datasets/table.py", line 2283, in table_cast
    return cast_table_to_schema(table, schema)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  ...
TypeError: Couldn't cast array of type string to null

It seems to be a type-casting conflict between pyarrow types. My guess is that some column is entirely null in one shard, so pyarrow infers the null type for it, while the same column holds strings in another shard, and the two schemas can’t be reconciled; I’m not sure how to resolve this, though. I’ve verified that the dataset is correctly downloaded, and I’m using the following environment:

Hugging Face datasets version: 2.x.x
Python 3.11
OS: Linux (running on a server)
Multiprocessing is set to use all available CPUs (num_proc=os.cpu_count())
Has anyone encountered this issue before, or does anyone have suggestions on how to fix it?
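
One workaround I’m considering, though I haven’t verified it yet, is to pin the schema explicitly so pyarrow never has to reconcile a null-typed column with a string-typed one. This is only a sketch: the Features mapping lists just the text column, and the real shards may contain more fields, which would all need entries here. Using num_proc=1 should also give a traceback that is easier to read than the RemoteTraceback above.

from datasets import Features, Value, load_dataset
import os

# Sketch: declare the expected schema up front instead of letting pyarrow
# infer it per file. The column set is an assumption; any extra fields in
# the shards would need to be declared here too.
path = os.path.expanduser('~/data/GAIR/MathPile/train/')
features = Features({'text': Value('string')})
dataset = load_dataset('json', data_dir=path, features=features,
                       split='train', num_proc=1)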

