Pyarrow failed to parse string

I’ve got several pandas dataframes saved to csv files. I’m trying to create a single Dataset object by loading them with load_dataset():

my_ds = load_dataset('/path/to/data_dir')

I haven’t explicitly checked, but I’m pretty certain all the labels in the label column are strings. Whenever I try to load the dataset, I get the following error:

pyarrow.lib.ArrowInvalid: Failed to parse string: 'a0d6fb' as a scalar of type int64

Here is the full traceback:

Resolving data files: 100%|██████████| 447/447 [00:00<00:00, 111898.17it/s]
Using custom data configuration pd_data_test-d4ecbb8864e740ad
Downloading and preparing dataset csv/pd_data_test to /home/aclifton/.cache/huggingface/datasets/csv/pd_data_test-d4ecbb8864e740ad/0.0.0/433e0ccc46f9880962cc2b12065189766fbb2bee57a221866138fb9203c83519...
Downloading data files: 100%|██████████| 1/1 [00:00<00:00, 82.99it/s]
Extracting data files: 100%|██████████| 1/1 [00:00<00:00,  7.33it/s]
Traceback (most recent call last):
  File "/home/aclifton/rf_fp/gather_files.py", line 71, in <module>
    my_ds = load_dataset('/home/aclifton/rf_fp/pd_data_test')
  File "/home/aclifton/anaconda3/envs/rffp/lib/python3.9/site-packages/datasets/load.py", line 1691, in load_dataset
    builder_instance.download_and_prepare(
  File "/home/aclifton/anaconda3/envs/rffp/lib/python3.9/site-packages/datasets/builder.py", line 605, in download_and_prepare
    self._download_and_prepare(
  File "/home/aclifton/anaconda3/envs/rffp/lib/python3.9/site-packages/datasets/builder.py", line 694, in _download_and_prepare
    self._prepare_split(split_generator, **prepare_split_kwargs)
  File "/home/aclifton/anaconda3/envs/rffp/lib/python3.9/site-packages/datasets/builder.py", line 1154, in _prepare_split
    writer.write_table(table)
  File "/home/aclifton/anaconda3/envs/rffp/lib/python3.9/site-packages/datasets/arrow_writer.py", line 523, in write_table
    pa_table = table_cast(pa_table, self._schema)
  File "/home/aclifton/anaconda3/envs/rffp/lib/python3.9/site-packages/datasets/table.py", line 1860, in table_cast
    return cast_table_to_schema(table, schema)
  File "/home/aclifton/anaconda3/envs/rffp/lib/python3.9/site-packages/datasets/table.py", line 1843, in cast_table_to_schema
    arrays = [cast_array_to_feature(table[name], feature) for name, feature in features.items()]
  File "/home/aclifton/anaconda3/envs/rffp/lib/python3.9/site-packages/datasets/table.py", line 1843, in <listcomp>
    arrays = [cast_array_to_feature(table[name], feature) for name, feature in features.items()]
  File "/home/aclifton/anaconda3/envs/rffp/lib/python3.9/site-packages/datasets/table.py", line 1672, in wrapper
    return pa.chunked_array([func(chunk, *args, **kwargs) for chunk in array.chunks])
  File "/home/aclifton/anaconda3/envs/rffp/lib/python3.9/site-packages/datasets/table.py", line 1672, in <listcomp>
    return pa.chunked_array([func(chunk, *args, **kwargs) for chunk in array.chunks])
  File "/home/aclifton/anaconda3/envs/rffp/lib/python3.9/site-packages/datasets/table.py", line 1808, in cast_array_to_feature
    return array_cast(array, feature(), allow_number_to_str=allow_number_to_str)
  File "/home/aclifton/anaconda3/envs/rffp/lib/python3.9/site-packages/datasets/table.py", line 1674, in wrapper
    return func(array, *args, **kwargs)
  File "/home/aclifton/anaconda3/envs/rffp/lib/python3.9/site-packages/datasets/table.py", line 1741, in array_cast
    return array.cast(pa_type)
  File "pyarrow/array.pxi", line 826, in pyarrow.lib.Array.cast
  File "/home/aclifton/anaconda3/envs/rffp/lib/python3.9/site-packages/pyarrow/compute.py", line 375, in cast
    return call_function("cast", [arr], options)
  File "pyarrow/_compute.pyx", line 531, in pyarrow._compute.call_function
  File "pyarrow/_compute.pyx", line 330, in pyarrow._compute.Function.call
  File "pyarrow/error.pxi", line 143, in pyarrow.lib.pyarrow_internal_check_status
  File "pyarrow/error.pxi", line 99, in pyarrow.lib.check_status
pyarrow.lib.ArrowInvalid: Failed to parse string: 'a0d6fb' as a scalar of type int64

Any ideas about what might be going on? Thanks in advance for your help!


Hi! We use this code to read CSV files in datasets: datasets/csv.py at f3b6697011cb6fc568b8f8b32f53501a8f2e8967 · huggingface/datasets · GitHub. As you can see, the files are processed in chunks, so this could mean some chunks in your data contain string labels and some integer labels. Please verify that’s not the case.
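One way to verify this is to scan the raw label values in every file and see which files would look like pure integers to a type-inferring reader. The snippet below is only a sketch: it uses the standard-library csv module rather than datasets or pandas, the function names are made up for illustration, and it assumes the column is literally called label and that all files sit directly in one directory.

```python
import csv
from pathlib import Path


def looks_like_int(value: str) -> bool:
    """Return True if the raw CSV cell parses as a plain Python integer."""
    try:
        int(value.strip())
        return True
    except ValueError:
        return False


def summarize_label_column(data_dir, column="label"):
    """Map each CSV file name to 'int', 'str', 'mixed', or 'empty'.

    'column' is an assumed name; change it to match your header.
    """
    summary = {}
    for path in sorted(Path(data_dir).glob("*.csv")):
        with path.open(newline="") as f:
            values = [row[column] for row in csv.DictReader(f)]
        int_like = [looks_like_int(v) for v in values]
        if not values:
            summary[path.name] = "empty"
        elif all(int_like):
            summary[path.name] = "int"      # would likely be inferred as an integer column
        elif not any(int_like):
            summary[path.name] = "str"
        else:
            summary[path.name] = "mixed"    # digits-only and non-numeric labels together
    return summary


if __name__ == "__main__":
    # Hypothetical path, taken from the question above.
    for name, kind in summarize_label_column("/path/to/data_dir").items():
        print(name, kind)
```

If some files come back as int or mixed, that would be consistent with the error above: a label such as 'a0d6fb' cannot be cast once the column has been typed as int64. In that case it may help to declare the column type up front instead of relying on inference, for example by passing features=Features({...}) with Value("string") for the label column to load_dataset("csv", data_files=...).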