pyarrow.lib.FloatArray: did not recognize Python value type when inferring an Arrow data type

Hi everyone!

I am working with a large dataset that I want to convert into a Hugging Face dataset.
Since my workspace and the machine holding the dataset are not the same device, I created an HDF5 file (with h5py) and transferred it to my workspace.

Now I want to open that file and load its data into a new dataset.
My code is the following:

import h5py
import numpy as np
import datasets
import pyarrow as py

# Open the HDF5 file
SAVE_DIR = './data/'

features = h5py.File(SAVE_DIR + 'features.hdf5', 'r')

# Validation split
print("Process validation data")
valid_data = features["validation"]["data/features"]

print("Extract validation values")
# Read each HDF5 dataset into memory as float32 (the original dtype)
v_array_values = [np.float32(item[()]) for item in valid_data.values()]

print("Create validation dataset")
# Convert each example to a pyarrow array before building the Dataset
v_array = [py.array(item, type=py.float32()) for item in v_array_values]
dict_valid = datasets.Dataset.from_dict({'input_values': v_array})
dict_valid.save_to_disk(SAVE_DIR + "validation_dataset")

I know v_array_values is useless in this case; it is only there for testing and to make sure I really have float32 data, which is the original data type.
However, I get this error:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "C:\Users\<username>\miniforge3\envs\wav2vec_pretraining\lib\site-packages\datasets\arrow_dataset.py", line 859, in from_dict
    pa_table = InMemoryTable.from_pydict(mapping=mapping)
  File "C:\Users\<username>\miniforge3\envs\wav2vec_pretraining\lib\site-packages\datasets\table.py", line 750, in from_pydict
    return cls(pa.Table.from_pydict(*args, **kwargs))
  File "pyarrow\table.pxi", line 3625, in pyarrow.lib.Table.from_pydict
  File "pyarrow\table.pxi", line 5150, in pyarrow.lib._from_pydict
  File "pyarrow\array.pxi", line 342, in pyarrow.lib.asarray
  File "pyarrow\array.pxi", line 230, in pyarrow.lib.array
  File "pyarrow\array.pxi", line 110, in pyarrow.lib._handle_arrow_array_protocol
  File "C:\Users\<username>\miniforge3\envs\wav2vec_pretraining\lib\site-packages\datasets\arrow_writer.py", line 239, in arrow_array
    out = pa.array(cast_to_python_objects(data, only_1d_for_numpy=True, optimize_list_casting=False))
  File "pyarrow\array.pxi", line 316, in pyarrow.lib.array
  File "pyarrow\array.pxi", line 39, in pyarrow.lib._sequence_to_array
  File "pyarrow\error.pxi", line 144, in pyarrow.lib.pyarrow_internal_check_status
  File "pyarrow\error.pxi", line 100, in pyarrow.lib.check_status
pyarrow.lib.ArrowInvalid: Could not convert <pyarrow.lib.FloatArray object at 0x000002A3D769B760>
[
-0.0000020123275,
-0.0000020123275,
-0.0000020123275,
-0.0000020123275,
-0.0000020123275,
-0.0000020123275,
-0.0000020123275,
-0.0000020123275,
-0.0000020123275,
-0.0000020123275,

0.079432674,
-0.35756943,
-0.42882702,
0.0595785,
0.38494512,
0.25881445,
0.02550633,
-0.05002562,
-0.02196826,
-0.020529125
] with type pyarrow.lib.FloatArray: did not recognize Python value type when inferring an Arrow data type

I have tried a lot of things, like not converting my data to PyArrow arrays, using pandas, building a dictionary first, etc.
I think the problem might come from the HDF5 file, but I cannot reproduce the setup without it.
Do you have any idea where this could come from?

Thank you!

That error was caused by v_array and the use of py.array().
However, when I remove that conversion, I still get an error:

  File "C:\Users\<username>\miniforge3\envs\wav2vec_pretraining\lib\runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "C:\Users\<username>\miniforge3\envs\wav2vec_pretraining\lib\runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "c:\Users\<username>\.vscode\extensions\ms-python.python-2023.4.1\pythonFiles\lib\python\debugpy\adapter/…/…\debugpy\launcher/…/…\debugpy\__main__.py", line 39, in <module>
    cli.main()
  File "c:\Users\<username>\.vscode\extensions\ms-python.python-2023.4.1\pythonFiles\lib\python\debugpy\adapter/…/…\debugpy\launcher/…/…\debugpy/…\debugpy\server\cli.py", line 430, in main
    run()
  File "c:\Users\<username>\.vscode\extensions\ms-python.python-2023.4.1\pythonFiles\lib\python\debugpy\adapter/…/…\debugpy\launcher/…/…\debugpy/…\debugpy\server\cli.py", line 284, in run_file
    runpy.run_path(target, run_name="__main__")
  File "c:\Users\<username>\.vscode\extensions\ms-python.python-2023.4.1\pythonFiles\lib\python\debugpy\_vendored\pydevd\_pydevd_bundle\pydevd_runpy.py", line 321, in run_path
    return _run_module_code(code, init_globals, run_name,
  File "c:\Users\<username>\.vscode\extensions\ms-python.python-2023.4.1\pythonFiles\lib\python\debugpy\_vendored\pydevd\_pydevd_bundle\pydevd_runpy.py", line 135, in _run_module_code
    _run_code(code, mod_globals, init_globals,
  File "c:\Users\<username>\.vscode\extensions\ms-python.python-2023.4.1\pythonFiles\lib\python\debugpy\_vendored\pydevd\_pydevd_bundle\pydevd_runpy.py", line 124, in _run_code
    exec(code, run_globals)
  File "c:\Users\<username>\Documents\Codes_Python\pretraining_wav2vec\Code_Lightning\wav2vec_retraining\src\data\hdf5_to_validation.py", line 97, in <module>
    dict_valid = datasets.Dataset.from_dict({'input_values': v_array_values})
  File "C:\Users\<username>\miniforge3\envs\wav2vec_pretraining\lib\site-packages\datasets\arrow_dataset.py", line 859, in from_dict
    pa_table = InMemoryTable.from_pydict(mapping=mapping)
  File "C:\Users\<username>\miniforge3\envs\wav2vec_pretraining\lib\site-packages\datasets\table.py", line 750, in from_pydict
    return cls(pa.Table.from_pydict(*args, **kwargs))
  File "pyarrow\table.pxi", line 3625, in pyarrow.lib.Table.from_pydict
  File "pyarrow\table.pxi", line 5150, in pyarrow.lib._from_pydict
  File "pyarrow\array.pxi", line 342, in pyarrow.lib.asarray
  File "pyarrow\array.pxi", line 230, in pyarrow.lib.array
  File "pyarrow\array.pxi", line 110, in pyarrow.lib._handle_arrow_array_protocol
  File "C:\Users\<username>\miniforge3\envs\wav2vec_pretraining\lib\site-packages\datasets\arrow_writer.py", line 180, in arrow_array
    out = list_of_np_array_to_pyarrow_listarray(data)
  File "C:\Users\<username>\miniforge3\envs\wav2vec_pretraining\lib\site-packages\datasets\features\features.py", line 1330, in list_of_np_array_to_pyarrow_listarray
    return list_of_pa_arrays_to_pyarrow_listarray(
  File "C:\Users\<username>\miniforge3\envs\wav2vec_pretraining\lib\site-packages\datasets\features\features.py", line 1322, in list_of_pa_arrays_to_pyarrow_listarray
    offsets = pa.array(offsets, type=pa.int32())
  File "pyarrow\array.pxi", line 312, in pyarrow.lib.array
  File "pyarrow\array.pxi", line 83, in pyarrow.lib._ndarray_to_array
OverflowError: Python int too large to convert to C long

Do you have an idea of what could cause this error? I am on Windows, but I get an analogous error on a Linux machine too.
Thanks!

Hi! Yes, from_dict does not currently support PyArrow arrays as column values, but it should, so I’ll fix this for the next release.
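In the meantime, a workaround is to convert the PyArrow arrays back to plain Python lists (or NumPy arrays) before calling from_dict. A minimal sketch, with small made-up arrays standing in for your HDF5 data:

import numpy as np
import pyarrow as pa
import datasets

# Stand-in for your per-example pyarrow.lib.FloatArray objects
pa_rows = [pa.array(np.random.randn(5).astype(np.float32)) for _ in range(3)]

# Workaround: hand from_dict plain lists instead of FloatArray objects
plain_rows = [arr.to_pylist() for arr in pa_rows]
ds = datasets.Dataset.from_dict({"input_values": plain_rows})
print(ds)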

To fix the overflow error, we need to merge "Support LargeListArray in pyarrow" (Pull Request #4800 · huggingface/datasets · GitHub), which adds support for large lists. However, before merging it, we need to come up with a cleaner API for large lists. I hope to find some time to address this before Datasets 3.0.
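For context, the overflow happens because a regular Arrow ListArray stores its offsets as int32, so the total number of nested values across all rows has to stay below 2**31 - 1, while large_list uses int64 offsets. A rough illustration of the PyArrow types involved (not the datasets internals):

import pyarrow as pa

# int32 offsets cap the total number of nested elements at 2**31 - 1
try:
    pa.array([2**31], type=pa.int32())
except (OverflowError, pa.ArrowInvalid) as exc:
    print("offset overflow:", exc)

# large_list stores 64-bit offsets, which is what the linked PR exposes in datasets
print(pa.list_(pa.float32()))       # list<item: float>
print(pa.large_list(pa.float32()))  # large_list<item: float>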

Thank you for your response!
I removed the conversion to PyArrow arrays and dropped the values that do not reach the maximum length (so that all my data has the same size/shape).
It works with these changes :slight_smile:
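For anyone landing here later, a minimal sketch of what ended up working for me (MAX_LEN is a placeholder for the fixed length in your data): skip py.array() entirely, keep only the full-length examples, and feed NumPy float32 arrays straight to from_dict.

import h5py
import numpy as np
import datasets

SAVE_DIR = './data/'
MAX_LEN = 16000  # hypothetical fixed length; adjust to your data

features = h5py.File(SAVE_DIR + 'features.hdf5', 'r')
valid_data = features["validation"]["data/features"]

# Load each example as float32 and keep only the ones at the maximum length
v_array_values = [np.asarray(item[()], dtype=np.float32) for item in valid_data.values()]
v_array_values = [v for v in v_array_values if len(v) == MAX_LEN]

dict_valid = datasets.Dataset.from_dict({'input_values': v_array_values})
dict_valid.save_to_disk(SAVE_DIR + "validation_dataset")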
