Hi everyone!
I work with a large dataset that I want to convert into a Hugging Face dataset.
As my workspace and the dataset's storage are not on the same machine, I created an HDF5 file (with h5py) and transferred it to my workspace.
Now I want to open that file and load its data into a new dataset.
My code is the following:
import h5py
import datasets
import numpy as np
import pyarrow as py
# Load the HDF5 file
SAVE_DIR = './data/'
features = h5py.File(SAVE_DIR+'features.hdf5','r')
# Validation
print("Process validation data")
valid_data = features["validation"]["data/features"]
print("Extract validation values")
v_array_values = [np.float32(item[()]) for item in valid_data.values()]
print("Create validation dataset")
v_array = [py.array(item, type=py.float32()) for item in v_array_values]
dict_valid = datasets.Dataset.from_dict({'input_values': v_array})
dict_valid.save_to_disk(SAVE_DIR+"validation_dataset")
I know v_array_values is redundant here; it is only there for testing, to make sure the data is float32, which is the original data type.
However, I get this error:
Traceback (most recent call last):
  File "", line 1, in
  File "C:\Users<username>\miniforge3\envs\wav2vec_pretraining\lib\site-packages\datasets\arrow_dataset.py", line 859, in from_dict
    pa_table = InMemoryTable.from_pydict(mapping=mapping)
  File "C:\Users<username>\miniforge3\envs\wav2vec_pretraining\lib\site-packages\datasets\table.py", line 750, in from_pydict
    return cls(pa.Table.from_pydict(*args, **kwargs))
  File "pyarrow\table.pxi", line 3625, in pyarrow.lib.Table.from_pydict
  File "pyarrow\table.pxi", line 5150, in pyarrow.lib._from_pydict
  File "pyarrow\array.pxi", line 342, in pyarrow.lib.asarray
  File "pyarrow\array.pxi", line 230, in pyarrow.lib.array
  File "pyarrow\array.pxi", line 110, in pyarrow.lib._handle_arrow_array_protocol
  File "C:\Users<username>\miniforge3\envs\wav2vec_pretraining\lib\site-packages\datasets\arrow_writer.py", line 239, in arrow_array
    out = pa.array(cast_to_python_objects(data, only_1d_for_numpy=True, optimize_list_casting=False))
  File "pyarrow\array.pxi", line 316, in pyarrow.lib.array
  File "pyarrow\array.pxi", line 39, in pyarrow.lib._sequence_to_array
  File "pyarrow\error.pxi", line 144, in pyarrow.lib.pyarrow_internal_check_status
  File "pyarrow\error.pxi", line 100, in pyarrow.lib.check_status
pyarrow.lib.ArrowInvalid: Could not convert <pyarrow.lib.FloatArray object at 0x000002A3D769B760>
[
-0.0000020123275,
-0.0000020123275,
-0.0000020123275,
-0.0000020123275,
-0.0000020123275,
-0.0000020123275,
-0.0000020123275,
-0.0000020123275,
-0.0000020123275,
-0.0000020123275,
…
0.079432674,
-0.35756943,
-0.42882702,
0.0595785,
0.38494512,
0.25881445,
0.02550633,
-0.05002562,
-0.02196826,
-0.020529125
] with type pyarrow.lib.FloatArray: did not recognize Python value type when inferring an Arrow data type
I have tried several things, such as not converting my data to pyarrow arrays, going through pandas, building the dictionary beforehand, etc.
I suspect the problem might come from the HDF5 file, but I cannot reproduce it without that file.
Do you have any idea where this could come from?
Thank you!