Hi,
I have created a Hugging Face dataset that contains some columns holding arrays. The dtype of these arrays is cast to int32, but when I retrieve a value from the dataset it gives me an array with dtype int64. Here is a minimal example of the problem:
from datasets import Dataset
from datasets.features import Sequence, Value, Features
import pandas as pd
df = pd.DataFrame([[[1, 2, 3], [3, 4, 5]], [[10, 20, 30], [30, 40, 50]]], columns=["A", "B"])
d = Dataset.from_pandas(df, features=Features({
    "A": Sequence(Value(dtype="int32")),
    "B": Sequence(Value(dtype="int32")),
}))
d.set_format("numpy")
The output of d.features is as expected:
{'A': Sequence(feature=Value(dtype='int32', id=None), length=-1, id=None),
'B': Sequence(feature=Value(dtype='int32', id=None), length=-1, id=None)}
But when I get the dtype of one of the values with d["A"][0].dtype, it gives dtype('int64').
The same thing also happens for arrays with a float dtype: these are always returned as float32, no matter which dtype is specified in the feature.
I tried to find out in the code why this is happening, and the problem seems to be that the default dtype specified here is not overwritten with the dtype specified in the feature. I can call d._getitem(0, format_kwargs={"dtype": np.int32}), which returns the array with the correct dtype, but of course I cannot pass format_kwargs through the normal data access (e.g. d["A"][0]). Also, I think the correct dtype should be derived from the features without having to specify it manually every time.
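In the meantime I am working around it by casting after access, reading the declared dtype string from the feature myself. This is just a NumPy-only sketch of that workaround (cast_to_feature_dtype is a helper name I made up, not part of the datasets API):

```python
import numpy as np

def cast_to_feature_dtype(arr, feature_dtype):
    """Cast an array returned by the dataset back to the dtype
    declared in the corresponding feature, e.g. "int32"."""
    return arr.astype(np.dtype(feature_dtype), copy=False)

# What d["A"][0] currently returns: an int64 array despite the int32 feature.
returned = np.array([1, 2, 3], dtype=np.int64)
fixed = cast_to_feature_dtype(returned, "int32")
print(fixed.dtype)  # int32
```

This obviously defeats the point of declaring the dtype in the Features in the first place, which is why I'd prefer the formatter to do it.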
Is this behaviour intentional? If so, why?
Thanks a lot in advance