Hi,
I have created a Hugging Face dataset that contains some columns holding arrays. The dtype of these arrays is cast to int32, but when I retrieve a value from the dataset it gives me an array with dtype int64. Here is a minimal example of the problem:
from datasets import Dataset
from datasets.features import Sequence, Value, Features
import pandas as pd
df = pd.DataFrame([[[1, 2, 3], [3, 4, 5]], [[10, 20, 30], [30, 40, 50]]], columns=["A", "B"])
d = Dataset.from_pandas(df, features=Features({
    "A": Sequence(Value(dtype="int32")),
    "B": Sequence(Value(dtype="int32")),
}))
d.set_format("numpy")
The output of d.features is as expected:
{'A': Sequence(feature=Value(dtype='int32', id=None), length=-1, id=None),
'B': Sequence(feature=Value(dtype='int32', id=None), length=-1, id=None)}
But when I get the dtype of one of the values with d["A"][0].dtype, it gives dtype('int64').
The same thing also happens for arrays with a float dtype: these are always returned as float32, no matter which dtype is specified in the feature.
I tried to find out in the code why this is happening, and the problem seems to be that the default dtype specified here is not overwritten with the dtype specified in the feature. I can call d._getitem(0, format_kwargs={"dtype": np.int32}), which returns the array with the correct dtype, but of course I cannot pass format_kwargs through the normal data access (e.g. d["A"][0]). Also, I think the correct dtype should be derived from the features without having to specify it manually every time.
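In the meantime I am working around it by casting after access, reading the declared dtype string from the feature myself. This is just a NumPy-only sketch of that workaround (cast_to_feature_dtype is a helper name I made up, not part of the datasets API):

```python
import numpy as np

def cast_to_feature_dtype(arr, feature_dtype):
    """Cast an array returned by the dataset back to the dtype
    declared in the corresponding feature, e.g. "int32"."""
    return arr.astype(np.dtype(feature_dtype), copy=False)

# What d["A"][0] currently returns: an int64 array despite the int32 feature.
returned = np.array([1, 2, 3], dtype=np.int64)
fixed = cast_to_feature_dtype(returned, "int32")
print(fixed.dtype)  # int32
```

This obviously defeats the point of declaring the dtype in the Features in the first place, which is why I'd prefer the formatter to do it.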
Is this behaviour intentional? If so, why?
Thanks a lot in advance