Getting list of tensors instead of tensor array after using set_format

mariosasko · November 30, 2021, 11:53am

Hi,

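this inconsistency is due to how PyArrow converts nested sequences to NumPy by default: a nested column comes back as an array of arrays rather than a single multi-dimensional array, so set_format gives you a list of tensors. You can fix it by casting the corresponding column to one of the ArrayXD types (Array2D, Array3D, ...).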

E.g. in your example:

from datasets import Array2D, Dataset, Features, Value

dset = Dataset.from_dict(
    {"a": [[[2,1],[2,2]], [[3,1],[3,2]]], "b": [1,1]},
    features=Features({"a": Array2D(shape=(2, 2), dtype="int32"), "b": Value("int32")})
)
dset.set_format('torch', columns=['a','b'])

If you want to cast the existing dataset, use map instead of cast (cast fails on special extension types):

features = Features({"a": Array2D(shape=(2, 2), dtype="int32"), "b": Value("int32")})
dset = Dataset.from_dict({"a": [[[2,1],[2,2]], [[3,1],[3,2]]], "b": [1,1]})
# identity map: the data is unchanged, but column "a" is re-encoded as Array2D
dset = dset.map(lambda batch: batch, batched=True, features=features)
dset.set_format('torch', columns=['a','b'])
