Dataset map returns only lists instead of torch tensors

mariosasko · December 22, 2022, 7:49pm

Hi! When it comes to tensors, PyArrow (the storage format we use) only understands 1D arrays, so fully restoring the original types after `map` would require storing a (potentially) significant amount of metadata. Also, a `map` transform can return different value types for the same column (e.g. PyTorch tensors for some examples and Python lists for others), which would make this process ambiguous: should indexing the dataset afterwards return mixed types or only one type? So, for the sake of simplicity, we return Python objects by default, and for other formats, one can use `with_format`/`with_transform`.
