Map() with a tokenizer does not return pytorch tensors

mariosasko · August 23, 2023, 1:09pm

You can get Torch tensors by changing the dataset's format to Torch with .set_format("torch") (or .set_format("torch", device="cuda") to put them on the GPU).

map caches its results in an Arrow file, and Arrow doesn’t understand Torch tensors, so returning tensors from map would require storing additional metadata for each example to recover the original type when decoding later. We decided not to keep this metadata, for simplicity, and instead support changing the default (Arrow → Python) decoding with set_format/set_transform (docs).

