Map() with a tokenizer does not return pytorch tensors

I’m trying to tokenize a dataset and move all the torch tensors to gpu, but somehow this doesn’t work:

import datasets

cola = datasets.load_dataset('linxinyuan/cola')
cola_tokenized = cola.map(
    lambda examples: tokenizer(examples['text'], return_tensors="pt", padding=True, truncation=True).to('cuda'),
    batched=True,
    batch_size=16,
)

print(cola_tokenized['train'][0]['input_ids'])

The output is a plain Python list, not even a tensor:

[0, 2522, …, 2]

I tried calling cola.set_format("torch") before map() (which still doesn't explain why the previous run didn't return tensors), but then I get PyTorch tensors that are placed on the CPU, even though I call .to('cuda').

Any idea why this happens?

I have the same issue… Sometimes I get tensors and sometimes only a list. I noticed that when your data is larger than your batch size you don't get a tensor, and when your data is smaller you do get some tensors. However, if you print the shape or dtype of your data right before returning it from the function passed to .map, there are only tensors. When I tried that, there were tensors right before I returned the element to the map function. Maybe you could try that! Still, this function seems to behave randomly and does what it wants…

You can get Torch tensors by changing the format to Torch with .set_format("torch") (or .set_format("torch", device="cuda") to put them on GPU).
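
For instance, a minimal sketch of that approach (the bert-base-uncased tokenizer here is just an illustrative stand-in for whatever tokenizer you are using):

import datasets
from transformers import AutoTokenizer

# Illustrative tokenizer; any Hugging Face tokenizer works the same way.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

cola = datasets.load_dataset('linxinyuan/cola')

# Don't pass return_tensors="pt" here: map stores its output as Arrow,
# so any tensors would be converted back to Python lists anyway.
cola_tokenized = cola.map(
    lambda examples: tokenizer(examples['text'], padding=True, truncation=True),
    batched=True,
    batch_size=16,
)

# Decode the columns as Torch tensors, optionally directly on GPU.
cola_tokenized.set_format("torch", device="cuda")
print(cola_tokenized['train'][0]['input_ids'].device)  # expected: cuda:0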

map caches results into an Arrow file, and Arrow doesn’t understand Torch tensors, so this would require storing additional metadata for each example to recover the initial type for decoding later. We decided not to keep this metadata for simplicity and instead support changing the default (Arrow → Python) decoding with set_format/set_transform (docs).
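
If you'd rather not bake the tokenization into the cached Arrow file at all, a rough alternative sketch (again with an illustrative tokenizer) is set_transform, which applies the tokenizer lazily at access time, so the returned tensors keep their type and device:

import datasets
from transformers import AutoTokenizer

# Illustrative tokenizer and dataset, mirroring the question above.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
cola = datasets.load_dataset('linxinyuan/cola')

def tokenize(batch):
    # Runs on the fly on each accessed batch; nothing is written to Arrow,
    # so the CUDA tensors come back as-is.
    return tokenizer(batch['text'], return_tensors="pt", padding=True, truncation=True).to('cuda')

cola.set_transform(tokenize)
print(cola['train'][0]['input_ids'].device)  # expected: cuda:0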
