Map() with a tokenizer does not return pytorch tensors

I’m trying to tokenize a dataset and move all the torch tensors to gpu, but somehow this doesn’t work:

import datasets

cola = datasets.load_dataset('linxinyuan/cola')
cola_tokenized = cola.map(
    lambda examples: tokenizer(examples['text'], return_tensors="pt", padding=True, truncation=True).to('cuda'),
    batched=True,
    batch_size=16,
)

print(cola_tokenized['train'][0]['input_ids'])

The output is a plain Python list, not even a tensor:

[0, 2522, …, 2]

I tried calling cola.set_format("torch") before map() (which still doesn't explain why the previous run didn't return tensors), but then I get PyTorch tensors that are placed on the CPU, even though I call .to('cuda').

Any idea why this happens?

I have the same issue… Sometimes I get tensors and sometimes only a list. I noticed that when your data is larger than your batch size you don't get a tensor, and when your data is smaller you do get some tensors. However, if you print the shape or dtype of your data right before returning it from the function passed to .map, there are only tensors. When I tried that, there were tensors right before I returned the element to the map function. Maybe you could try that! Still, this function seems to behave randomly and does what it wants…

You can get Torch tensors by changing the format to Torch with .set_format("torch") (or .set_format("torch", device="cuda") to put them on GPU).
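
For instance, a minimal sketch of that approach (the bert-base-uncased tokenizer here is just an illustrative stand-in for whatever tokenizer you are using):

import datasets
from transformers import AutoTokenizer

# Illustrative tokenizer; any Hugging Face tokenizer works the same way.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

cola = datasets.load_dataset('linxinyuan/cola')

# Don't pass return_tensors="pt" here: map stores its output as Arrow,
# so any tensors would be converted back to Python lists anyway.
cola_tokenized = cola.map(
    lambda examples: tokenizer(examples['text'], padding=True, truncation=True),
    batched=True,
    batch_size=16,
)

# Decode the columns as Torch tensors, optionally directly on GPU.
cola_tokenized.set_format("torch", device="cuda")
print(cola_tokenized['train'][0]['input_ids'].device)  # expected: cuda:0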

map caches results into an Arrow file, and Arrow doesn’t understand Torch tensors, so this would require storing additional metadata for each example to recover the initial type for decoding later. We decided not to keep this metadata for simplicity and instead support changing the default (Arrow → Python) decoding with set_format/set_transform (docs).
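
If you'd rather not bake the tokenization into the cached Arrow file at all, a rough alternative sketch (again with an illustrative tokenizer) is set_transform, which applies the tokenizer lazily at access time, so the returned tensors keep their type and device:

import datasets
from transformers import AutoTokenizer

# Illustrative tokenizer and dataset, mirroring the question above.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
cola = datasets.load_dataset('linxinyuan/cola')

def tokenize(batch):
    # Runs on the fly on each accessed batch; nothing is written to Arrow,
    # so the CUDA tensors come back as-is.
    return tokenizer(batch['text'], return_tensors="pt", padding=True, truncation=True).to('cuda')

cola.set_transform(tokenize)
print(cola['train'][0]['input_ids'].device)  # expected: cuda:0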
