Dataset map returns only lists instead of torch tensors

MPA · March 16, 2022, 9:56am

When I use map on a Dataset object, 'input_ids' is returned as a list instead of a tensor.

def tokenize(batch):
    return tokenizer(batch['text'], padding=True, return_tensors='pt', truncation=True).to(DEVICE)

data.map(tokenize, batched=False, batch_size=None) → returns a list

tokenizer(data['text'], padding=True, return_tensors='pt', truncation=True).to(DEVICE) → returns a tensor

mariosasko · March 21, 2022, 12:18pm

Hi! map ignores tensor formatting while writing a cache file, so to get PyTorch tensors under the input_ids column, you need to explicitly call set_format("pt", columns=["input_ids"], output_all_columns=True) on the dataset object (after map).

murphyk · September 29, 2022, 10:36pm

I got burned by this as well. Maybe the documentation (examples) for map could be updated to explain this?

Davidg707 · December 21, 2022, 4:18am

I got burned by this too! Definitely seems like an issue in the library that .map doesn’t return the objects that the mapped function returns.

And I’d just finished the course which has a whole section about bugs caused by not setting return_tensors, but the examples never mention this issue with .map.

mariosasko · December 22, 2022, 7:49pm

Hi! When it comes to tensors, PyArrow (the storage format we use) only understands 1D arrays, so we would have to store (potentially) a significant amount of metadata to be able to fully restore the types after map. Also, a map transform can return different value types for the same column (e.g. PyTorch tensors or Python lists), which would make this process ambiguous (e.g. should we return mixed types or only one type when indexing the dataset afterwards?). So for the sake of simplicity, we return Python objects by default, and for other formats, one can use with_format/with_transform.

sd3ntato · June 27, 2023, 11:01am

I ended up using a data collator to take the lists and turn them back into tensors.
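The post does not say which collator was used. As one possible illustration of the idea, here is a hand-written collate function (hypothetical, with an assumed padding id of 0) that pads the lists and converts them to tensors at batch time with a plain PyTorch `DataLoader`.

```python
import torch
from torch.utils.data import DataLoader

# Hypothetical collate function: pads each list of token ids to the
# longest one in the batch and stacks the result into tensors.
def collate(features, pad_id=0):
    max_len = max(len(f["input_ids"]) for f in features)
    input_ids, attention_mask = [], []
    for f in features:
        ids = f["input_ids"]
        padding = max_len - len(ids)
        input_ids.append(ids + [pad_id] * padding)
        attention_mask.append([1] * len(ids) + [0] * padding)
    return {
        "input_ids": torch.tensor(input_ids),
        "attention_mask": torch.tensor(attention_mask),
    }

# Rows shaped like what a mapped dataset returns: lists, not tensors.
rows = [{"input_ids": [101, 7, 102]}, {"input_ids": [101, 102]}]
loader = DataLoader(rows, batch_size=2, collate_fn=collate)
batch = next(iter(loader))
```

Padding per batch this way also avoids padding every example to the longest sequence in the whole dataset.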

kenfus · March 1, 2024, 4:12pm

Yep, I also got burned by this. What is the point of returning a list of lists when it’s a torch tensor? Inside the map function it’s a tensor, and afterwards it’s a list?

hammer · April 10, 2024, 3:48pm

Also burned by this coercion, would be helpful to note the PyArrow limitation (although Tensor may be helpful here?) and workarounds in the docs…

Edit: looks like there’s some discussion on how to use PyArrow’s Tensor type in huggingface/datasets GitHub issue #5272, “Use pyarrow Tensor dtype”.

Pavel6453 · March 17, 2025, 12:20pm

Damn, that’s sneaky! I just spent a whole day debugging this thing. I mean, the tokenizer does return PyTorch tensors, but not if it’s called within map; map just ignores that argument! Well, turns out it was some internal implementation thing that’s supposed to be “expected”. I really wish it was explicitly stated in the docs, not hidden like that.
