TypeError when applying map after set_format(type='torch')

simonschoe · September 13, 2022, 6:57am

Hi there,

I am trying to run a simple forward_pass function on my dataset, which produces the following error:

dataset['train'].set_format(type='torch', columns=['input_ids', 'attention_mask', 'label'])

def forward_pass(batch):
    input_ids = torch.tensor(batch['input_ids']).to(device)
    attention_mask = torch.tensor(batch['attention_mask']).to(device)
    with torch.no_grad():
        batch['logits'] = model(input_ids, attention_mask)['logits'].cpu().numpy()
    return batch

dataset['train'].map(forward_pass, batched=True, batch_size=16)

TypeError: Provided `function` which is applied to all elements of table returns a `dict` of types [<class 'torch.Tensor'>, <class 'torch.Tensor'>, <class 'torch.Tensor'>, <class 'numpy.ndarray'>]. When using `batched=True`, make sure provided `function` returns a `dict` of types like `(<class 'list'>, <class 'numpy.ndarray'>)`.

The error does not occur when I convert to numpy instead of torch:

dataset['train'].set_format(type='numpy', columns=['input_ids', 'attention_mask', 'label'])

Why is that the case? I can’t quite see why the map call doesn’t handle the tensor data but is fine with numpy arrays. I’d be super grateful for insights into the inner workings of map!

Best
Simon

lhoestq · September 13, 2022, 12:57pm

Hi! I think this is an issue in datasets; let me open a PR to fix it.

In the meantime, you need to make sure that every value in the returned batch dict is a list or a numpy array, not a torch tensor.
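A minimal sketch of that workaround, in case it helps. The helper name to_map_friendly is hypothetical (it is not part of datasets), and it assumes a torch tensor can be recognised by its detach(), cpu() and numpy() methods, which keeps the snippet free of imports:

```python
def to_map_friendly(batch):
    """Return a copy of `batch` with tensor-like values converted to NumPy arrays.

    Hypothetical helper, not part of the datasets library. A value is treated
    as a torch tensor if it exposes detach(), cpu() and numpy(); this
    duck-typing check is an assumption made for illustration.
    """
    converted = {}
    for name, value in batch.items():
        if all(hasattr(value, attr) for attr in ("detach", "cpu", "numpy")):
            # Move to CPU and drop autograd tracking before converting.
            value = value.detach().cpu().numpy()
        # Lists and NumPy arrays are passed through unchanged.
        converted[name] = value
    return converted
```

In the forward_pass from the original post, this would mean ending the function with return to_map_friendly(batch) instead of return batch, so that the input_ids, attention_mask and label columns come back as numpy arrays alongside the logits.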

lhoestq · September 13, 2022, 1:16pm

The pull request is open: "Fix map batched with torch output + test" (Pull Request #4972, huggingface/datasets on GitHub).

simonschoe · September 13, 2022, 2:33pm

Great, thanks for the quick reply and the PR. The workaround of converting back to numpy works just fine; however, the fix will make the whole workflow more convenient and robust IMO.
