TypeError when applying map after set_format(type='torch')

Hi there,

I am trying to run a simple forward_pass function on my dataset, but the map call below raises a TypeError:

dataset['train'].set_format(type='torch', columns=['input_ids', 'attention_mask', 'label'])

def forward_pass(batch):
    input_ids = torch.tensor(batch['input_ids']).to(device)
    attention_mask = torch.tensor(batch['attention_mask']).to(device)
    with torch.no_grad():
        batch['logits'] = model(input_ids, attention_mask)['logits'].cpu().numpy()
    return batch

dataset['train'].map(forward_pass, batched=True, batch_size=16)

TypeError: Provided `function` which is applied to all elements of table returns a `dict` of types [<class 'torch.Tensor'>, <class 'torch.Tensor'>, <class 'torch.Tensor'>, <class 'numpy.ndarray'>]. When using `batched=True`, make sure provided `function` returns a `dict` of types like `(<class 'list'>, <class 'numpy.ndarray'>)`.

The error does not occur when I convert to numpy instead of torch:

dataset['train'].set_format(type='numpy', columns=['input_ids', 'attention_mask', 'label'])

Why is that the case? I couldn't quite wrap my head around why the map call doesn't handle the tensor data but is fine with the numpy arrays. I'd be super grateful for insights into the inner workings of map!

Best
Simon

Hi! I think this is an issue in datasets; let me open a PR to fix it.

In the meantime, make sure every value in the returned batch dict is a list or a numpy array rather than a torch tensor.
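
As a rough illustration of that workaround, here is a minimal sketch of a helper that converts tensor-like values to plain lists before the batch is returned. The name to_map_friendly is hypothetical and not part of datasets. The check is duck-typed (anything with detach() and tolist() is treated as a tensor) so that the sketch itself does not need to import torch.

```python
def to_map_friendly(batch):
    """Return a copy of `batch` with tensor-like values converted to plain lists.

    Hypothetical helper for illustration only; it is not part of `datasets`.
    """
    converted = {}
    for key, value in batch.items():
        # Duck-typed check (an assumption of this sketch): torch.Tensor
        # exposes detach(), cpu() and tolist(); lists and numpy arrays
        # do not have detach(), so they pass through unchanged.
        if hasattr(value, "detach") and hasattr(value, "tolist"):
            value = value.detach().cpu().tolist()
        converted[key] = value
    return converted
```

With this in place, forward_pass could end with "return to_map_friendly(batch)" instead of "return batch", so that input_ids, attention_mask and label go back to map as lists.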

A pull request is open: "Fix map batched with torch output + test" by lhoestq (Pull Request #4972, huggingface/datasets on GitHub).


Great, thanks for the quick reply and the PR. The workaround of converting back to numpy works just fine; however, the fix will make the whole workflow more convenient and robust IMO.