Dataset.map returns lists instead of torch tensors

When I use map on a Dataset object, the 'input_ids' column is returned as a list instead of a tensor.

def tokenize(batch):
  return tokenizer(batch['text'],padding=True, return_tensors='pt', truncation=True).to(DEVICE)

data.map(tokenize, batched=False, batch_size=None) → returns a list

tokenizer(data['text'], padding=True, return_tensors='pt', truncation=True).to(DEVICE) → returns a tensor

Hi! map ignores tensor formatting while writing a cache file, so to get PyTorch tensors under the input_ids column, you need to explicitly call set_format("pt", columns=["input_ids"], output_all_columns=True) on the dataset object (after map).

I got burned by this as well. Maybe the documentation (examples) for map could be updated to explain this?
