When I use `map` on a `Dataset` object, the `input_ids` come back as lists instead of tensors:
return tokenizer(batch['text'], padding=True, return_tensors='pt', truncation=True).to(DEVICE)

data.map(tokenize, batched=False, batch_size=None) → returns lists
tokenizer(data['text'], padding=True, return_tensors='pt', truncation=True).to(DEVICE) → returns tensors
`map` ignores tensor formatting while writing its cache file, so to get PyTorch tensors under the `input_ids` column you need to explicitly call `set_format("pt", columns=["input_ids"], output_all_columns=True)` on the dataset object (after the `map` call).
I got burned by this as well. Maybe the documentation (examples) for map could be updated to explain this?
I got burned by this too! It definitely seems like an issue in the library that `.map` doesn't return the objects that the mapped function returns. And I'd just finished the course, which has a whole section about bugs caused by not setting `return_tensors`, yet the examples never mention this issue with `map`.
Hi! When it comes to tensors, PyArrow (the storage format we use) only understands 1D arrays, so we would have to store (potentially) a significant amount of metadata to be able to fully restore the types after `map`. Also, a `map` transform can return different value types for the same column (e.g., PyTorch tensors or Python lists), which would make this process ambiguous (e.g., should we return mixed types or only one type when indexing the dataset afterwards?). So for the sake of simplicity, we return Python objects by default, and for other formats one can use `set_format`/`with_format`.
I ended up using a data collator to take the lists and turn them back into tensors.
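That works too; `transformers` ships `DataCollatorWithPadding` for exactly this, but a hand-rolled `collate_fn` shows the idea (the example ids and the `collate_pad` helper below are hypothetical):

```python
import torch
from torch.utils.data import DataLoader

# Variable-length lists, as map stores them in the dataset.
examples = [{"input_ids": [101, 2023, 102]},
            {"input_ids": [101, 2003, 1037, 102]}]

def collate_pad(batch, pad_id=0):
    # Pad every list to the batch max length, then stack into one tensor.
    max_len = max(len(ex["input_ids"]) for ex in batch)
    ids = [ex["input_ids"] + [pad_id] * (max_len - len(ex["input_ids"]))
           for ex in batch]
    return {"input_ids": torch.tensor(ids)}

loader = DataLoader(examples, batch_size=2, collate_fn=collate_pad)
batch = next(iter(loader))
print(batch["input_ids"].shape)  # torch.Size([2, 4])
```

Padding per batch like this (instead of padding the whole dataset inside `map`) also keeps each batch only as wide as its longest example.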