Hi! map ignores tensor formatting while writing a cache file, so to get PyTorch tensors under the input_ids column, you need to explicitly call set_format("pt", columns=["input_ids"], output_all_columns=True) on the dataset object (after map).
I got burned by this too! Definitely seems like an issue in the library that .map doesn’t return the objects that the mapped function returns.
And I’d just finished the course, which has a whole section about bugs caused by not setting return_tensors, but the examples never mention this issue with .map.
Hi! When it comes to tensors, PyArrow (the storage format we use) only understands 1D arrays, so we would have to store a potentially significant amount of metadata to be able to fully restore the types after map. Also, a map transform can return different value types for the same column (e.g. PyTorch tensors or Python lists), which would make this process ambiguous (e.g. should we return mixed types or only one type when indexing the dataset afterwards?). So for the sake of simplicity, we return Python objects by default, and for other formats, one can use with_format/with_transform.
Yep, I also got burned by this. What is the point of returning a list of lists when it’s a torch tensor? Inside the map function it’s a tensor, and then afterwards it’s a list?