Hi! map ignores tensor formatting while writing a cache file, so to get PyTorch tensors under the input_ids column, you need to explicitly call set_format("pt", columns=["input_ids"], output_all_columns=True) on the dataset object (after map).
Hi! When it comes to tensors, PyArrow (the storage backend we use) only understands 1D arrays, so we would have to store a potentially significant amount of metadata to fully restore the types after map. Also, a map transform can return different value types for the same column (e.g. PyTorch tensors or Python lists), which would make this process ambiguous (e.g. should we return mixed types or only one type when indexing the dataset afterwards?). So for the sake of simplicity, we return Python objects by default, and for other formats, one can use with_format/with_transform.