map() with a tokenizer does not return PyTorch tensors

You can get Torch tensors by changing the format to Torch with .set_format("torch") (or .set_format("torch", device="cuda") to put them on the GPU).

map caches results into an Arrow file, and Arrow doesn’t understand Torch tensors, so this would require storing additional metadata for each example to recover the initial type for decoding later. We decided not to keep this metadata for simplicity and instead support changing the default (Arrow → Python) decoding with set_format/set_transform (docs).
