The datasets.map() method doesn't keep the tensor format set by the tokenizer

The map() method of a Dataset does not retain the tensor type selected with the tokenizer's `return_tensors` argument.

Code:

from transformers import AutoTokenizer
from datasets import Dataset
data = {
    "text":[
        "This is a test"
    ]
}
dataset = Dataset.from_dict(data)

model_name = 'roberta-large-mnli'
tokenizer = AutoTokenizer.from_pretrained(model_name, problem_type="multi_label_classification")

def tokenize_function(examples):
    return tokenizer(examples["text"], truncation=True, return_tensors="pt")

tokenized_datasets = dataset.map(tokenize_function, remove_columns=["text"], batched=True)
tokenized_datasets[0]

Output:

{'input_ids': [0, 713, 16, 10, 1296, 2], 'attention_mask': [1, 1, 1, 1, 1, 1]}

However, calling the tokenizer directly on the text does keep it.

Code:

from transformers import AutoTokenizer
model_name = 'roberta-large-mnli'
tokenizer = AutoTokenizer.from_pretrained(model_name, problem_type="multi_label_classification")
tokenizer("This is a test", truncation=True, return_tensors="pt")

Output:

{'input_ids': tensor([[   0,  713,   16,   10, 1296,    2]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1]])}

Why is this the case?

The Dataset object stores its data in an Apache Arrow table, an efficient columnar data format.
When map() writes the tokenizer's output into that table, your PyTorch tensors are converted to Arrow arrays, so indexing the dataset gives you back plain Python lists.
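
You can verify this on the `tokenized_datasets` object built above: the stored values are plain lists, not tensors. Since the conversion happens regardless, `return_tensors="pt"` has no effect on the stored result inside a function passed to map() and can simply be dropped there.

type(tokenized_datasets[0]["input_ids"])
# <class 'list'>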

To get tensors back, you can set the output format of the dataset to "torch":

tokenized_datasets = tokenized_datasets.with_format("torch")
tokenized_datasets[0]
# {'input_ids': tensor([   0,  713,   16,   10, 1296,    2]), 'attention_mask': tensor([1, 1, 1, 1, 1, 1])}
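
Note that indexing a single example here returns 1-D tensors, without the batch dimension you get when calling the tokenizer directly with return_tensors="pt".

If you'd rather change the format in place instead of creating a new dataset object, set_format() does the same thing. A minimal sketch, continuing from the dataset above:

tokenized_datasets.set_format("torch")
tokenized_datasets[0]
# {'input_ids': tensor([   0,  713,   16,   10, 1296,    2]), 'attention_mask': tensor([1, 1, 1, 1, 1, 1])}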