The map() method of a Dataset does not retain the tensors requested via the return_tensors argument.
Code:
from transformers import AutoTokenizer
from datasets import Dataset

data = {"text": ["This is a test"]}
dataset = Dataset.from_dict(data)

model_name = "roberta-large-mnli"
tokenizer = AutoTokenizer.from_pretrained(model_name, problem_type="multi_label_classification")

def tokenize_function(examples):
    # explicitly ask for PyTorch tensors
    return tokenizer(examples["text"], truncation=True, return_tensors="pt")

tokenized_datasets = dataset.map(tokenize_function, remove_columns=["text"], batched=True)
tokenized_datasets[0]
Output:
{'input_ids': [0, 713, 16, 10, 1296, 2], 'attention_mask': [1, 1, 1, 1, 1, 1]}
However, calling the tokenizer directly on the text does return tensors.
Code:
from transformers import AutoTokenizer

model_name = "roberta-large-mnli"
tokenizer = AutoTokenizer.from_pretrained(model_name, problem_type="multi_label_classification")
tokenizer("This is a test", truncation=True, return_tensors="pt")
Output:
{'input_ids': tensor([[ 0, 713, 16, 10, 1296, 2]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1]])}
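For what it's worth, I can get tensors back after the fact by setting the dataset format (a sketch, assuming set_format just converts the stored lists to tensors when rows are accessed):
Code:
# convert the stored columns to torch tensors on access
tokenized_datasets.set_format("torch", columns=["input_ids", "attention_mask"])
tokenized_datasets[0]  # now returns torch tensors
But that conversion only happens on access, so the tensors produced inside tokenize_function still seem to be discarded by map().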
Why is this the case?