Applying `.map` results in getting `List` type on `input_values`

ifedorov · November 9, 2023, 5:19am

Hi, I have an audio dataset. Using the .map method, I apply a function that reads the audio files from disk, resamples them, and applies Wav2Vec2FeatureExtractor, which normalizes the audio and converts it to a torch tensor.

def preprocess_function(samples):
    # Read and resample each audio file, and map each label to its integer id.
    speech_list = [speech_file_to_array_fn(path) for path in samples[input_column]]
    target_list = [label_to_id(label, label_list) for label in samples[output_column]]

    # The feature extractor normalizes the audio and returns PyTorch tensors.
    result = processor(speech_list, sampling_rate=target_sampling_rate, return_tensors='pt')
    result['labels'] = list(target_list)
    return result

eval_dataset = eval_dataset.map(
    preprocess_function,
    num_proc=1,
    batched=True,
    batch_size=1
)

The result variable in the preprocess function contains a dict with PyTorch tensors as values. But when I index the dataset after the transformation, input_values comes back as a List. Is it possible to keep the values as torch.tensor instead of having them converted to List?

panigrah · November 9, 2023, 6:02am

See the issue linked here: a dataset returns pure Python objects.

github.com/huggingface/datasets

Dataset.map() turns tensors into lists?

opened 11:43AM - 03 Dec 20 UTC by tombosc · closed 12:12PM - 05 Oct 22 UTC

I apply `Dataset.map()` to a function that returns a dict of torch tensors (like… a tokenizer from the repo transformers). However, in the mapped dataset, these tensors have turned to lists!

```
import datasets
import torch
from datasets import load_dataset

print("version datasets", datasets.__version__)

dataset = load_dataset("snli", split='train[0:50]')

def tokenizer_fn(example):
    # actually uses a tokenizer which does something like:
    return {'input_ids': torch.tensor([[0, 1, 2]])}

print("First item in dataset:\n", dataset[0])
tokenized = tokenizer_fn(dataset[0])
print("Tokenized hyp:\n", tokenized)
dataset_tok = dataset.map(tokenizer_fn, batched=False, remove_columns=['label', 'premise', 'hypothesis'])
print("Tokenized using map:\n", dataset_tok[0])
print(type(tokenized['input_ids']), type(dataset_tok[0]['input_ids']))
```

The output is:

```
version datasets 1.1.3
Reusing dataset snli (/home/tom/.cache/huggingface/datasets/snli/plain_text/1.0.0/bb1102591c6230bd78813e229d5dd4c7fbf4fc478cec28f298761eb69e5b537c)
First item in dataset:
 {'premise': 'A person on a horse jumps over a broken down airplane.', 'hypothesis': 'A person is training his horse for a competition.', 'label': 1}
Tokenized hyp:
 {'input_ids': tensor([[0, 1, 2]])}
Loading cached processed dataset at /home/tom/.cache/huggingface/datasets/snli/plain_text/1.0.0/bb1102591c6230bd78813e229d5dd4c7fbf4fc478cec28f298761eb69e5b537c/cache-fe38f449fe9ac46f.arrow
Tokenized using map:
 {'input_ids': [[0, 1, 2]]}
<class 'torch.Tensor'> <class 'list'>
```

Or am I doing something wrong?

Here is one possible approach, but it has other side effects. Since you want torch tensors, use the 'torch' format:

eval_dataset = eval_dataset.with_format('torch')
