Applying `.map` results in getting `List` type on `input_values`

Hi, I have an audio dataset. Using the `.map` method, I apply a function that reads the audio files from disk, resamples them, and applies `Wav2Vec2FeatureExtractor`, which normalizes the audio and converts it to a torch tensor.

def preprocess_function(samples):
    speech_list = [speech_file_to_array_fn(path) for path in samples[input_column]]
    target_list = [label_to_id(label, label_list) for label in samples[output_column]]

    result = processor(speech_list, sampling_rate=target_sampling_rate, return_tensors='pt')
    result['labels'] = list(target_list)
    return result

eval_dataset = eval_dataset.map(
    preprocess_function,
    num_proc=1,
    batched=True,
    batch_size=1
)

The `result` variable in the preprocess function contains a dict with PyTorch tensors as values. But when I index the dataset after the transformation, `input_values` comes back as a `List`. Is it possible to keep the values as `torch.tensor` instead of having them converted to `List`?

See here… By default, a dataset returns pure Python objects.

Here is one possible approach, but it has other side effects:

eval_dataset = eval_dataset.with_format('torch')
