Missing ['array'] when using map() method

Hello everyone. When I use the map() method to modify the data, it raises KeyError: 'array'.
But I’m sure my dataset includes this column.

print(ds_train)
print(ds_train[1]['audio']['array'])

And the output is:

Dataset({
    features: ['audio', 'gender'],
    num_rows: 16960
})
tensor([-8.2690e-14, -7.3000e-13,  1.5195e-13,  ...,  8.2001e-07,
         9.9102e-07, -3.9292e-07], device='cuda:0')

My code is:

from transformers import Wav2Vec2FeatureExtractor

feature_extractor = Wav2Vec2FeatureExtractor.from_pretrained("facebook/wav2vec2-base-960h")

def preprocess_function(examples):
    print(examples)  # debug: show what map() actually passes in
    # Collect the raw waveform of every audio item in the batch
    audio_arrays = [x["array"] for x in examples["audio"]]
    inputs = feature_extractor(
        audio_arrays, sampling_rate=feature_extractor.sampling_rate, max_length=16000,
        truncation=True, padding=True,
    )
    return inputs

ds_train = ds_train.map(preprocess_function, remove_columns="audio", batched=True)

I’m not sure what happened. Thank you in advance.

I’m seeing this behavior too. The array is there when I index into the dataset manually, but inside the map() callback function the audio entries contain only {'bytes': None, 'path': './data/xyz.mp3'}.

It only seems to happen if I call dataset.set_format(type='torch') before map(). If I don’t call it before map(), then it works as expected.
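In case it helps, here is a minimal sketch of that workaround. The helper name map_with_plain_format is made up for illustration, and it rests on two assumptions: that resetting the format has the same effect as never calling set_format() before map(), and that the torch format is only needed after preprocessing. It relies only on the Dataset methods reset_format(), map() and set_format().

```python
def map_with_plain_format(dataset, fn, **map_kwargs):
    """Run dataset.map(fn) in the default Python format, then switch to torch.

    `dataset` is expected to be a datasets.Dataset; the helper name is hypothetical.
    """
    # Undo an earlier set_format(type="torch"). Note: this changes `dataset` in place.
    dataset.reset_format()
    mapped = dataset.map(fn, **map_kwargs)
    # Assumption: tensors are only needed after preprocessing, so re-apply the format here.
    mapped.set_format(type="torch")
    return mapped
```

With the code from the question, the call would look like ds_train = map_with_plain_format(ds_train, preprocess_function, remove_columns="audio", batched=True).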

That works! Thank you!