Prepare func failed when mapped on audio dataset

MidSummersEve · July 16, 2022, 12:19am

Greetings,

I am following this notebook on fine-tuning a Wav2Vec model.

The only difference is that the notebook uses a prepared dataset, while I am using my own data.

I’ve made sure my dataset has the same layout as in the notebook:

Mine:

The error occurs when I try to map the fairly standard prepare function over the dataset.

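    def prepare_dataset(batch):
        audio = batch["audio"]
        print(audio)  # debug: show what the audio entry contains

        # Extract model input values from the raw waveform
        batch["input_values"] = processor(audio["array"], sampling_rate=audio["sampling_rate"]).input_values[0]
        batch["input_length"] = len(batch["input_values"])

        # Encode the transcript (column "UTT") into label ids
        with processor.as_target_processor():
            batch["labels"] = processor(batch["UTT"]).input_ids
        return batch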

mapping:

audio_dataset = audio_dataset.map(prepare_dataset, num_proc=4)

Issue occurred:

{'bytes': None, 'path': '/content/drive/MyDrive/Dissertation/Speech/Ses02M_script01_1_M032.wav'}
{'bytes': None, 'path': '/content/drive/MyDrive/Dissertation/Speech/Ses02M_script01_1_M015.wav'}
{'bytes': None, 'path': '/content/drive/MyDrive/Dissertation/Speech/Ses02F_script01_3_M023.wav'}
RemoteTraceback Traceback (most recent call last)
RemoteTraceback:
"""
Traceback (most recent call last):
  File "/usr/local/lib/python3.7/dist-packages/multiprocess/pool.py", line 121, in worker
    result = (True, func(*args, **kwds))
  File "/usr/local/lib/python3.7/dist-packages/datasets/arrow_dataset.py", line 518, in wrapper
    out: Union["Dataset", "DatasetDict"] = func(self, *args, **kwargs)
  File "/usr/local/lib/python3.7/dist-packages/datasets/arrow_dataset.py", line 485, in wrapper
    out: Union["Dataset", "DatasetDict"] = func(self, *args, **kwargs)
  File "/usr/local/lib/python3.7/dist-packages/datasets/fingerprint.py", line 413, in wrapper
    out = func(self, *args, **kwargs)
  File "/usr/local/lib/python3.7/dist-packages/datasets/arrow_dataset.py", line 2460, in _map_single
    example = apply_function_on_filtered_inputs(example, i, offset=offset)
  File "/usr/local/lib/python3.7/dist-packages/datasets/arrow_dataset.py", line 2367, in apply_function_on_filtered_inputs
    processed_inputs = function(*fn_args, *additional_args, **fn_kwargs)
  File "", line 6, in prepare_dataset
    batch["input_values"] = processor(audio["array"], sampling_rate=audio["sampling_rate"]).input_values[0]
KeyError: 'array'
"""

So the error says there is no 'array' key under the 'audio' entry of my dataset, yet I do have it, as shown above. The output of the print statement I added suggests that, inside map, the program sees only 'bytes' (value None, not loaded) and 'path' (value correct) under the 'audio' entry. Why am I seeing this contradiction? I thought all waveforms would be decoded and resampled as needed during mapping.
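To make the mismatch described above easier to spot, here is a minimal sketch of a guard that tells a decoded audio entry (with 'array' and 'sampling_rate') apart from an undecoded one (only 'bytes' and 'path', as in the printed output). The helper names `is_decoded_audio` and `check_audio_entry` and the example path are hypothetical, introduced only for illustration; this is a diagnostic aid, not a fix for the underlying cause.

```python
def is_decoded_audio(audio):
    """Return True if the entry looks like a decoded audio dict."""
    return isinstance(audio, dict) and "array" in audio and "sampling_rate" in audio


def check_audio_entry(audio):
    """Return the entry unchanged, or raise a descriptive error if it is not decoded."""
    if not is_decoded_audio(audio):
        # Report the keys that are actually present instead of a bare KeyError
        found = sorted(audio) if isinstance(audio, dict) else type(audio).__name__
        raise ValueError(
            f"audio entry is not decoded (found {found}); "
            "expected 'array' and 'sampling_rate'"
        )
    return audio


# Entry shaped like the printed output: only 'bytes' and 'path' (hypothetical path)
undecoded = {"bytes": None, "path": "example.wav"}
# Entry shaped like what prepare_dataset expects (made-up sample values)
decoded = {"path": "example.wav", "array": [0.0, 0.1], "sampling_rate": 16000}

print(is_decoded_audio(undecoded))
print(is_decoded_audio(decoded))
```

Calling `check_audio_entry(batch["audio"])` at the top of `prepare_dataset` would replace the bare `KeyError` with a message listing the keys that are actually present. If the entries turn out to be undecoded, one thing worth checking is whether `audio_dataset.features["audio"]` is a `datasets.Audio` feature; that is an assumption about the cause, not something the traceback confirms.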

My environment:
Google Colab
datasets==1.18.3
transformers==4.17.0
torch==1.12.0+cu113

Totally new to this. Can’t figure out the reason.
Thanks in advance!
