Greetings,
I am following this notebook on fine-tuning a Wav2Vec model.
The only difference is that the notebook uses a prepared dataset, while I am using my own data.
I’ve made sure my dataset has the same layout as in the notebook:
Mine:
The error occurs when I try to map the rather standard prepare function over the dataset.
def prepare_dataset(batch):
    # Take the audio entry of the current example
    audio = batch["audio"]
    print(audio)
    batch["input_values"] = processor(audio["array"], sampling_rate=audio["sampling_rate"]).input_values[0]
    batch["input_length"] = len(batch["input_values"])
    # Encode the transcript (column "UTT") as label ids
    with processor.as_target_processor():
        batch["labels"] = processor(batch["UTT"]).input_ids
    return batch
Mapping:
audio_dataset = audio_dataset.map(prepare_dataset, num_proc=4)
Issue occurred:
{'bytes': None, 'path': '/content/drive/MyDrive/Dissertation/Speech/Ses02M_script01_1_M032.wav'}
{'bytes': None, 'path': '/content/drive/MyDrive/Dissertation/Speech/Ses02M_script01_1_M015.wav'}
{'bytes': None, 'path': '/content/drive/MyDrive/Dissertation/Speech/Ses02F_script01_3_M023.wav'}
RemoteTraceback Traceback (most recent call last)
RemoteTraceback:
"""
Traceback (most recent call last):
  File "/usr/local/lib/python3.7/dist-packages/multiprocess/pool.py", line 121, in worker
    result = (True, func(*args, **kwds))
  File "/usr/local/lib/python3.7/dist-packages/datasets/arrow_dataset.py", line 518, in wrapper
    out: Union["Dataset", "DatasetDict"] = func(self, *args, **kwargs)
  File "/usr/local/lib/python3.7/dist-packages/datasets/arrow_dataset.py", line 485, in wrapper
    out: Union["Dataset", "DatasetDict"] = func(self, *args, **kwargs)
  File "/usr/local/lib/python3.7/dist-packages/datasets/fingerprint.py", line 413, in wrapper
    out = func(self, *args, **kwargs)
  File "/usr/local/lib/python3.7/dist-packages/datasets/arrow_dataset.py", line 2460, in _map_single
    example = apply_function_on_filtered_inputs(example, i, offset=offset)
  File "/usr/local/lib/python3.7/dist-packages/datasets/arrow_dataset.py", line 2367, in apply_function_on_filtered_inputs
    processed_inputs = function(*fn_args, *additional_args, **fn_kwargs)
  File "", line 6, in prepare_dataset
    batch["input_values"] = processor(audio["array"], sampling_rate=audio["sampling_rate"]).input_values[0]
KeyError: 'array'
"""
So the error says there is no 'array' key under the 'audio' entry of my dataset, but I do have it, as shown above. The output of the print call I added suggests that, inside the mapped function, the 'audio' entry contained only 'bytes' (value None, not loaded) and 'path' (value correct). Why am I seeing this contradiction? I thought all waveforms would be loaded and resampled accordingly during mapping.
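To make the contradiction concrete, here is a minimal sketch of a check that tells the two shapes of the 'audio' entry apart. It uses plain Python only; `audio_entry_state` is a hypothetical helper (not part of datasets), and the sample values and the 16000 Hz sampling rate in `expected` are placeholder assumptions rather than values from my data.

```python
def audio_entry_state(audio):
    """Classify one value from the 'audio' column by the keys it carries."""
    keys = set(audio)
    if {"array", "sampling_rate"} <= keys:
        # Waveform has been loaded: this is what prepare_dataset needs.
        return "decoded"
    if keys <= {"bytes", "path"}:
        # Only the storage form is present: no waveform has been loaded.
        return "undecoded"
    return "unknown"


# The entry printed inside prepare_dataset (copied from the output above).
seen_in_map = {
    "bytes": None,
    "path": "/content/drive/MyDrive/Dissertation/Speech/Ses02M_script01_1_M032.wav",
}

# The shape I expected to receive; values here are placeholders.
expected = {"path": "example.wav", "array": [0.0, 0.0], "sampling_rate": 16000}

print(audio_entry_state(seen_in_map))
print(audio_entry_state(expected))
```

Calling such a check at the top of the mapped function would at least turn the bare KeyError into an explicit statement of which form the entry arrived in.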
My environment:
Google Colab
datasets==1.18.3
transformers==4.17.0
torch==1.12.0+cu113
I'm totally new to this and can't figure out the reason.
Thanks in advance!