Hi,
I am trying to fine-tune Whisper on a small compilation of Czech-language corpora. I am basically using https://github.com/huggingface/community-events/blob/main/whisper-fine-tuning-event/run_speech_recognition_seq2seq_streaming.py, modified for AudioFolder by passing “audiofolder” as the dataset name and adding the data_dir parameter. However, training fails with a ValueError saying that an audio file doesn't have metadata in metadata.csv (full stack trace below).
It seems that instead of loading the data based on the metadata file, the loader picks up every file in the data directory and then searches the metadata file for a corresponding entry. If so, isn’t this problematic performance-wise? And what are my options besides copying all desired files to a new location just for loading?
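One option that avoids copying might be to build a directory of symbolic links containing only the files listed in the metadata file, and point data_dir at that instead. Below is a minimal sketch using only the standard library. The function name `link_listed_files` and the `view_dir` argument are hypothetical. It assumes that metadata.csv has a `file_name` column with paths relative to the data directory, that the target filesystem supports symlinks, and that the loader follows them; I have not verified the last point.

```python
import csv
import os
import shutil
from pathlib import Path


def link_listed_files(data_dir, view_dir, metadata_name="metadata.csv"):
    """Create symlinks in view_dir for every file listed in the metadata file.

    Returns the number of links created. Hypothetical helper, not part of
    the datasets library.
    """
    data_dir = Path(data_dir).resolve()
    view_dir = Path(view_dir)
    view_dir.mkdir(parents=True, exist_ok=True)

    # Assumption: the metadata file has a "file_name" column holding
    # paths relative to data_dir.
    with open(data_dir / metadata_name, newline="", encoding="utf-8") as f:
        rows = list(csv.DictReader(f))

    linked = 0
    for row in rows:
        source = data_dir / row["file_name"]
        target = view_dir / row["file_name"]
        target.parent.mkdir(parents=True, exist_ok=True)
        # Skip entries that were already linked on a previous run.
        if target.is_symlink() or target.exists():
            continue
        os.symlink(source, target)
        linked += 1

    # The metadata file is small, so it is copied rather than linked;
    # its relative file_name entries remain valid inside view_dir.
    shutil.copyfile(data_dir / metadata_name, view_dir / metadata_name)
    return linked
```

Only the small metadata file is duplicated; the audio itself stays where it is. Files that exist in the data directory but are not listed in the metadata never appear in `view_dir`.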
Full stack trace:
0%| | 0/500 [00:00<?, ?it/s]
Traceback (most recent call last):
File "run.py", line 631, in <module>
main()
File "run.py", line 580, in main
train_result = trainer.train(resume_from_checkpoint=checkpoint)
File "/home/vojta/speech/whisper/tiny-cs/venv/lib/python3.8/site-packages/transformers/trainer.py", line 1591, in train
return inner_training_loop(
File "/home/vojta/speech/whisper/tiny-cs/venv/lib/python3.8/site-packages/transformers/trainer.py", line 1870, in _inner_training_loop
for step, inputs in enumerate(epoch_iterator):
File "/home/vojta/speech/whisper/tiny-cs/venv/lib/python3.8/site-packages/accelerate/data_loader.py", line 560, in __iter__
next_batch, next_batch_info = self._fetch_batches(main_iterator)
File "/home/vojta/speech/whisper/tiny-cs/venv/lib/python3.8/site-packages/accelerate/data_loader.py", line 523, in _fetch_batches
batches.append(next(iterator))
File "/home/vojta/speech/whisper/tiny-cs/venv/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 630, in __next__
data = self._next_data()
File "/home/vojta/speech/whisper/tiny-cs/venv/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 674, in _next_data
data = self._dataset_fetcher.fetch(index) # may raise StopIteration
File "/home/vojta/speech/whisper/tiny-cs/venv/lib/python3.8/site-packages/torch/utils/data/_utils/fetch.py", line 32, in fetch
data.append(next(self.dataset_iter))
File "/home/vojta/speech/whisper/tiny-cs/venv/lib/python3.8/site-packages/datasets/iterable_dataset.py", line 1379, in __iter__
for key, example in ex_iterable:
File "/home/vojta/speech/whisper/tiny-cs/venv/lib/python3.8/site-packages/datasets/iterable_dataset.py", line 862, in __iter__
yield from self._iter()
File "/home/vojta/speech/whisper/tiny-cs/venv/lib/python3.8/site-packages/datasets/iterable_dataset.py", line 899, in _iter
for key, example in iterator:
File "/home/vojta/speech/whisper/tiny-cs/venv/lib/python3.8/site-packages/datasets/iterable_dataset.py", line 982, in __iter__
for x in self.ex_iterable:
File "/home/vojta/speech/whisper/tiny-cs/venv/lib/python3.8/site-packages/datasets/iterable_dataset.py", line 678, in __iter__
yield from self._iter()
File "/home/vojta/speech/whisper/tiny-cs/venv/lib/python3.8/site-packages/datasets/iterable_dataset.py", line 740, in _iter
for key, example in iterator:
File "/home/vojta/speech/whisper/tiny-cs/venv/lib/python3.8/site-packages/datasets/iterable_dataset.py", line 1114, in __iter__
for key, example in self.ex_iterable:
File "/home/vojta/speech/whisper/tiny-cs/venv/lib/python3.8/site-packages/datasets/iterable_dataset.py", line 1020, in __iter__
yield from islice(self.ex_iterable, self.n, None)
File "/home/vojta/speech/whisper/tiny-cs/venv/lib/python3.8/site-packages/datasets/iterable_dataset.py", line 233, in __iter__
yield from self.generate_examples_fn(**self.kwargs)
File "/home/vojta/speech/whisper/tiny-cs/venv/lib/python3.8/site-packages/datasets/packaged_modules/folder_based_builder/folder_based_builder.py", line 311, in _generate_examples
raise ValueError(
ValueError: audio at corpora/voxpopuli-train/20090203-0900-PLENARY-11-cs_20090203-19#25#56_4.wav doesn't have metadata in /media/win/temp/data/metadata.csv.
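The error names only the first file without a metadata entry. To see how many files are affected, a rough diagnostic along these lines could list all of them. This is a sketch using only the standard library, not anything from the datasets package. The function name `files_without_metadata` and the extension set are hypothetical, and it assumes metadata.csv has a `file_name` column with paths relative to the data directory.

```python
import csv
from pathlib import Path

# Assumption: only these extensions count as audio in this sketch.
AUDIO_EXTENSIONS = {".wav", ".flac", ".mp3"}


def files_without_metadata(data_dir, metadata_name="metadata.csv"):
    """Return relative paths of audio files that have no metadata row."""
    data_dir = Path(data_dir)

    # Collect every path listed in the metadata file, normalized to
    # forward slashes so it compares cleanly with the paths found on disk.
    with open(data_dir / metadata_name, newline="", encoding="utf-8") as f:
        listed = {Path(row["file_name"]).as_posix() for row in csv.DictReader(f)}

    missing = []
    for path in sorted(data_dir.rglob("*")):
        if path.is_file() and path.suffix.lower() in AUDIO_EXTENSIONS:
            rel = path.relative_to(data_dir).as_posix()
            if rel not in listed:
                missing.append(rel)
    return missing
```

Calling `files_without_metadata("/media/win/temp/data")` would return the unlisted files as a list, which can then be added to the metadata file or kept out of the directory passed as data_dir.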