ValueError: audio at <filename> doesn't have metadata in <path>/metadata.csv

Hi,

I am trying to fine-tune Whisper on a small compilation of corpora of the Czech language. I am basically using https://github.com/huggingface/community-events/blob/main/whisper-fine-tuning-event/run_speech_recognition_seq2seq_streaming.py, modified for AudioFolder by using “audiofolder” as the dataset name and adding the data_dir parameter, roughly as sketched below. However, I get the exception above.
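
Concretely, the modified loading call looks roughly like this (a sketch of my change; the data_dir is my local path):

from datasets import load_dataset

raw_datasets["train"] = load_dataset(
    "audiofolder",
    data_dir="/media/win/temp/data",  # the folder containing metadata.csv
    split="train",
    streaming=True,
)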

It seems that instead of loading the data based on the metadata file, all files in the data directory are loaded and the metadata file is then searched for a corresponding entry. If so, isn’t this problematic performance-wise? And what are my options, besides copying all the desired files to a new location just for loading?

Full stack trace:

0%|          | 0/500 [00:00<?, ?it/s]Traceback (most recent call last):
  File "run.py", line 631, in <module>
    main()
  File "run.py", line 580, in main
    train_result = trainer.train(resume_from_checkpoint=checkpoint)
  File "/home/vojta/speech/whisper/tiny-cs/venv/lib/python3.8/site-packages/transformers/trainer.py", line 1591, in train
    return inner_training_loop(
  File "/home/vojta/speech/whisper/tiny-cs/venv/lib/python3.8/site-packages/transformers/trainer.py", line 1870, in _inner_training_loop
    for step, inputs in enumerate(epoch_iterator):
  File "/home/vojta/speech/whisper/tiny-cs/venv/lib/python3.8/site-packages/accelerate/data_loader.py", line 560, in __iter__
    next_batch, next_batch_info = self._fetch_batches(main_iterator)
  File "/home/vojta/speech/whisper/tiny-cs/venv/lib/python3.8/site-packages/accelerate/data_loader.py", line 523, in _fetch_batches
    batches.append(next(iterator))
  File "/home/vojta/speech/whisper/tiny-cs/venv/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 630, in __next__
    data = self._next_data()
  File "/home/vojta/speech/whisper/tiny-cs/venv/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 674, in _next_data
    data = self._dataset_fetcher.fetch(index)  # may raise StopIteration
  File "/home/vojta/speech/whisper/tiny-cs/venv/lib/python3.8/site-packages/torch/utils/data/_utils/fetch.py", line 32, in fetch
    data.append(next(self.dataset_iter))
  File "/home/vojta/speech/whisper/tiny-cs/venv/lib/python3.8/site-packages/datasets/iterable_dataset.py", line 1379, in __iter__
    for key, example in ex_iterable:
  File "/home/vojta/speech/whisper/tiny-cs/venv/lib/python3.8/site-packages/datasets/iterable_dataset.py", line 862, in __iter__
    yield from self._iter()
  File "/home/vojta/speech/whisper/tiny-cs/venv/lib/python3.8/site-packages/datasets/iterable_dataset.py", line 899, in _iter
    for key, example in iterator:
  File "/home/vojta/speech/whisper/tiny-cs/venv/lib/python3.8/site-packages/datasets/iterable_dataset.py", line 982, in __iter__
    for x in self.ex_iterable:
  File "/home/vojta/speech/whisper/tiny-cs/venv/lib/python3.8/site-packages/datasets/iterable_dataset.py", line 678, in __iter__
    yield from self._iter()
  File "/home/vojta/speech/whisper/tiny-cs/venv/lib/python3.8/site-packages/datasets/iterable_dataset.py", line 740, in _iter
    for key, example in iterator:
  File "/home/vojta/speech/whisper/tiny-cs/venv/lib/python3.8/site-packages/datasets/iterable_dataset.py", line 1114, in __iter__
    for key, example in self.ex_iterable:
  File "/home/vojta/speech/whisper/tiny-cs/venv/lib/python3.8/site-packages/datasets/iterable_dataset.py", line 1020, in __iter__
    yield from islice(self.ex_iterable, self.n, None)
  File "/home/vojta/speech/whisper/tiny-cs/venv/lib/python3.8/site-packages/datasets/iterable_dataset.py", line 233, in __iter__
    yield from self.generate_examples_fn(**self.kwargs)
  File "/home/vojta/speech/whisper/tiny-cs/venv/lib/python3.8/site-packages/datasets/packaged_modules/folder_based_builder/folder_based_builder.py", line 311, in _generate_examples
    raise ValueError(
ValueError: audio at corpora/voxpopuli-train/20090203-0900-PLENARY-11-cs_20090203-19#25#56_4.wav doesn't have metadata in /media/win/temp/data/metadata.csv.

It seems that instead of loading the data based on the metadata file, all files in the data directory are loaded and the metadata file is then searched for a corresponding entry. If so, isn’t this problematic performance-wise?

Indeed, this is what we should have done initially. I’ll try to find some time in the coming weeks to optimize the implementation a bit.

In the meantime, you can modify the script code to use Dataset.from_generator to load the dataset using a generator (to avoid copying the files).

Thank you, I will use from_generator. In the meantime, I have simply replaced the line that raises in https://github.com/huggingface/datasets/blob/main/src/datasets/packaged_modules/folder_based_builder/folder_based_builder.py with continue. The performance is indeed abysmal (I have a lot of files), but training kind of works. However, I got this:

[INFO|trainer.py:1760] 2023-10-12 16:01:32,711 >> ***** Running training *****
[INFO|trainer.py:1761] 2023-10-12 16:01:32,711 >>   Num examples = 8,000
[INFO|trainer.py:1762] 2023-10-12 16:01:32,711 >>   Num Epochs = 9,223,372,036,854,775,807
[...]
[INFO|trainer.py:3213] 2023-10-12 18:15:22,317 >> ***** Running Evaluation *****
[INFO|trainer.py:3217] 2023-10-12 18:15:22,318 >>   Num examples: Unknown

There are more than 1M files listed in the metadata file, and the eval samples were generated by

            raw_datasets["eval"] = data.take(20000)

(these were skipped in the train set)
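
The matching train split, for completeness (take and skip are the standard IterableDataset methods; data here is the loaded streaming dataset):

            raw_datasets["eval"] = data.take(20000)   # first 20,000 examples for evaluation
            raw_datasets["train"] = data.skip(20000)  # the rest goes to training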

Training is indeed very quick, and evaluation takes much more time. Why would the train set be limited to 8,000 examples? I don’t even know where that number comes from. There is the max_train_samples parameter, but I definitely don’t use it, and its default is None. Also, what could have caused the insane number of epochs?

I think it’s better to ask these questions in the huggingface/community-events repo (via an issue).

OK, these are training-related questions.

I have done the loading from a generator. It is simple, but I had a hard time figuring out how Audio works. The documentation describes the parameters its constructor takes, but makes no mention of the encode_example method; I kept trying to pass the file name to the constructor instead.

import os

import datasets
from datasets import Audio, Features, Value


def load_streaming_dataset_from_folder(data_dir):
    def data_generator(data_dir):
        audio_encoder = Audio(sampling_rate=16000)
        with open(os.path.join(data_dir, 'metadata.csv'), 'r', encoding='utf-8') as f:
            for line in f:
                # Skip the CSV header line.
                if line.startswith('file_name'):
                    continue
                # Split only on the first comma: the transcription may contain commas.
                file_name, transcription = line.rstrip('\n').split(',', 1)
                full_file_name = os.path.join(data_dir, file_name)
                audio = audio_encoder.encode_example(value=full_file_name)
                yield {
                    "audio": audio,
                    "transcription": transcription,
                }

    ds = datasets.IterableDataset.from_generator(
        data_generator,
        features=Features({
            "audio": Audio(sampling_rate=16000),
            "transcription": Value(dtype='string'),
        }),
        gen_kwargs={'data_dir': data_dir},
    )
    return ds

Audio also supports audio paths 🙂, so the cleanest solution is to replace audio = audio_encoder.encode_example(value=full_file_name) with audio = full_file_name, and to specify the features as Features({"audio": Audio(sampling_rate=16000), "transcription": Value('string')})
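
Put together, the generator then becomes roughly this (a sketch with the two replacements applied, otherwise the same function as above):

import os

import datasets
from datasets import Audio, Features, Value


def load_streaming_dataset_from_folder(data_dir):
    def data_generator(data_dir):
        with open(os.path.join(data_dir, 'metadata.csv'), 'r', encoding='utf-8') as f:
            for line in f:
                if line.startswith('file_name'):
                    continue
                file_name, transcription = line.rstrip('\n').split(',', 1)
                # Yield the path directly; the Audio feature decodes it lazily.
                yield {
                    "audio": os.path.join(data_dir, file_name),
                    "transcription": transcription,
                }

    return datasets.IterableDataset.from_generator(
        data_generator,
        features=Features({
            "audio": Audio(sampling_rate=16000),
            "transcription": Value('string'),
        }),
        gen_kwargs={'data_dir': data_dir},
    )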

Hi,

thanks, I tried it, but for some reason performance was much worse when passing only the path as the audio value. Feature computation took longer, which makes sense, but the overall run also took much longer.

Now I need to get the audio bytes from the audio samples, and I am stuck. I don’t understand how the audio data is stored. There is an array variable, supposedly an Arrow array, but it is always zero during my tests. Is there a method I could call on the audio to get bytes, or how else could I do it?
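
For context, this is roughly what I am trying; the soundfile re-encoding at the end is just my own guess at how to turn the decoded array back into bytes:

import io

import soundfile as sf

sample = next(iter(ds))
audio = sample["audio"]  # after decoding: a dict with "path", "array" and "sampling_rate"

# Guess: re-encode the decoded waveform into WAV bytes in memory.
buf = io.BytesIO()
sf.write(buf, audio["array"], audio["sampling_rate"], format="WAV")
wav_bytes = buf.getvalue()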