ValueError: audio at <filename> doesn't have metadata in <path>/metadata.csv


I am trying to fine-tune Whisper on a small compilation of Czech-language corpora. I am basically using a script modified for AudioFolder: I pass “audiofolder” as the dataset name and add the data_dir parameter. However, I get the above exception.

It seems that instead of loading the data based on the metadata file, all data from the data directory is loaded and then the metadata file is searched for the corresponding entry. If so, isn’t this problematic performance-wise? And what are my options besides copying all desired files to a new location just for loading?

Full stack trace:

ValueError: audio at corpora/voxpopuli-train/20090203-0900-PLENARY-11-cs_20090203-19#25#56_4.wav doesn't have metadata in /media/win/temp/data/metadata.csv.

It seems that instead of loading the data based on the metadata file, all data from the data directory is loaded and then the metadata file is searched for the corresponding entry. If so, isn’t this problematic performance-wise?

Indeed, this is what we should have done initially. I’ll try to find some time in the coming weeks to optimize the implementation a bit.

In the meantime, you can modify the script code to use Dataset.from_generator to load the dataset using a generator (to avoid copying the files).

Thank you, I will use from_generator. In the meantime, I have just replaced the line that raises the exception with continue. While the performance is indeed abysmal (I have a lot of files), training kind of works, but I got this:

[INFO|] 2023-10-12 16:01:32,711 >> ***** Running training *****
[INFO|] 2023-10-12 16:01:32,711 >>   Num examples = 8,000
[INFO|] 2023-10-12 16:01:32,711 >>   Num Epochs = 9,223,372,036,854,775,807
[INFO|] 2023-10-12 18:15:22,317 >> ***** Running Evaluation *****
[INFO|] 2023-10-12 18:15:22,318 >>   Num examples: Unknown

There are more than 1M files in the metadata file and eval samples were generated by

            raw_datasets["eval"] = data.take(20000)

(these were skipped in the train set)

Training is indeed very quick and evaluation takes much more time. Why would the train set be limited to 8000? I don’t even know where the number comes from. There is the max_train_samples parameter, but I definitely don’t use it and default is None. Also, what could have caused the insane number of epochs?
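
A possible explanation, offered as an assumption about how the transformers Trainer handles a dataset without a known length (such as a streaming or generator-based one): it cannot count epochs, so it logs sys.maxsize as the epoch count and reports the number of examples as max_steps multiplied by the total batch size. The sketch below only reproduces that arithmetic. The function name examples_seen and the values 500 and 16 are hypothetical, chosen to show how 8,000 could arise; they are not taken from the actual run.

```python
import sys

# On a 64-bit CPython build, sys.maxsize is 2**63 - 1, which is the
# "Num Epochs" value seen in the log.
print(f"{sys.maxsize:,}")


def examples_seen(max_steps, per_device_batch_size, num_devices=1, grad_accum_steps=1):
    """Number of examples consumed when training is bounded by max_steps."""
    return max_steps * per_device_batch_size * num_devices * grad_accum_steps


# Hypothetical settings: 500 steps with a batch size of 16 on one device.
print(examples_seen(max_steps=500, per_device_batch_size=16))  # 8000
```

If this is what happened, the 8,000 is not a cap on the dataset but simply the number of examples that max_steps allows the trainer to consume.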

I think it’s better to ask these questions in the huggingface/community-events repo (via an issue).

OK, these are training related questions.

I have done the loading from a generator. It was simple, but I had a hard time figuring out how Audio works. The documentation states that it takes some parameters, but it makes no mention of the encode_example method; I kept passing it to the constructor.

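import os

import datasets
from datasets import Audio, Features, Value


def load_streaming_dataset_from_folder(data_dir):
    def data_generator(data_dir):
        with open(os.path.join(data_dir, 'metadata.csv'), 'r', encoding='utf-8') as f:
            audio_encoder = Audio(sampling_rate=16000)
            for line in f:
                # Skip the CSV header row.
                if line.startswith('file_name'):
                    continue
                # The transcription may itself contain commas, so keep all remaining fields.
                file_name, *transcription = line.rstrip('\n').split(',')
                full_file_name = os.path.join(data_dir, file_name)
                audio = audio_encoder.encode_example(value=full_file_name)
                yield {
                    "audio": audio,
                    # Rejoin the fields so the value is a single string.
                    "transcription": ','.join(transcription),
                }

    # Features expects feature instances, so Audio must be instantiated.
    ds = datasets.IterableDataset.from_generator(
        data_generator,
        features=Features({"audio": Audio(sampling_rate=16000),
                           "transcription": Value(dtype='string')}),
        gen_kwargs={'data_dir': data_dir})
    return ds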

Audio also supports audio paths, so the cleanest solution is to replace audio = audio_encoder.encode_example(value=full_file_name) with audio = full_file_name, and to specify the features as Features({"audio": Audio(sampling_rate=16000), "transcription": Value('string')}).
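
A minimal sketch of that suggestion follows. It assumes metadata.csv has a header row with file_name and transcription columns (the second column name is an assumption); the function names iter_metadata_rows and load_streaming_dataset are hypothetical. The csv module is used instead of splitting on commas so that quoted transcriptions containing commas are parsed correctly.

```python
import csv
import os


def iter_metadata_rows(data_dir):
    """Yield one example per row of metadata.csv without scanning data_dir.

    Assumes a header row with "file_name" and "transcription" columns.
    """
    metadata_path = os.path.join(data_dir, "metadata.csv")
    with open(metadata_path, newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f):
            yield {
                # A plain path string; the Audio feature handles decoding later.
                "audio": os.path.join(data_dir, row["file_name"]),
                "transcription": row["transcription"],
            }


def load_streaming_dataset(data_dir):
    # Imported here so the generator above stays usable without `datasets`.
    from datasets import Audio, Features, IterableDataset, Value

    features = Features({
        "audio": Audio(sampling_rate=16000),
        "transcription": Value("string"),
    })
    return IterableDataset.from_generator(
        iter_metadata_rows, features=features, gen_kwargs={"data_dir": data_dir}
    )
```

Only the files listed in the metadata file are touched, so nothing has to be copied to a separate directory.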


Thanks, I tried it. For some reason, performance was much worse when the audio field contained only the path. Computing the features took longer, which makes sense, but the overall run took much longer as well.

Now I need to get the audio bytes from the audio samples, and I am stuck. I don’t understand how the audio data is stored. There is an array variable, supposedly an Arrow array, but it is always zero in my tests. Is there a method I can call on the audio to get the bytes, or how else could I do it?
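
One possible approach, sketched under assumptions rather than given as a confirmed answer: when the column is cast to Audio(decode=False), each sample is a dict with "bytes" and "path" keys, and the encoded file bytes can be taken from "bytes" or read from "path". When the sample is decoded instead, the waveform is a sequence of float samples, which can be serialized to WAV bytes with the standard library. The helper names encoded_audio_bytes and pcm16_wav_bytes are hypothetical, and the second helper assumes mono audio with samples in the range [-1.0, 1.0].

```python
import io
import struct
import wave


def encoded_audio_bytes(sample):
    """Return the encoded file bytes for an undecoded audio sample.

    `sample` is assumed to be a dict with "bytes" and "path" keys, the
    shape produced by a column cast to Audio(decode=False).
    """
    if sample.get("bytes") is not None:
        return sample["bytes"]
    with open(sample["path"], "rb") as f:
        return f.read()


def pcm16_wav_bytes(samples, sampling_rate):
    """Serialize mono float samples in [-1.0, 1.0] as 16-bit PCM WAV bytes."""
    clipped = [max(-1.0, min(1.0, float(s))) for s in samples]
    frames = struct.pack(f"<{len(clipped)}h", *(int(s * 32767) for s in clipped))
    buffer = io.BytesIO()
    with wave.open(buffer, "wb") as wav:
        wav.setnchannels(1)   # mono
        wav.setsampwidth(2)   # 2 bytes per sample = 16-bit PCM
        wav.setframerate(sampling_rate)
        wav.writeframes(frames)
    return buffer.getvalue()
```

For a decoded sample, a call such as pcm16_wav_bytes(sample["array"], sample["sampling_rate"]) would be the intended usage. Values near zero at the start of the array may simply be leading silence in the recording, so it can help to inspect the maximum absolute value over the whole array.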