AudioFolder loads only half of the files in a folder

I have two batches of wav files from different sources copied into the same folder. I also have a metadata.csv listing all of the files from the two batches in the same folder. However, when I run load_dataset on this folder it somehow mysteriously produces a Dataset with files from the first batch only. How does it know where the files came from?

I tried cleanup_cache_files and also deleted the cache altogether, but to no avail.

If the folder holds only the files from one of the batches (either first or second) and metadata listing files from this batch, I get the correct dataset. But with both batches, I get only the first one.

The metadata.csv file is always prepared the same way and has two columns: file_name and transcription.

How could that be and how can I remedy this?

This is the loading script:

from datasets import load_dataset

# ds_path is the folder holding the wav files and metadata.csv
dataset = load_dataset(
    "audiofolder",
    data_dir=ds_path,
    cache_dir='./dataset_cache/'
)

Hi! Do you have an example dataset that reproduces this issue? Can you upload it to HF?

Otherwise please make sure that your files are all under the same directory and that you have one single metadata file.

Note that the datasets cache is invalidated as soon as at least one file is modified (based on the last-modified dates of the files).
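The check suggested above (all files under one directory, a single metadata file) can be done by hand before calling load_dataset. Here is a minimal sketch using only the standard library. The helper name check_audiofolder is made up for illustration, and it assumes metadata.csv has a file_name column with paths relative to the metadata file, as described in the question.

```python
import csv
from pathlib import Path


def check_audiofolder(folder):
    """Compare metadata.csv entries with the .wav files actually on disk."""
    folder = Path(folder)
    # More than one metadata.csv under the folder is worth knowing about.
    metadata_files = sorted(folder.rglob("metadata.csv"))

    listed = set()
    for meta in metadata_files:
        with meta.open(newline="", encoding="utf-8") as f:
            for row in csv.DictReader(f):
                # Assumption: file_name is relative to the metadata file's directory.
                listed.add((meta.parent / row["file_name"]).resolve())

    on_disk = {p.resolve() for p in folder.rglob("*.wav")}
    return {
        "metadata_files": len(metadata_files),
        "listed_but_missing": sorted(str(p) for p in listed - on_disk),
        "on_disk_but_unlisted": sorted(str(p) for p in on_disk - listed),
    }
```

If metadata_files is 1 and both lists are empty, the folder and the metadata agree, and any missing rows after loading must come from somewhere else.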

Hi! That’s the mystery: all files were in the same folder with a single metadata.csv listing all files.

The files come from this Indian English dataset

But I only used the IndicTTS_Phase2_Assamese_fem_Speaker1_english and IndicTTS_Phase2_Assamese_male_Speaker1_english parts.

The file names are based on different patterns: ASF001-EN-ST…wav for the female part and train_assamesemale_…wav for the male part.

If both parts are in the same folder, the load_dataset function picks only the male files, even though metadata.csv includes all of them.

If the folder holds only male or only female files (and the correct metadata), then it works correctly.

So I wonder whether there is a file naming requirement. Some kind of pattern?

P.S. By the way, the function worked fine with the Tamil female and male part where files are named train_tamilfemale_…wav and train_tamilmale_…wav respectively.

I see!

The viewer infers the splits (train/test/valid) based on the file names. In your case the viewer is showing the “train” split, which only covers the files containing "train" in their names.
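To make the effect concrete, here is a rough sketch of keyword-based split inference. It is a simplified stand-in, not the library's actual code: the keyword lists, the separator set, and the fallback rule are assumptions based on my understanding of the documented behavior, and the file names are placeholders that only mirror the prefixes mentioned above.

```python
import re

# ASSUMPTION: simplified stand-ins for the split keywords and separators.
SPLIT_KEYWORDS = {
    "train": ["train", "training"],
    "validation": ["validation", "valid", "dev", "val"],
    "test": ["test", "testing", "eval", "evaluation"],
}
SEP = r"[-._ 0-9]"


def infer_split(file_name):
    """Return the split whose keyword appears as a delimited token, else None."""
    for split, keywords in SPLIT_KEYWORDS.items():
        for kw in keywords:
            if re.search(rf"(?:^|{SEP}){kw}(?:{SEP}|$)", file_name):
                return split
    return None


def files_in_train_split(names):
    """Model which files end up in 'train' under the assumed rules."""
    if all(infer_split(n) is None for n in names):
        # No keyword anywhere: everything goes into a single train split.
        return list(names)
    # Some files name a split, so files that name none are left out.
    return [n for n in names if infer_split(n) == "train"]


# Placeholder names (XXXX is not a real suffix).
female = "ASF001-EN-ST_XXXX.wav"
male = "train_assamesemale_XXXX.wav"
print(files_in_train_split([female, male]))
```

Under these assumptions, a folder with only the female files loads completely, while a mixed folder keeps only the files whose names start with train_, which matches what was reported in this thread.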

You can specify data_files="**/*.wav" in load_dataset() to fix this.

If the dataset is also on the HF Hub, you can set this as the default in the README.md header using:

configs:
- config_name: default
  data_files: "**/*.wav"
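For local loading, the data_files suggestion can be combined with the call from the original post roughly as follows. This is a sketch: the helper names audiofolder_kwargs and load_all_wavs are made up for illustration, and ds_path stands for whatever folder holds the wav files and metadata.csv.

```python
def audiofolder_kwargs(ds_path, cache_dir="./dataset_cache/"):
    """Build load_dataset arguments with an explicit data_files pattern."""
    return {
        "path": "audiofolder",
        "data_dir": ds_path,
        "data_files": "**/*.wav",  # the pattern suggested above
        "cache_dir": cache_dir,
    }


def load_all_wavs(ds_path):
    # Imported here so the helper above can be used without the library installed.
    from datasets import load_dataset

    return load_dataset(**audiofolder_kwargs(ds_path))
```

After loading, comparing dataset["train"].num_rows with the number of rows in metadata.csv is a quick way to confirm that both batches were picked up.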

Thank you! Now I see too :)

Maybe, this information on file naming patterns should be available someplace (maybe it is but I did not see it) but I do not know who I should write to.

Feel free to propose a contribution to hub-docs/docs/hub/datasets-file-names-and-splits.md in the huggingface/hub-docs repository on GitHub (which is the source for the "File names and splits" page). It would be greatly appreciated!

Thanks! File naming conventions are well described on these pages. I just missed the description somehow, being new to the platform.
