I have two batches of wav files from different sources copied into the same folder. I also have a metadata.csv listing all of the files from the two batches in the same folder. However, when I run load_dataset on this folder it somehow mysteriously produces a Dataset with files from the first batch only. How does it know where the files came from?
I tried cleanup_cache_files and also tried deleting the cache altogether, but to no avail.
If the folder holds only the files from one of the batches (either first or second) and metadata listing files from this batch, I get the correct dataset. But with both batches, I get only the first one.
The metadata.csv file is always prepared the same way and has two columns: file_name and transcription.
But I only used the IndicTTS_Phase2_Assamese_fem_Speaker1_english and IndicTTS_Phase2_Assamese_male_Speaker1_english parts.
The file names are based on different patterns: ASF001-EN-ST…wav for the female part and train_assamesemale_…wav for the male part.
If both parts are in the same folder, the load_dataset function picks only the male files, even though metadata.csv lists all of them.
If the folder holds only male or only female files (and the correct metadata), then it works correctly.
So I wonder whether there is a file naming requirement. Some kind of pattern?
P.S. By the way, the function worked fine with the Tamil female and male part where files are named train_tamilfemale_…wav and train_tamilmale_…wav respectively.
The viewer infers the splits (train/test/valid) based on the file names. In your case, the viewer is showing the “train” split, which only covers the files containing train in their names.
You can specify data_files="**/*.wav" in load_dataset() to fix this.
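To make this concrete, here is a rough sketch of keyword-based split inference. It is a simplified approximation, not the library's actual code: the keyword lists, the separator set, the function names, and the numbered file names (the real names are elided above) are all assumptions for illustration.

```python
import re

# Assumed split keywords; the library keeps its own, possibly different, list.
SPLIT_KEYWORDS = {
    "train": ["train", "training"],
    "test": ["test", "testing", "eval", "evaluation"],
    "validation": ["validation", "valid", "dev", "val"],
}
# Assumed separators that may surround a keyword in a file name.
SEP = r"[-._ 0-9]"


def infer_split(file_name):
    """Return the split whose keyword appears in file_name, or None."""
    for split, keywords in SPLIT_KEYWORDS.items():
        for kw in keywords:
            # Keyword at the start (or after a separator or slash), then a separator.
            if re.search(rf"(?:^|{SEP}|/){re.escape(kw)}{SEP}", file_name):
                return split
    return None


def select_files(file_names):
    """Group files by inferred split; fall back to one train split if none match."""
    by_split = {}
    for name in file_names:
        split = infer_split(name)
        if split is not None:
            by_split.setdefault(split, []).append(name)
    # Files without a keyword are dropped as soon as any other file has one.
    return by_split or {"train": list(file_names)}


# Hypothetical file names following the two patterns described in the question.
mixed = ["ASF001-EN-ST-0001.wav", "train_assamesemale_0001.wav"]
print(select_files(mixed))
print(select_files(["ASF001-EN-ST-0001.wav"]))
```

Under these assumptions, the mixed folder keeps only the file with train in its name, while a folder with only the female files falls back to a single train split containing everything. That matches the behaviour described in the question, including the Tamil case, where both naming patterns contain train.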
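A minimal sketch of that call, assuming the datasets library is installed and the folder contains the wav files plus metadata.csv. The helper name load_all_wavs and the folder path are hypothetical; only load_dataset and its data_files argument come from the answer above.

```python
def load_all_wavs(folder):
    """Load every wav file under folder, bypassing split inference from file names."""
    # Imported here so the sketch can be read without datasets installed.
    from datasets import load_dataset

    # The explicit glob replaces the keyword-based defaults.
    return load_dataset(folder, data_files="**/*.wav")


if __name__ == "__main__":
    # Hypothetical path; replace with the folder holding both batches.
    ds = load_all_wavs("path/to/folder")
    # Check that both batches are present and that the transcription column was loaded.
    print(ds)
```

Printing the result shows the number of rows and the column names, which is a quick way to confirm that files from both batches were picked up. If the transcription column is missing, it may be worth checking whether metadata.csv also needs to be covered by data_files in your installed version; that part is an assumption, not something confirmed in this thread.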
If the dataset is also on the HF Hub, you can also set this as the default by declaring the data files in the YAML header of README.md.
Maybe, this information on file naming patterns should be available someplace (maybe it is but I did not see it) but I do not know who I should write to.