Hello, I’m having problems creating a large dataset from a 200 GB+ store of 200,000 audio files. Since I cannot load the entire dataset into memory, I’m using Dataset.from_generator.
However, I’m finding that it slows down as more examples are added. For the first ~3000 samples, the data is processed at 30 examples/s. Past sample 5000 this slows to 1 example/s, and now, at ~20000 samples processed, the generator is taking 3 s/sample.
- Question 1: Is there a better way of doing this, i.e. loading each audio file, computing the metadata, and adding it to a dataset? I don’t have the resources to load the entire 200 GB table into memory.
- Question 2: Is the usage of from_generator correct here?
Multiprocessing will be tough since my ds_generator calls PyTorch models.
My code:
My function ds_generator takes in the audio filepaths and computes the rows of the dataset. I call the generator in create_large_dataset.
I’ve verified that there’s no issue in running the computation alone (simply generating the dataset values without saving).
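To separate the cost of the computation from the cost of writing the dataset, the per-example time can be logged from inside the pipeline. The sketch below is only an illustration: timed and its report callback are hypothetical helper names, not part of the datasets library. Wrapping the generator and draining it with a plain loop times the computation alone; wrapping the same generator inside the function passed to from_generator times computation plus whatever the consumer does between examples.

```python
import time
from typing import Callable, Iterable, Iterator, TypeVar

T = TypeVar("T")


def _print_rate(count: int, seconds_per_item: float) -> None:
    # Default reporter: one line per window of examples.
    print(f"{count} examples: {seconds_per_item:.4f} s/example")


def timed(items: Iterable[T], every: int = 1000,
          report: Callable[[int, float], None] = _print_rate) -> Iterator[T]:
    """Yield items unchanged and report the average time per item for each window.

    The measured time covers both producing an item and whatever the consumer
    does with it before asking for the next one.
    """
    window_start = time.perf_counter()
    for count, item in enumerate(items, start=1):
        yield item
        if count % every == 0:
            now = time.perf_counter()
            report(count, (now - window_start) / every)
            window_start = now
```

For example, `for _ in timed(ds_generator(...), every=500): pass` would log the speed of the computation alone, which can then be compared with the speed observed during dataset creation.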
Here are my generator and dataset creation functions.
import functools
import librosa
import numpy as np
from datasets import Dataset

def ds_generator(feature_dir, speaker_ids, filenames, audiopaths):
    # Yields one dataset row (a dict) per audio file.
    for spkr, fname, audpath in zip(speaker_ids, filenames, audiopaths):
        mfcc_feats = get_mfcc_ivec_feats(audpath)
        likelihoods = compute_likelihood(audpath)
        transcript = get_transcript(audpath)  # computed but not stored in the row
        audio, _ = librosa.load(audpath, sr=16000)  # returns (waveform, sample_rate)
        data = {'input_values': audio, 'likelihoods': likelihoods, 'mfcc_ivector': np.array(mfcc_feats),
                'filepath': audpath, 'fileid': fname}
        yield data

def create_large_dataset(audio_directory, file_extension, dataset_name):
    audio_filepaths = get_all_filetype_in_dir(audio_directory, file_extension)
    speaker_ids, filenames, feature_dir, filtered_filepaths = \
        generate_dataset(audio_paths=audio_filepaths, dataset_name=dataset_name)
    dataset_fn = functools.partial(ds_generator, feature_dir=feature_dir, speaker_ids=speaker_ids,
                                   filenames=filenames, audiopaths=filtered_filepaths)
    ds = Dataset.from_generator(dataset_fn, cache_dir=audio_directory)
    return ds
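One possible variation, given here as an untested sketch rather than a known fix for the slowdown, is to build the dataset in fixed-size shards, so that each shard is written and saved on its own and the speed of each shard can be compared. The helper shard_ranges, the shard size of 2000, and the shard directory names are assumptions made for illustration; the commented usage relies on Dataset.from_generator with gen_kwargs, Dataset.save_to_disk, load_from_disk and concatenate_datasets from the datasets library.

```python
from typing import List, Tuple


def shard_ranges(n_items: int, shard_size: int) -> List[Tuple[int, int]]:
    """Split range(n_items) into consecutive half-open (start, stop) index ranges."""
    if shard_size <= 0:
        raise ValueError("shard_size must be positive")
    return [(start, min(start + shard_size, n_items))
            for start in range(0, n_items, shard_size)]


print(shard_ranges(10, 4))  # [(0, 4), (4, 8), (8, 10)]

# Hypothetical usage with the functions from this post (not run here):
#
#   from datasets import Dataset, concatenate_datasets, load_from_disk
#
#   shard_paths = []
#   for i, (lo, hi) in enumerate(shard_ranges(len(filtered_filepaths), 2000)):
#       shard = Dataset.from_generator(
#           ds_generator,
#           gen_kwargs=dict(feature_dir=feature_dir,
#                           speaker_ids=speaker_ids[lo:hi],
#                           filenames=filenames[lo:hi],
#                           audiopaths=filtered_filepaths[lo:hi]),
#           cache_dir=audio_directory)
#       path = f"{dataset_name}_shard_{i:04d}"  # assumed naming scheme
#       shard.save_to_disk(path)
#       shard_paths.append(path)
#
#   ds = concatenate_datasets([load_from_disk(p) for p in shard_paths])
```

If every shard runs at the initial speed, the slowdown is tied to the size of a single from_generator run; if later shards are still slow, the cause is more likely in the per-file computation or in the process state that accumulates across files.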