Serially creating a very large dataset using from_generator() gets slower each iteration (slows to 2-3 s per example!)

Hello, I’m having problems creating a large dataset from a 200 GB+ store of 200,000 audio files. Since I cannot load the entire dataset into memory, I’m using Dataset.from_generator.
However, I’m finding that it slows down as more examples are added: for the first ~3,000 samples the data is processed at ~30 examples/s, past sample 5,000 this drops to 1 example/s, and now, at ~20,000 samples processed, the generator is taking 3 s per sample.

  • Question 1: Is there a better way of doing this, i.e. loading each audio file, computing its metadata, and adding it to a dataset? I don’t have the resources to load the entire 200 GB table into memory. (A sketch of the shard-by-shard variant I’m weighing is below this list.)
  • Question 2: Is the usage of from_generator correct here?
    Multiprocessing will be tough, since my ds_generator calls PyTorch models.
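For reference, this is roughly the shard-by-shard alternative I have in mind for Question 1. It is only a sketch: create_dataset_in_shards, out_dir, and the shard size are placeholders I made up, and ds_generator is the generator shown further down.

import functools
from datasets import Dataset, concatenate_datasets, load_from_disk

def create_dataset_in_shards(feature_dir, speaker_ids, filenames, audiopaths,
                             out_dir, shard_size=2000):
    # Build the dataset a few thousand examples at a time, flush each shard to
    # disk, then reload and concatenate the (memory-mapped) shards at the end.
    shard_paths = []
    for start in range(0, len(audiopaths), shard_size):
        end = start + shard_size
        gen = functools.partial(
            ds_generator,  # the generator defined below
            feature_dir=feature_dir,
            speaker_ids=speaker_ids[start:end],
            filenames=filenames[start:end],
            audiopaths=audiopaths[start:end],
        )
        shard = Dataset.from_generator(gen)
        shard_path = f"{out_dir}/shard_{start:06d}"
        shard.save_to_disk(shard_path)
        shard_paths.append(shard_path)
        del shard  # drop the in-memory handle before starting the next shard
    return concatenate_datasets([load_from_disk(p) for p in shard_paths])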

My code:
My function ds_generator takes in the audio filepaths and computes the rows of the dataset. I call the generator in create_large_dataset.
I’ve verified that there’s no issue in running the computation alone (simply generating the dataset values without saving).

Here are my generator and dataset creation functions.


import functools

import librosa
import numpy as np
from datasets import Dataset


def ds_generator(feature_dir, speaker_ids, filenames, audiopaths):
    # Yield one dataset row per audio file: raw waveform plus precomputed metadata.
    for ii, (spkr, fname, audpath) in enumerate(zip(speaker_ids, filenames, audiopaths)):
        mfcc_feats = get_mfcc_ivec_feats(audpath)
        likelihoods = compute_likelihood(audpath)
        transcript = get_transcript(audpath)
        audio = librosa.load(audpath, sr=16000)  # returns (waveform, sample_rate)
        data = {'input_values': audio[0], 'likelihoods': likelihoods,
                'mfcc_ivector': np.array(mfcc_feats),
                'filepath': audpath, 'fileid': fname}
        yield data

def create_large_dataset(audio_directory, file_extension, dataset_name):
    # Collect audio paths and per-file metadata, then stream rows through the generator.
    audio_filepaths = get_all_filetype_in_dir(audio_directory, file_extension)
    speaker_ids, filenames, feature_dir, filtered_filepaths = \
        generate_dataset(audio_paths=audio_filepaths, dataset_name=dataset_name)

    dataset_fn = functools.partial(ds_generator, feature_dir=feature_dir, speaker_ids=speaker_ids,
                                   filenames=filenames, audiopaths=filtered_filepaths)
    ds = Dataset.from_generator(dataset_fn, cache_dir=audio_directory)
    return ds
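Regarding Question 2: from_generator() also accepts a gen_kwargs argument, so the functools.partial wrapper isn’t strictly needed. A minimal sketch using the same variables as in create_large_dataset above:

ds = Dataset.from_generator(
    ds_generator,
    gen_kwargs={
        "feature_dir": feature_dir,
        "speaker_ids": speaker_ids,
        "filenames": filenames,
        "audiopaths": filtered_filepaths,
    },
    cache_dir=audio_directory,
)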

Issue solved: there was a memory leak in one of my metadata extraction functions. An older version of the function didn’t have the leak, which is why I missed it.
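In case it helps anyone hitting a similar slowdown: logging the process’s resident memory inside the generator loop is a quick way to spot this kind of leak. A rough sketch (psutil and the 500-example interval are my own choices, nothing from datasets):

import os
import psutil

proc = psutil.Process(os.getpid())

def log_rss(ii, every=500):
    # Print resident memory every `every` examples; a steady climb across
    # examples points at a leak in one of the per-file extraction functions.
    if ii % every == 0:
        print(f"example {ii}: rss = {proc.memory_info().rss / 1e9:.2f} GB")

Calling something like log_rss(ii) at the top of the ds_generator loop would make a steadily growing footprint obvious if one of the extractors is leaking.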