Serially creating a very large dataset using from_generator(), slower each iteration (slows to >2-3s per example!)

pkadambi · May 17, 2023, 5:24am

Hello, I’m having problems creating a large dataset from a 200 GB+ store of 200,000 audio files. Since I cannot load the entire dataset into memory, I’m using Dataset.from_generator.
However, I’m finding that generation slows down as more examples are added. For the first ~3000 samples, the data is processed at 30 examples/s. Past sample 5000 this slows to 1 example/s, and now, at ~20000 samples processed, the generator is taking 3 s/sample.

Question 1: Is there a better way of doing this, i.e. loading each audio file, computing the metadata, and adding it to a dataset? I don’t have the resources to load the entire 200 GB table into memory.
Question 2: Is the usage of from_generator correct here?
Multiprocessing will be tough since my ds_generator calls PyTorch models.

My code:
My function ds_generator takes in the audio filepaths and computes the rows of the dataset. I call the generator in create_large_dataset.
I’ve verified that there’s no issue in running the computation alone (simply generating the dataset values without saving).

Here are my generator and dataset creation functions.


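import functools

import librosa
import numpy as np
from datasets import Dataset

def ds_generator(feature_dir, speaker_ids, filenames, audiopaths):
    # Yield one dataset row per audio file; the get_*/compute_* helpers are defined elsewhere.
    for spkr, fname, audpath in zip(speaker_ids, filenames, audiopaths):
        mfcc_feats = get_mfcc_ivec_feats(audpath)
        likelihoods = compute_likelihood(audpath)
        transcript = get_transcript(audpath)  # computed but not stored in the row
        # librosa.load returns (waveform, sample_rate)
        audio, _sr = librosa.load(audpath, sr=16000)
        data = {'input_values': audio, 'likelihoods': likelihoods, 'mfcc_ivector': np.array(mfcc_feats),
                'filepath': audpath, 'fileid': fname}
        yield data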

def create_large_dataset(audio_directory, file_extension, dataset_name):

    audio_filepaths = get_all_filetype_in_dir(audio_directory, file_extension)
    speaker_ids, filenames, feature_dir, filtered_filepaths = \
        generate_dataset(audio_paths=audio_filepaths, dataset_name=dataset_name)

    dataset_fn = functools.partial(ds_generator, feature_dir=feature_dir, speaker_ids=speaker_ids,
                              filenames=filenames, audiopaths=filtered_filepaths)
    ds = Dataset.from_generator(dataset_fn, cache_dir=audio_directory)
    return ds

pkadambi · May 18, 2023, 4:59pm

Issue solved: there was a memory leak in one of my metadata extraction functions. The old version of the function didn’t have the leak, so I missed it.
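
For anyone hitting a similar slowdown, here is a minimal sketch of one way to check whether a generator's memory use grows with the number of examples. It is not the code from this thread: track_memory, leaky_extract and demo_generator are hypothetical names made up for illustration, and the "leak" is simulated with a module-level list. It uses only the standard-library tracemalloc module, which sees allocations made through Python's allocator; a leak inside native code (for example in a C++ extension or on the GPU) may not show up here, and watching the process RSS would be a better signal in that case.

```python
import tracemalloc
from typing import Callable, Iterable, Iterator


def _print_report(count: int, current_bytes: int) -> None:
    print(f"{count} examples: {current_bytes / 1e6:.1f} MB traced")


def track_memory(gen: Iterable, every: int = 1000,
                 report: Callable[[int, int], None] = _print_report) -> Iterator:
    """Yield items from `gen` unchanged, reporting traced memory every `every` items."""
    started_here = not tracemalloc.is_tracing()
    if started_here:
        tracemalloc.start()
    try:
        for count, item in enumerate(gen, start=1):
            yield item
            if count % every == 0:
                current, _peak = tracemalloc.get_traced_memory()
                report(count, current)
    finally:
        # Only stop tracing if this wrapper was the one that started it.
        if started_here:
            tracemalloc.stop()


# Hypothetical leaky extraction function: it keeps a reference to a buffer
# on every call, so memory grows with the number of processed files.
_cache = []


def leaky_extract(path: str) -> dict:
    _cache.append(bytearray(10_000))
    return {"filepath": path}


def demo_generator(paths: Iterable[str]) -> Iterator[dict]:
    for p in paths:
        yield leaky_extract(p)


samples = []  # (examples processed, traced bytes) pairs
paths = (f"file_{i}.wav" for i in range(300))
rows = list(track_memory(demo_generator(paths), every=100,
                         report=lambda n, b: samples.append((n, b))))
```

If the reported number keeps climbing while the rows themselves are being written out and discarded, something in the per-example functions is probably holding on to memory. Swapping the extraction functions in and out one at a time is then a simple way to narrow down which one.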
