Memory error while loading custom dataset

polinaeterna · March 3, 2023, 1:05pm

Hi @sebchw! I'm not sure what's causing the error and the memory overload (do you have any ideas, @lhoestq?), but note that when you provide arrays for an Audio feature, the library actually encodes those arrays to bytes under the hood and stores the audio as bytes. Then, when you load the dataset and access samples, the audio is decoded on the fly with the datasets library's standard decoding. I think we should clarify this in the docs.
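To make the "arrays are written to bytes, then decoded on access" round trip concrete, here is a rough illustration using only Python's standard library. It is not the datasets library's actual implementation (the helper names `encode_wav_bytes` and `decode_wav_bytes` are made up for this sketch, and mono 16-bit WAV is an assumed format); it only shows the general idea.

```python
import io
import sys
import wave
from array import array


def encode_wav_bytes(samples, sampling_rate):
    """Encode mono float samples in [-1, 1] as 16-bit PCM WAV bytes."""
    pcm = array("h", (int(max(-1.0, min(1.0, s)) * 32767) for s in samples))
    if sys.byteorder == "big":
        pcm.byteswap()  # WAV data is little-endian
    buffer = io.BytesIO()
    with wave.open(buffer, "wb") as wav_file:
        wav_file.setnchannels(1)
        wav_file.setsampwidth(2)  # 2 bytes per sample = 16 bit
        wav_file.setframerate(sampling_rate)
        wav_file.writeframes(pcm.tobytes())
    return buffer.getvalue()


def decode_wav_bytes(data):
    """Decode WAV bytes back into a list of floats and the sampling rate."""
    with wave.open(io.BytesIO(data), "rb") as wav_file:
        sampling_rate = wav_file.getframerate()
        pcm = array("h")
        pcm.frombytes(wav_file.readframes(wav_file.getnframes()))
    if sys.byteorder == "big":
        pcm.byteswap()
    return [s / 32767 for s in pcm], sampling_rate
```

Every stored sample goes through an encode step like this when the dataset is built and a decode step like this each time it is accessed.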

So if you want to apply your custom decoding with stempeg, you can set decode=False on the Audio features (in _info) and provide only the paths to the local audio files in _generate_examples, something like:

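    # needs `from pathlib import Path` at the top of the loading script
    def _generate_examples(self, audio_path):
        names = ["mixture", "drums", "bass", "other", "vocals"]

        # nothing is decoded here: every stem feature only stores the path
        # to the same multi-stem file
        for id_, stems_path in enumerate(Path(audio_path).iterdir()):
            yield id_, {
                "name": stems_path.stem,
                # pass the path as a plain string rather than a Path object
                **{name: {"path": str(stems_path)} for name in names},
            }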

and then use your custom decoding function on the loaded dataset.
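A minimal sketch of what such a custom decoding function could look like. The name `decode_stems` and its `read_stems` parameter are hypothetical, and the sketch assumes the stems in each file come in the same order as the `names` list above; the reader is passed in as an argument so the function does not depend on a particular decoding library.

```python
STEM_NAMES = ("mixture", "drums", "bass", "other", "vocals")


def decode_stems(example, read_stems, names=STEM_NAMES):
    """Decode the multi-stem file referenced by one (undecoded) example.

    `read_stems` is any callable that takes a file path and returns
    (stems, sampling_rate), where stems[i] is the audio of the i-th stem.
    """
    # with decode=False each audio feature is a dict holding the file path;
    # all stem features point to the same file, so read it only once
    path = example[names[0]]["path"]
    stems, sampling_rate = read_stems(path)

    decoded = {"name": example["name"], "sampling_rate": sampling_rate}
    for index, name in enumerate(names):
        # assumption: stems are stored in the same order as `names`
        decoded[name] = stems[index]
    return decoded
```

With stempeg installed, something like `decode_stems(dataset[0], stempeg.read_stems)` should work, since to my knowledge `stempeg.read_stems` returns a `(stems, rate)` pair, but please check that against the stempeg version you use.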
