@polinaeterna Thanks for your response.
After debugging ArrowWriter I worked around this by setting the max batch size to something smaller than the default of 1000:
import datasets

datasets.config.DEFAULT_MAX_BATCH_SIZE = 10  # default is 1000
dataset = datasets.load_dataset("musdb_dataset.py")
And now it works.
The problem is that in the code above I try to load 5 audio files, each about 3 minutes long at 44 kHz, which comes to roughly 150 MB, and combine them into a single sample. I was going to cut them into smaller pieces later and just wanted to run this code as a sanity check.
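For reference, a quick back-of-the-envelope check lands in the same ballpark (stereo 16-bit PCM once decoded is my assumption, not something I measured):

n_files = 5
seconds = 3 * 60           # ~3 minutes per file
sample_rate = 44_100       # Hz
channels = 2               # stereo (assumption)
bytes_per_sample = 2       # 16-bit PCM (assumption)
total_bytes = n_files * seconds * sample_rate * channels * bytes_per_sample
print(f"{total_bytes / 1e6:.0f} MB")  # ~159 MB, close to the ~150 MB mentioned above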
While debugging I looked at the ArrowWriter.write method:
if self._check_duplicates:
    # Create unique hash from key and store as (key, example) pairs
    hash = self._hasher.hash(key)
    self.current_examples.append((example, hash))
    # Maintain record of keys and their respective hashes for checking duplicates
    self.hkey_record.append((hash, key))
else:
    # Store example as a tuple so as to keep the structure of `self.current_examples` uniform
    self.current_examples.append((example, ""))

if writer_batch_size is None:
    writer_batch_size = self.writer_batch_size
if writer_batch_size is not None and len(self.current_examples) >= writer_batch_size:
    if self._check_duplicates:
        self.check_duplicate_keys()
        # Re-intializing to empty list for next batch
        self.hkey_record = []

    self.write_examples_on_file()
It keeps appending examples to self.current_examples and slowly fills the entire RAM. self.writer_batch_size is read from datasets.config.DEFAULT_MAX_BATCH_SIZE, and after setting that to a smaller value the Arrow file is written out before RAM gets bloated.
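For anyone else hitting this: instead of patching datasets.config.DEFAULT_MAX_BATCH_SIZE globally, I think the cleaner option is to set DEFAULT_WRITER_BATCH_SIZE on the builder class inside the loading script; as far as I can tell it ends up being used as the ArrowWriter's writer_batch_size. A rough sketch (the class and feature names here are illustrative, not my actual script):

import datasets


class Musdb(datasets.GeneratorBasedBuilder):
    # Flush to the Arrow file every 4 examples instead of the default 1000
    # (datasets.config.DEFAULT_MAX_BATCH_SIZE), so only a few large examples
    # are buffered in RAM at any time.
    DEFAULT_WRITER_BATCH_SIZE = 4

    def _info(self):
        return datasets.DatasetInfo(
            features=datasets.Features({"audio": datasets.Audio(sampling_rate=44_100)})
        )

    def _split_generators(self, dl_manager):
        # download/extract the archives and pass the file lists to _generate_examples
        ...

    def _generate_examples(self, files):
        # yield (key, {"audio": path}) pairs; omitted here
        ...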