Memory error while loading custom dataset

@polinaeterna Thanks for your response.

After debugging ArrowWriter, I hacked around this by setting the max batch size to something smaller than 1000:

    import datasets

    datasets.config.DEFAULT_MAX_BATCH_SIZE = 10
    dataset = datasets.load_dataset("musdb_dataset.py")

And now it works.
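If patching the global config feels too hacky, I think the loading script itself can also cap the writer batch size through the `DEFAULT_WRITER_BATCH_SIZE` class attribute on the builder. A minimal sketch of that idea (the class name and file paths here are placeholders, not the real contents of `musdb_dataset.py`):

    import datasets

    # Sketch only: class name and file paths are placeholders.
    class MusdbDataset(datasets.GeneratorBasedBuilder):
        # ArrowWriter flushes a batch to disk once this many examples are buffered,
        # so RAM stays bounded even when a single example is ~150MB of audio.
        DEFAULT_WRITER_BATCH_SIZE = 10

        def _info(self):
            return datasets.DatasetInfo(
                features=datasets.Features({"audio": datasets.Audio(sampling_rate=44_100)})
            )

        def _split_generators(self, dl_manager):
            files = ["track_01.wav", "track_02.wav"]  # placeholder local files
            return [datasets.SplitGenerator(name=datasets.Split.TRAIN, gen_kwargs={"files": files})]

        def _generate_examples(self, files):
            for key, path in enumerate(files):
                yield key, {"audio": path}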

The problem is that in the code above I try to load 5 audio files, each 3 minutes long at a 44 kHz sample rate (about 150MB in total), and make one sample out of them. I was going to cut them into smaller pieces later and just wanted to run this code as a sanity check.

And during debugging I looked at the ArrowWriter.write method:

    if self._check_duplicates:
        # Create unique hash from key and store as (key, example) pairs
        hash = self._hasher.hash(key)
        self.current_examples.append((example, hash))
        # Maintain record of keys and their respective hashes for checking duplicates
        self.hkey_record.append((hash, key))
    else:
        # Store example as a tuple so as to keep the structure of `self.current_examples` uniform
        self.current_examples.append((example, ""))

    if writer_batch_size is None:
        writer_batch_size = self.writer_batch_size
    if writer_batch_size is not None and len(self.current_examples) >= writer_batch_size:
        if self._check_duplicates:
            self.check_duplicate_keys()
            # Re-intializing to empty list for next batch
            self.hkey_record = []

        self.write_examples_on_file()

It keeps appending examples to self.current_examples and slowly fills the entire RAM. self.writer_batch_size is read from datasets.config.DEFAULT_MAX_BATCH_SIZE, and after setting that to a smaller value the Arrow file is written to disk before RAM bloats.
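For scale, a rough back-of-the-envelope with the numbers above (~150MB per example, default batch size of 1000):

    # Rough estimate of how much ArrowWriter buffers before the first flush,
    # using the figures from above (~150MB per example).
    bytes_per_example = 150 * 1024**2

    default_batch = 1000  # the default implied above for datasets.config.DEFAULT_MAX_BATCH_SIZE
    patched_batch = 10    # the value set in the workaround

    print(f"default: ~{bytes_per_example * default_batch / 1024**3:.0f} GiB buffered")  # ~146 GiB
    print(f"patched: ~{bytes_per_example * patched_batch / 1024**3:.1f} GiB buffered")  # ~1.5 GiB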