@polinaeterna Thanks for your response.
After debugging ArrowWriter I worked around this by setting the max batch size to something smaller than the default of 1000:
import datasets

datasets.config.DEFAULT_MAX_BATCH_SIZE = 10  # default is 1000
dataset = datasets.load_dataset("musdb_dataset.py")
And now it works.
The problem is that in the code above I try to load 5 audio files, each about 3 minutes long at 44 kHz, which comes to roughly 150 MB, and combine them into a single sample. I was going to cut them into smaller pieces later and just wanted to run this code as a sanity check.
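For reference, a quick back-of-the-envelope check lands in the same ballpark (stereo 16-bit PCM once decoded is my assumption, not something I measured):

n_files = 5
seconds = 3 * 60           # ~3 minutes per file
sample_rate = 44_100       # Hz
channels = 2               # stereo (assumption)
bytes_per_sample = 2       # 16-bit PCM (assumption)
total_bytes = n_files * seconds * sample_rate * channels * bytes_per_sample
print(f"{total_bytes / 1e6:.0f} MB")  # ~159 MB, close to the ~150 MB mentioned above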
While debugging I looked at the ArrowWriter.write method:
if self._check_duplicates:
    # Create unique hash from key and store as (key, example) pairs
    hash = self._hasher.hash(key)
    self.current_examples.append((example, hash))
    # Maintain record of keys and their respective hashes for checking duplicates
    self.hkey_record.append((hash, key))
else:
    # Store example as a tuple so as to keep the structure of `self.current_examples` uniform
    self.current_examples.append((example, ""))

if writer_batch_size is None:
    writer_batch_size = self.writer_batch_size
if writer_batch_size is not None and len(self.current_examples) >= writer_batch_size:
    if self._check_duplicates:
        self.check_duplicate_keys()
        # Re-intializing to empty list for next batch
        self.hkey_record = []

    self.write_examples_on_file()
It keeps appending examples to self.current_examples and slowly fills the entire RAM. self.writer_batch_size is read from datasets.config.DEFAULT_MAX_BATCH_SIZE, and after setting that to a smaller value the Arrow file is written out before RAM gets bloated.
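For anyone else hitting this: instead of patching datasets.config.DEFAULT_MAX_BATCH_SIZE globally, I think the cleaner option is to set DEFAULT_WRITER_BATCH_SIZE on the builder class inside the loading script; as far as I can tell it ends up being used as the ArrowWriter's writer_batch_size. A rough sketch (the class and feature names here are illustrative, not my actual script):

import datasets


class Musdb(datasets.GeneratorBasedBuilder):
    # Flush to the Arrow file every 4 examples instead of the default 1000
    # (datasets.config.DEFAULT_MAX_BATCH_SIZE), so only a few large examples
    # are buffered in RAM at any time.
    DEFAULT_WRITER_BATCH_SIZE = 4

    def _info(self):
        return datasets.DatasetInfo(
            features=datasets.Features({"audio": datasets.Audio(sampling_rate=44_100)})
        )

    def _split_generators(self, dl_manager):
        # download/extract the archives and pass the file lists to _generate_examples
        ...

    def _generate_examples(self, files):
        # yield (key, {"audio": path}) pairs; omitted here
        ...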