Loading data from Datasets takes too much memory

I am dealing with audio files and have some code to create an HF dataset from pandas (see the P.S. below for the code).

It is quite fast. However, if I do something like hg_data['train']['audio'][0] to see if it can load the audio (numpy array), my SageMaker notebook crashes (even with 16 GB of memory). The same happens if I do:

import IPython.display as ipd
ipd.Audio(data['train']['audio'][0],
          rate=22050)

P.S.
My features are defined like this:

from datasets import Audio, ClassLabel, Dataset, Features, Value

features = Features({
    'tag': ClassLabel(names=unique_genres, id=None),
    'SegID': Value(dtype='int32', id=None),
    'value': Value(dtype='int64', id=None),
    'audio': Value(dtype='string', id=None)
})

_df_train = Dataset.from_pandas(_df_train,
                               features=features)
_df_train = _df_train.cast_column("audio", Audio(sampling_rate=16000))
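
For reference, a schema check like the one below only reads the dataset features, so it should not load any audio into memory (the exact repr may vary with the datasets version):

# Inspect the column's feature type without decoding any audio
print(_df_train.features['audio'])
# e.g. Audio(sampling_rate=16000, mono=True, decode=True, id=None)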

Hi! It’s because hg_data['train']['audio'] returns the full list of all the audio data in the dataset, loaded in RAM.

So instead of using data['train']['audio'][0] you should simply use data['train'][0]['audio'], which loads only the first example into memory.
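
For example, assuming the dataset built above: after cast_column(..., Audio(sampling_rate=16000)), the decoded 'audio' field of an example is a dict with 'array', 'path' and 'sampling_rate', so you can play it like this:

import IPython.display as ipd

# Row-first indexing reads a single example and decodes only its audio file.
example = data['train'][0]['audio']

ipd.Audio(example['array'], rate=example['sampling_rate'])

# Column-first indexing (data['train']['audio']) would decode every audio file
# in the split into RAM before you could even take [0], which is what crashes the notebook.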
