Loading data from Datasets takes too much memory

I am dealing with audio files and have code to create an HF dataset from pandas.


It is quite fast; however, if I do something like `hg_data['train']['audio'][0]` to see if it can load the audio (a NumPy array), my SageMaker notebook crashes (even with 16 GB of memory), as does running `import IPython.display as ipd`.
My features are defined like this:

features = Features({
    'tag': ClassLabel(names=unique_genres, id=None),
    'SegID': Value(dtype='int32', id=None),
    'value': Value(dtype='int64', id=None),
    'audio': Value(dtype='string', id=None),
})
_df_train = Dataset.from_pandas(_df_train, features=features)
_df_train = _df_train.cast_column("audio", Audio(sampling_rate=16000))

Hi! It’s because `hg_data['train']['audio']` loads the full list of all the audio data in the dataset into RAM.

So instead of using `data['train']['audio'][0]`, you should simply use `data['train'][0]['audio']`, which loads only the first example into memory.

