Loading data from Datasets takes too much memory

I am dealing with audio files and have some code to create an HF dataset from pandas (see the P.S. below for the code).

It is quite fast. However, if I do something like hg_data['train']['audio'][0] to see if it can load the audio (numpy array), my SageMaker notebook crashes (even with 16 GB of memory). The same happens if I do:

import IPython.display as ipd
ipd.Audio(data['train']['audio'][0],
          rate=22050)

P.S.
My features are defined like this:

from datasets import Audio, ClassLabel, Dataset, Features, Value

features = Features({
    'tag': ClassLabel(names=unique_genres, id=None),
    'SegID': Value(dtype='int32', id=None),
    'value': Value(dtype='int64', id=None),
    'audio': Value(dtype='string', id=None)
})

_df_train = Dataset.from_pandas(_df_train,
                               features=features)
_df_train = _df_train.cast_column("audio", Audio(sampling_rate=16000))
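
For reference, a schema check like the one below only reads the dataset features, so it should not load any audio into memory (the exact repr may vary with the datasets version):

# Inspect the column's feature type without decoding any audio
print(_df_train.features['audio'])
# e.g. Audio(sampling_rate=16000, mono=True, decode=True, id=None)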

Hi! It’s because hg_data['train']['audio'] returns the full list of all the audio data in the dataset, loaded in RAM.

So instead of using data['train']['audio'][0] you should simply use data['train'][0]['audio'], which loads only the first example into memory.
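
For example, assuming the dataset built above: after cast_column(..., Audio(sampling_rate=16000)), the decoded 'audio' field of an example is a dict with 'array', 'path' and 'sampling_rate', so you can play it like this:

import IPython.display as ipd

# Row-first indexing reads a single example and decodes only its audio file.
example = data['train'][0]['audio']

ipd.Audio(example['array'], rate=example['sampling_rate'])

# Column-first indexing (data['train']['audio']) would decode every audio file
# in the split into RAM before you could even take [0], which is what crashes the notebook.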
