How to load this simple audio data set and use dataset.map without memory issues?

simeneide · December 9, 2024, 10:39pm

Hey! I spent several days trying to understand this and kept running out of memory (OOM). Setting cache_file_name='test' was also brittle, because map would reuse that cache file no matter what the fingerprint was.

It seems that datasets.Dataset.from_dict() doesn't create any cache files, so I had to save the IDs to a CSV file and then load it with the CSV loader (which does seem to have some cache functionality):

    import datasets
    import pandas as pd
    pd.DataFrame({'id': folders}).to_csv("file.csv", index=False)  # folders: list of ids
    ds_ids = datasets.Dataset.from_csv("file.csv")
