My dataset is eating too much space
I want to remove duplicates and obsolete datasets that I no longer use
How do I do that?
I checked the PC_user/.cache directory and found that some datasets were downloaded redundantly. Inside huggingface/datasets/downloads/extracted, something that looks like the main body of librispeech is saved, but I could not find the others (yelp_review_full, squad, etc.). Instead, there was a huge amount of cache files lined up. If possible, I would like to wipe these out as well.
In .cache/huggingface/datasets you can delete all the datasets that you no longer use (they are stored as Arrow files inside directories named after the datasets you used).
In .cache/huggingface/datasets/downloads you can also remove the raw data files that were downloaded to generate the Arrow datasets.
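A minimal sketch of the manual cleanup described above. It assumes the default ~/.cache/huggingface/datasets layout; the dataset names (librispeech_asr, yelp_review_full) are just examples, and the commands are demonstrated here on a throwaway copy built under mktemp so you can try them safely before touching the real cache.

```shell
# Throwaway copy of the default cache layout (replace CACHE with
# "$HOME/.cache/huggingface/datasets" for the real cleanup).
CACHE=$(mktemp -d)/huggingface/datasets
mkdir -p "$CACHE/librispeech_asr" "$CACHE/yelp_review_full" "$CACHE/downloads/extracted"

# 1. Check what is using the space before deleting anything.
du -sh "$CACHE"/* | sort -h

# 2. Remove a dataset directory you no longer use
#    (the Arrow files live inside it).
rm -rf "$CACHE/yelp_review_full"

# 3. Remove the raw downloads that were used to build the Arrow files.
rm -rf "$CACHE/downloads"

# Only the dataset directories you kept remain.
ls "$CACHE"
```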
Should I do it manually? Will that cause any problems?
First, if you want to keep librispeech or any other audio dataset, locate the folder containing the audio files in downloads/extracted. You will want to keep that one for librispeech.
Other than that, you can remove the rest.
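Before wiping downloads/extracted, you can list which top-level folders actually contain audio so you know what to keep. A hedged sketch, again run on a throwaway tree: the folder names (abc123, def456) are hypothetical stand-ins for the hashed directory names the cache uses, and it assumes audio files end in .flac (as librispeech's do) or .wav; point EXTRACTED at ~/.cache/huggingface/datasets/downloads/extracted for the real run.

```shell
# Throwaway stand-in for downloads/extracted with one audio folder
# and one non-audio folder (names are hypothetical hashed dirs).
EXTRACTED=$(mktemp -d)
mkdir -p "$EXTRACTED/abc123/LibriSpeech/dev-clean" "$EXTRACTED/def456/text_data"
touch "$EXTRACTED/abc123/LibriSpeech/dev-clean/sample.flac"
touch "$EXTRACTED/def456/text_data/train.json"

# Top-level extracted folders that contain audio files: keep these,
# everything else under extracted/ can be removed.
AUDIO_DIRS=$(find "$EXTRACTED" -type f \( -name '*.flac' -o -name '*.wav' \) \
  | sed "s|$EXTRACTED/||" | cut -d/ -f1 | sort -u)
echo "$AUDIO_DIRS"
```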