Re: lazy data loading. After a talk with @valhalla, there is actually no need for special code to run "high-resource" language models. The `datasets` library never loads the whole dataset into RAM when applying the `.map()` function. It only loads `writer_batch_size` samples into RAM at a time when using `.map()` (see docs here) and then saves the mapped batch to disk. You can increase or decrease the `writer_batch_size` argument of `.map(...)` to best fit your needs.
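A minimal sketch of what this could look like (the dataset name, column name, and preprocessing function below are placeholders, not from the original post):

```python
from datasets import load_dataset

# Placeholder dataset; substitute your own.
dataset = load_dataset("common_voice", "tr", split="train")

def preprocess(batch):
    # Placeholder preprocessing; replace with your own feature extraction.
    batch["input_length"] = [len(s) for s in batch["sentence"]]
    return batch

# Only `writer_batch_size` processed samples are buffered in RAM before the
# batch is flushed to the on-disk cache. Lower it to reduce RAM usage,
# raise it to reduce the number of disk writes.
dataset = dataset.map(
    preprocess,
    batched=True,
    batch_size=100,          # how many samples the function sees per call
    writer_batch_size=100,   # how many processed samples are kept in RAM
)
```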
This means that every `.map(...)` call saves a significant amount of data to disk: if you call `.map(...)` three times on a dataset of size 100GB, it will cache 300GB of data.
Therefore you can do two things to reduce the required amount of hard drive storage (a short sketch of both follows the list):

- Remove the cache regularly. This can be as easy as running `rm -r ~/.cache/huggingface/datasets` to remove all cached datasets, or making use of this convenient function: https://huggingface.co/docs/datasets/package_reference/main_classes.html?highlight=cache#datasets.Dataset.cleanup_cache_files which only removes the cache files of a specific dataset.
- Try to use as few `.map(...)` operations as possible.
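A rough sketch of both ideas together (again, the dataset, columns, and helper functions are only illustrative assumptions):

```python
from datasets import load_dataset

dataset = load_dataset("common_voice", "tr", split="train")  # placeholder dataset

def lowercase(batch):
    batch["text"] = [s.lower() for s in batch["sentence"]]
    return batch

def count_chars(batch):
    batch["num_chars"] = [len(t) for t in batch["text"]]
    return batch

def combined(batch):
    # Doing both steps inside one function means only one cached copy of the
    # dataset is written, instead of one copy per .map(...) call.
    batch = lowercase(batch)
    batch = count_chars(batch)
    return batch

dataset = dataset.map(combined, batched=True)

# Remove stale cache files for this specific dataset only
# (the currently used cache file is kept).
removed = dataset.cleanup_cache_files()
print(f"Removed {removed} cache file(s)")
```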