German ASR: Fine-Tuning Wav2Vec2

Re: lazy data loading. After a talk with @valhalla, there is actually no need for special code to handle "high-resource" languages. The datasets library never loads the whole dataset into RAM when applying the .map() function. It only loads writer_batch_size samples into RAM at a time when using .map() (see the .map() docs) and then saves the mapped batch to disk. You can increase or decrease the writer_batch_size argument of .map(...) to best fit your needs.
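As a rough sketch of how this looks in practice (the dataset name, config, and preprocessing function below are placeholders, not from the original post), writer_batch_size is simply passed to .map(...):

```python
from datasets import load_dataset

# Placeholder dataset/config; substitute whatever corpus you are fine-tuning on.
common_voice = load_dataset("common_voice", "de", split="train")

def prepare_sample(batch):
    # Placeholder preprocessing step (e.g. resampling audio or tokenizing text).
    batch["input_length"] = len(batch["sentence"])
    return batch

# Only `writer_batch_size` mapped samples are kept in RAM at a time before the
# batch is flushed to the on-disk Arrow cache; lower it to use less memory,
# raise it to reduce the number of disk writes.
common_voice = common_voice.map(prepare_sample, writer_batch_size=100)
```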

This means that every .map(...) call writes a significant amount of data to disk: if you apply .map(...) three times to a dataset of 100 GB, it will cache roughly 300 GB of data.

Therefore you can do two things to reduce the required amount of hard drive storage:

  1. Remove the cache regularly. This can be as easy as running rm -r ~/.cache/huggingface/datasets to remove all cached datasets, or you can use this convenient function: https://huggingface.co/docs/datasets/package_reference/main_classes.html?highlight=cache#datasets.Dataset.cleanup_cache_files, which removes only the cache files of a specific dataset (see the sketch after this list).

  2. Try to use as few .map(...) operations as possible, e.g. by combining several processing steps into a single mapped function (also shown in the sketch below).
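As a minimal sketch of both points, assuming a dataset already loaded as common_voice (the preprocessing functions are made-up placeholders): several steps can be folded into one .map(...) call so only one processed copy gets cached, and cleanup_cache_files() can then drop cache files that are no longer needed:

```python
def resample_audio(batch):
    # Placeholder: e.g. resample the audio to 16 kHz.
    return batch

def extract_features(batch):
    # Placeholder: e.g. run the feature extractor / tokenizer on the batch.
    return batch

def prepare(batch):
    # Tip 2: combine both steps into one function so that a single .map(...)
    # call writes only one processed copy of the dataset to disk.
    batch = resample_audio(batch)
    batch = extract_features(batch)
    return batch

common_voice = common_voice.map(prepare, writer_batch_size=100)

# Tip 1: remove cache files of this specific dataset that are no longer
# referenced, instead of wiping ~/.cache/huggingface/datasets entirely.
num_removed = common_voice.cleanup_cache_files()
print(f"Removed {num_removed} cache file(s)")
```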