German ASR: Fine-Tuning Wav2Vec2

patrickvonplaten · March 18, 2021, 7:46am

Hey everybody,

I’m planning on fine-tuning XLSR-Wav2Vec2 in German. I’d be happy about any tips or collaborations with the community.

burrjh · March 18, 2021, 8:53am

Hi Patrick!

I will work on the German dataset, too, and would very much appreciate discussions and collaborations, because, to be honest, I have never done it before.

stefan-it · March 18, 2021, 9:40am

Hi Patrick,

The German dataset is very resource-intensive: one step before the actual fine-tuning/training, the complete dataset folder has a size of 830GB. However, 64GB of RAM seems to be sufficient for all of the pre-processing.

patrickvonplaten · March 18, 2021, 12:26pm

That’s a very good point, @stefan-it! We might then have to lazily load the dataset into memory for each batch.

patrickvonplaten · March 18, 2021, 12:27pm

I’ll try to adapt the colab slightly to show how such a use case could work!

stefan-it · March 18, 2021, 6:29pm

@patrickvonplaten that would be awesome. I just found out that the

import librosa
import numpy as np

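def resample(batch):
    # Downsample from 48 kHz to the 16 kHz that Wav2Vec2 expects.
    # Newer librosa releases require orig_sr/target_sr as keyword arguments.
    batch["speech"] = librosa.resample(np.asarray(batch["speech"]), orig_sr=48_000, target_sr=16_000)
    batch["sampling_rate"] = 16_000
    return batch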

common_voice_train = common_voice_train.map(resample, num_proc=1)

was not really working. It takes a very (!) long time: even after 8 to 10 hours it hasn’t finished (and I’m using a very fast CPU).

I could test a fix, and fortunately I’ve found an 8TB HDD under the desk.

patrickvonplaten · March 18, 2021, 8:53pm

Yes, the resampling takes a lot of time! I think torchaudio’s resampler (torchaudio.transforms.Resample) is actually faster. You might want to use this script by @valhalla instead: transformers/run_common_voice.py at dcebe254fadfe142b6f0d6301cc8a875dca7d603 · huggingface/transformers · GitHub

Luckily we only need to do this step once. Afterward, we can save the dataset with: Main classes — datasets 1.5.0 documentation

patrickvonplaten · March 19, 2021, 9:21am

Re: lazy data loading. After a talk with @valhalla, it turns out there is actually no need for special code to run “high-resource” language models. The datasets library never loads the whole dataset into RAM when applying the .map() function. It only loads writer_batch_size samples into RAM at a time (see docs here) and then saves the mapped batch to disk. You can increase or decrease the writer_batch_size argument of .map(...) to best fit your needs.
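
To make the mechanism concrete, here is a purely conceptual sketch of batch-wise mapping with a bounded buffer. It is not the datasets implementation: map_to_disk, the JSON-lines output, and the toy records are made up for illustration, and only the idea of a buffer of writer_batch_size rows comes from the post above.

```python
import json
import os
import tempfile


def map_to_disk(records, fn, out_path, writer_batch_size=1000):
    """Apply fn to each record, flushing to disk every writer_batch_size rows.

    Hypothetical helper that mimics the idea described above, not the actual
    datasets internals. Returns the largest number of rows held in RAM at once.
    """
    buffer, peak = [], 0
    with open(out_path, "w", encoding="utf-8") as out:
        for record in records:
            buffer.append(fn(record))
            peak = max(peak, len(buffer))
            if len(buffer) >= writer_batch_size:
                out.writelines(json.dumps(row) + "\n" for row in buffer)
                buffer.clear()
        # Flush whatever is left in the last, partially filled batch.
        out.writelines(json.dumps(row) + "\n" for row in buffer)
    return peak


# Toy usage: 10 fake samples, at most 4 mapped rows in memory at any time.
records = ({"id": i, "speech": [0.0] * 8} for i in range(10))
out_path = os.path.join(tempfile.mkdtemp(), "mapped.jsonl")
peak_rows = map_to_disk(
    records,
    lambda r: {**r, "n": len(r["speech"])},
    out_path,
    writer_batch_size=4,
)
with open(out_path, encoding="utf-8") as f:
    rows_on_disk = sum(1 for _ in f)
```

With the real library, the corresponding knob is the writer_batch_size argument of .map(...) mentioned above; a smaller value trades speed for a lower peak RAM footprint.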

This means that every .map(...) call saves a significant amount of data to disk: if you call .map(...) three times on a dataset of size 100GB, it will cache 300GB of data.

Therefore you can do two things to reduce the required amount of hard drive storage:

Remove the cache regularly. This can be as easy as running rm -r ~/.cache/huggingface/datasets to remove all cached datasets, or using this convenient function, which only removes a dataset-specific cache: https://huggingface.co/docs/datasets/package_reference/main_classes.html?highlight=cache#datasets.Dataset.cleanup_cache_files
Try to use as few .map(...) operations as possible.
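
One way to follow the second tip is to chain the individual preprocessing functions into a single function, so that only one .map(...) call (and therefore one cache file) is needed. The sketch below is only an illustration: compose, lowercase_text, and add_length are hypothetical names, and the example runs on a plain dict rather than a real dataset.

```python
def compose(*steps):
    """Chain several per-example functions into one function."""
    def combined(batch):
        for step in steps:
            batch = step(batch)
        return batch
    return combined


# Hypothetical preprocessing steps; each takes and returns one example dict.
def lowercase_text(batch):
    batch["text"] = batch["text"].lower()
    return batch


def add_length(batch):
    batch["num_samples"] = len(batch["speech"])
    return batch


prepare = compose(lowercase_text, add_length)
example = prepare({"text": "Guten Morgen", "speech": [0.1, 0.2, 0.3]})

# With a datasets.Dataset this would be a single call instead of one per step:
#   dataset = dataset.map(prepare)
```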

burrjh · March 24, 2021, 8:31pm

Hi there,

I ran into a pretty basic problem when downloading the data.

Every time, both in Google Colab and on OVHCloud, the download stops at exactly 79%.
I guess it runs out of memory? But I don’t know how to fix it.

Can anybody point me to a solution?

marcel · March 24, 2021, 9:41pm

I switched to OVH for this reason…

marcel · March 25, 2021, 1:34am

On OVHCloud you have to add additional volumes. This is what currently works for me:

ovhai job run \
    --gpu 1 \
    --name hf-wav2vec-de02 \
    --volume data@GRA/de:/workspace/data/de:RW:cache \
    --volume output_models@GRA/de:/workspace/ouptut_models/de:RW:cache \
    --volume dev@GRA:/workspace/my_dev_bucket:RW:cache \
    --volume cache@GRA:/workspace/.cache:RW:cache \
    databuzzword/hf-wav2vec

maxidl · March 25, 2021, 10:04am

Hi everyone,
I joined your efforts today.
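Regarding the large disk space consumption, I found that after this step:

import torchaudio

# The clips are resampled from 48 kHz to the 16 kHz that Wav2Vec2 expects.
resampler = torchaudio.transforms.Resample(48_000, 16_000)

# Preprocessing the datasets.
# We need to read the audio files as arrays and tokenize the targets.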
def speech_file_to_array_fn(batch):
    speech_array, sampling_rate = torchaudio.load(batch["path"])
    batch["speech"] = resampler(speech_array).squeeze().numpy()
    batch["sampling_rate"] = 16_000
    batch["target_text"] = batch["text"]
    return batch

train_dataset = train_dataset.map(
    speech_file_to_array_fn,
    remove_columns=train_dataset.column_names,
    num_proc=data_args.preprocessing_num_workers,
)
eval_dataset = eval_dataset.map(
    speech_file_to_array_fn,
    remove_columns=eval_dataset.column_names,
    num_proc=data_args.preprocessing_num_workers,
)

although the batch["speech"] numpy arrays are of type fp32, the Arrow table train_dataset.data reports that the values are doubles. I think this is because the array is converted to a Python List[float] somewhere before it is added to the Arrow table, so the inferred type becomes fp64.
Thus, the uncompressed size is potentially much larger than that of the compressed mp3 files or of uncompressed fp32 data.
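
As a rough back-of-the-envelope check of what that doubling means on disk, the sketch below computes the raw size of mono audio stored as one number per sample. The helper name raw_audio_bytes and the 100-hour figure are placeholders chosen for illustration, not measurements of the German Common Voice data.

```python
def raw_audio_bytes(hours, sample_rate=16_000, bytes_per_sample=4):
    """Size of mono audio stored as one number per sample, without compression."""
    return int(hours * 3600 * sample_rate * bytes_per_sample)


HOURS = 100  # placeholder, not the size of the German Common Voice split

fp32_bytes = raw_audio_bytes(HOURS, bytes_per_sample=4)  # float32 samples
fp64_bytes = raw_audio_bytes(HOURS, bytes_per_sample=8)  # float64 ("double") samples
print(f"{HOURS} h at 16 kHz: {fp32_bytes / 1e9:.1f} GB as fp32, {fp64_bytes / 1e9:.1f} GB as fp64")
```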

Additionally, this step does not seem to run for me at all when using multiple workers.

I’m looking to replace this step by saving raw tensors in fp32 to disk and then using a custom dataset during training.

maxidl · March 25, 2021, 7:31pm

If anyone else faces a long delay before the training starts: disable group_by_length.

In trainer_pt_utils.py, line 506, this is called:
lengths = [len(feature[self.model_input_name]) for feature in dataset]
This iterates over the whole dataset without using dataloader workers, which took a pretty long time in my case.

In the long run it might be worth submitting a pull request to change this behavior, or to add a flag that enables getting the lengths using a dataloader?
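
Another possible workaround, sketched here under assumptions, is to compute each example's length once during preprocessing and store it as a small extra column, so nothing has to load every audio array just to call len() on it. The names add_input_length and "length" are hypothetical, and the toy list only stands in for a real dataset. Some transformers releases appear to offer a length_column_name training argument for this purpose, but treat that as something to verify against your installed version.

```python
def add_input_length(batch, input_column="speech", length_column="length"):
    """Store the number of samples next to the audio so it is computed only once."""
    batch[length_column] = len(batch[input_column])
    return batch


# Toy stand-in for a preprocessed dataset (a real one would be a datasets.Dataset).
dataset = [
    {"speech": [0.0] * 5},
    {"speech": [0.0] * 2},
    {"speech": [0.0] * 9},
]
dataset = [add_input_length(example) for example in dataset]

# Reading the integer column is cheap compared with touching every audio array.
lengths = [example["length"] for example in dataset]
longest_first = sorted(range(len(lengths)), key=lengths.__getitem__, reverse=True)
```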

marcel · March 25, 2021, 8:28pm

Has anyone had any success? Here is my training on 3%: marcel/wav2vec2-large-xlsr-german-demo · Hugging Face

maxidl · March 26, 2021, 9:42am

No success for me yet.
Since I already shared my repo on slack, here is a link: GitHub - maxidl/wav2vec2

Currently, I’m running the preprocessing part once again after fixing a bug (the processor is not deterministic across runs, so I now save it to disk in the preprocessing step). In ~4 hours I can hopefully start training some models.

maxidl · March 26, 2021, 8:21pm

So I was stuck the whole day because of segmentation faults until I found this issue:
https://github.com/pytorch/pytorch/issues/54752

TLDR: do not use torch.set_num_threads()

maxidl · March 28, 2021, 10:13pm

Since the sprint is nearing its end, I uploaded my latest model checkpoint to the Hub: maxidl/wav2vec2-large-xlsr-german · Hugging Face, which achieves a WER of 12.62%.

Based on the loss curves, the training has not converged yet, so it might be worth going for even longer training runs. For now, I only managed to get 50k steps, taking around 30 hours on a single A100. Unfortunately, I did not have great success with distributed training.
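
For readers unfamiliar with the metric: WER (word error rate) is the word-level edit distance between the reference transcript and the model output, divided by the number of reference words. The sketch below is a minimal single-sentence version with made-up sentences; it is not the evaluation script used for the number above, and a corpus-level WER would sum the errors and reference lengths over all utterances before dividing.

```python
def word_error_rate(reference, hypothesis):
    """Word-level Levenshtein distance divided by the number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] is the edit distance between the ref words seen so far and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


# Made-up sentences, not taken from Common Voice: one wrong word out of five.
wer = word_error_rate("ich gehe heute nach hause", "ich gehe heute nach haus")
```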

Elaben · February 18, 2022, 9:28am

Hello everyone,
I have a similar problem:
I followed the tutorial “Fine-Tune Wav2Vec2 for English ASR with Transformers”, but replaced the TIMIT dataset with LibriSpeech.
It turns out that the following code causes the whole corpus to be loaded into RAM:
timit = timit.map(prepare_dataset, remove_columns=timit.column_names["train"], num_proc=4)
Is there any way to work around this problem?
Note that I am working on a local server with 200 GB of RAM and a GeForce RTX 3090 GPU with 24 GB.
Thanks in advance for your help.
