German ASR: Fine-Tuning Wav2Vec2

Hey everybody,

I’m planning on fine-tuning XLSR-Wav2Vec2 on German. I’m happy about any tips / collaborations with the community :slight_smile:

Hi Patrick!

I will work on the German dataset, too, and would very much appreciate discussions and collaboration, because - to be honest - I have never done this before.

Hi Patrick,

The German dataset is very resource intensive: one step before the actual fine-tuning/training, the complete dataset folder already has a size of 830 GB. However, 64 GB of RAM seems to be sufficient for the whole pre-processing.

That’s a very good point, @stefan-it! We might have to lazy-load the dataset into memory for each batch then.

I’ll try to adapt the Colab slightly to show how such a use case could work!

@patrickvonplaten that would be awesome. I just found out that the following

import librosa
import numpy as np

def resample(batch):
    # downsample the 48 kHz Common Voice audio to the 16 kHz expected by Wav2Vec2
    batch["speech"] = librosa.resample(np.asarray(batch["speech"]), orig_sr=48_000, target_sr=16_000)
    batch["sampling_rate"] = 16_000
    return batch

common_voice_train = common_voice_train.map(resample, num_proc=1)

was not really working. It takes very (!) long: even after 8-10 hours it hasn’t finished (and I’m using a very fast CPU).

I could test a fix - and fortunately I’ve found an 8 TB HDD under the desk :sweat_smile:

Yes, the resampling takes a lot of time! I think torchaudio’s resampling (torchaudio.transforms.Resample) actually works faster. You might want to use this script by @valhalla instead: transformers/run_common_voice.py at dcebe254fadfe142b6f0d6301cc8a875dca7d603 · huggingface/transformers · GitHub

Luckily we only need to do this step once. Afterward, we can save the processed dataset to disk with Dataset.save_to_disk (see Main classes — datasets 1.5.0 documentation).
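
For illustration, here is a rough sketch of how the torchaudio resampling plus saving/reloading could look. Variable names and paths are just placeholders, and it assumes batch["speech"] already holds the raw 48 kHz samples as in the snippet above:

import torch
import torchaudio
from datasets import load_from_disk

# torchaudio's Resample transform is typically faster than librosa.resample
resampler = torchaudio.transforms.Resample(orig_freq=48_000, new_freq=16_000)

def resample(batch):
    batch["speech"] = resampler(torch.tensor(batch["speech"])).numpy()
    batch["sampling_rate"] = 16_000
    return batch

common_voice_train = common_voice_train.map(resample)

# persist the result so the expensive step only runs once
common_voice_train.save_to_disk("./common_voice_train_16khz")
# ...later, e.g. in the training script:
common_voice_train = load_from_disk("./common_voice_train_16khz")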

Re: lazy data loading. After a talk with @valhalla, there is actually no need for special code to handle “high-resource” languages. The datasets library never loads the whole dataset into RAM when applying the .map() function. It only loads writer_batch_size samples into RAM when using .map() (see the docs here) and then saves the mapped batch to disk. You can increase or decrease the writer_batch_size argument of .map(...) to best fit your needs.
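
As a concrete example (the value and the mapped function are just placeholders), writer_batch_size is simply passed to .map(...):

# flush processed examples to the on-disk cache every 100 samples
# instead of the default 1000, to keep peak RAM usage low
common_voice_train = common_voice_train.map(
    resample,
    writer_batch_size=100,
    num_proc=4,
)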

This means that every .map(...) call saves a significant amount of data to disk: if you apply .map(...) three times to a dataset of size 100 GB, it will cache 300 GB of data.

Therefore you can do two things to reduce the required amount of hard drive storage:

  1. Remove the cache regularly. This can be as easy as rm -r ~/.cache/huggingface/datasets to remove all cached datasets, or you can make use of this convenient function: https://huggingface.co/docs/datasets/package_reference/main_classes.html?highlight=cache#datasets.Dataset.cleanup_cache_files, which only removes the cache of a specific dataset (see the sketch after this list).

  2. Try to use as few .map(...) operations as possible.
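
A minimal sketch for point 1 (the dataset variable is a placeholder); for point 2, the idea is simply to do all preprocessing in a single function passed to one .map(...) call instead of chaining several maps:

# remove only the cache files that belong to this specific dataset
num_removed = common_voice_train.cleanup_cache_files()
print(f"removed {num_removed} cache files")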

Hi there,

I’m running into a pretty basic problem when downloading the data.

Every time, both in Google Colab and on OVHCloud, the download stops at exactly 79%.
I guess it runs out of memory? But I don’t know how to fix it.

Can anybody point me to a solution?

I switched to OVH for this reason…

On OVHCloud you have to add additional volumes… this is what currently works for me:

ovhai job run \
    --gpu 1 \
    --name hf-wav2vec-de02 \
    --volume data@GRA/de:/workspace/data/de:RW:cache \
    --volume output_models@GRA/de:/workspace/output_models/de:RW:cache \
    --volume dev@GRA:/workspace/my_dev_bucket:RW:cache \
    --volume cache@GRA:/workspace/.cache:RW:cache \
    databuzzword/hf-wav2vec

Hi everyone,
I joined your efforts today.
Regarding the large disk space consumption I found that after this step:

import torchaudio

# Preprocessing the datasets.
# We need to read the audio files as arrays and tokenize the targets.
# `resampler` converts the 48 kHz Common Voice audio down to 16 kHz.
resampler = torchaudio.transforms.Resample(orig_freq=48_000, new_freq=16_000)

def speech_file_to_array_fn(batch):
    speech_array, sampling_rate = torchaudio.load(batch["path"])
    batch["speech"] = resampler(speech_array).squeeze().numpy()
    batch["sampling_rate"] = 16_000
    batch["target_text"] = batch["text"]
    return batch

train_dataset = train_dataset.map(
    speech_file_to_array_fn,
    remove_columns=train_dataset.column_names,
    num_proc=data_args.preprocessing_num_workers,
)
eval_dataset = eval_dataset.map(
    speech_file_to_array_fn,
    remove_columns=eval_dataset.column_names,
    num_proc=data_args.preprocessing_num_workers,
)

although the batch["speech"] numpy arrays are of type fp32, the arrow table train_dataset.data reports that the values are doubles. I think this is because they are converted to a Python List[float] somewhere before being added to the arrow table, so the inferred type becomes fp64.
Thus, the uncompressed size is potentially much larger compared to compressed mp3 or uncompressed fp32.
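
If that is indeed the cause, a possible (untested) workaround would be to cast the column back to float32 after the map, e.g.:

from datasets import Sequence, Value

# cast the "speech" column from float64 to float32, roughly halving its on-disk size
features = train_dataset.features.copy()
features["speech"] = Sequence(Value("float32"))
train_dataset = train_dataset.cast(features)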

Additionally, this step does not seem to run for me at all when using multiple workers.

I’m looking to replace this step by saving raw tensors in fp32 to disk and then using a custom dataset during training.
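
In case it helps someone else, a rough sketch of that idea (file layout and names are made up):

import numpy as np
import torch

class SpeechDataset(torch.utils.data.Dataset):
    # loads pre-saved fp32 arrays lazily instead of keeping them in an arrow table
    def __init__(self, paths, targets):
        self.paths = paths        # list of .npy files written during preprocessing
        self.targets = targets    # list of target transcriptions

    def __len__(self):
        return len(self.paths)

    def __getitem__(self, i):
        speech = np.load(self.paths[i])   # array saved with np.save(path, arr.astype(np.float32))
        return {"speech": speech, "target_text": self.targets[i]}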

If anyone else faces a long delay before the training starts: disable group_by_length.

In trainer_pt_utils.py, line 506, this is called:
lengths = [len(feature[self.model_input_name]) for feature in dataset]
which iterates over the whole dataset without using dataloader workers. This took a pretty long time in my case.

In the long run it might be worth submitting a pull request to change this behavior, or adding a flag to get the lengths using a dataloader?
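
As a possible interim workaround (untested, and whether the Trainer picks it up depends on your transformers version), the lengths could be precomputed in parallel during the preprocessing .map and stored as a column:

# compute the input lengths once, with multiple workers, during preprocessing
def add_length(batch):
    batch["length"] = len(batch["speech"])
    return batch

train_dataset = train_dataset.map(add_length, num_proc=8)

# newer transformers versions can then pick this column up via
# TrainingArguments(..., group_by_length=True, length_column_name="length")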

Has anyone had any success? Here is my training on 3%: marcel/wav2vec2-large-xlsr-german-demo · Hugging Face

No success for me yet.
Since I already shared my repo on Slack, here is a link: GitHub - maxidl/wav2vec2

Currently, I’m running the preprocessing part once again, after fixing a bug (the processor is not deterministic across runs, so I now save it to disk in the preprocessing step). In ~4 hours I can hopefully start training some models.
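
For anyone hitting the same issue, saving and reloading the processor is straightforward (paths are placeholders):

from transformers import Wav2Vec2Processor

# at the end of preprocessing: persist the exact processor/vocab that was used
processor.save_pretrained("./processor")

# in the training run: load the identical processor again
processor = Wav2Vec2Processor.from_pretrained("./processor")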

So I was stuck the whole day because of segmentation faults, until I found this issue:
https://github.com/pytorch/pytorch/issues/54752

TLDR: do not use torch.set_num_threads()

Since the sprint is nearing the end, I uploaded my latest model checkpoint to the Hub: maxidl/wav2vec2-large-xlsr-german · Hugging Face achieving a WER of 12.62%.

Based on the loss curves the training has not converged yet, so it might be worth going for even longer training runs. For now, I only managed to run 50k steps, taking around 30 hours on a single A100. Unfortunately, I did not have great success with distributed training.

Hello everyone,
I have a similar problem:
I followed the tutorial “Fine-Tune Wav2Vec2 for English ASR with Transformers”, but replaced the TIMIT dataset with LibriSpeech.
It turns out that the following code causes the whole corpus to be put in RAM:
timit = timit.map(prepare_dataset, remove_columns=timit.column_names["train"], num_proc=4)
Is there any way to work around this problem?
Note that I am working on a local server with 200 GB of RAM and a GeForce RTX 3090 GPU with 24 GB.
Thanks in advance for your help