Hi everyone,
I joined your efforts today.
Regarding the large disk space consumption I found that after this step:
```python
# Preprocessing the datasets.
# We need to read the audio files as arrays and tokenize the targets.
def speech_file_to_array_fn(batch):
    speech_array, sampling_rate = torchaudio.load(batch["path"])
    batch["speech"] = resampler(speech_array).squeeze().numpy()
    batch["sampling_rate"] = 16_000
    batch["target_text"] = batch["text"]
    return batch

train_dataset = train_dataset.map(
    speech_file_to_array_fn,
    remove_columns=train_dataset.column_names,
    num_proc=data_args.preprocessing_num_workers,
)
eval_dataset = eval_dataset.map(
    speech_file_to_array_fn,
    remove_columns=eval_dataset.column_names,
    num_proc=data_args.preprocessing_num_workers,
)
```
although the `batch["speech"]` numpy arrays are float32, the Arrow table (`train_dataset.data`) reports the values as doubles. I suspect the arrays are converted to a Python `List[float]` somewhere before being added to the Arrow table, so the inferred type becomes float64.
As a result, the uncompressed cache is potentially much larger than it needs to be, both compared to the original compressed mp3 files and compared to the same data stored as float32.
Additionally, this step does not run at all for me when using multiple workers.
I'm looking to replace this step by saving raw fp32 tensors to disk and then using a custom dataset during training.
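For reference, a minimal sketch of what I have in mind. The file layout (one `.npy` file per utterance) and the class name are just assumptions for illustration, not anything from the original script:

```python
import numpy as np
import torch
from torch.utils.data import Dataset


class RawSpeechDataset(Dataset):
    """Loads pre-saved fp32 speech arrays from disk.

    Assumes each utterance was saved beforehand with np.save as a
    float32 array, so nothing is ever widened to float64.
    """

    def __init__(self, npy_paths, targets):
        self.npy_paths = npy_paths  # one .npy file per utterance
        self.targets = targets      # matching transcription strings

    def __len__(self):
        return len(self.npy_paths)

    def __getitem__(self, idx):
        # np.load returns the array with its on-disk dtype (float32 here)
        speech = np.load(self.npy_paths[idx])
        return {
            "speech": torch.from_numpy(speech),  # stays float32
            "target_text": self.targets[idx],
        }
```

The upside is that the on-disk format is exactly the fp32 tensors, with no Arrow type inference in the loop; the downside is giving up the caching and `num_proc` machinery of `datasets.map`.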