XLSR-Wav2Vec2 with punctuation

boris · April 26, 2021, 2:07pm

Hi,

I’ve been trying to fine-tune XLSR-Wav2Vec2 to predict the transcription together with the “relevant” punctuation (typically the punctuation is stripped out before training).

The idea is to get punctuation in an end-to-end manner, since the audio sample gives us additional hints to tell statements, questions and exclamations apart, instead of adding punctuation in a separate post-processing step.

The goal is to be able to dictate without saying “period”, “question mark”, etc., which is unnatural.

Here are my main steps:

I started from the transformers example run_common_voice
I use the CommonVoice English dataset, as it’s easier to preprocess than other languages
I use unidecode to preprocess the text, which makes a lot of smart changes → Málaga becomes Malaga, François becomes Francois, etc.
My regex of chars to remove is "()[\]_+/=%|` (it was tricky to write, the order here matters)
I keep a dict of resamplers, one per source sampling rate (since the clips are not all at 16,000 Hz)
I filter the samples by duration
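The text and duration steps above can be sketched roughly as follows. This is only a sketch under assumptions: the character set is my reading of the regex quoted in the post, the whitespace cleanup and the 1 to 10 second bounds are hypothetical, and the `unicodedata` fallback is a rough stand-in for unidecode (it only strips accents), not an equivalent. Resampling is left out because it depends on the audio library.

```python
import re
import unicodedata

try:
    from unidecode import unidecode  # the library mentioned in the post
except ImportError:
    def unidecode(text):
        # Rough stand-in (assumption): decompose characters and drop the
        # combining accents. The real unidecode handles many more cases.
        decomposed = unicodedata.normalize("NFKD", text)
        return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

# Characters to drop, as read from the post. Building the class with
# re.escape avoids having to worry about the order of the characters.
CHARS_TO_REMOVE = '"()[]_+/=%|`'
REMOVE_RE = re.compile("[" + re.escape(CHARS_TO_REMOVE) + "]")


def normalize_text(sentence):
    """Transliterate to ASCII and drop unwanted characters, keeping . , ? ! etc."""
    text = unidecode(sentence)
    text = REMOVE_RE.sub("", text)
    # Collapse whitespace left behind by removed characters (my addition).
    return re.sub(r"\s+", " ", text).strip()


def keep_by_duration(num_samples, sampling_rate, min_s=1.0, max_s=10.0):
    """Return True if the clip length falls inside the (hypothetical) bounds."""
    duration = num_samples / sampling_rate
    return min_s <= duration <= max_s


if __name__ == "__main__":
    print(normalize_text("François visited Málaga (Spain), didn't he?"))
    print(keep_by_duration(num_samples=48_000 * 5, sampling_rate=48_000))
```

Sentence punctuation and apostrophes survive this cleanup, which is the point: they stay in the labels so the model can learn to emit them.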

I’m not sure whether the WER metric should be adapted. Maybe I should add a separator between words and punctuation marks, but based on the way WER is calculated, I feel like it should decrease regardless.

So far my training loss decreases (when using the full dataset it goes to NaN, probably due to some corrupted examples), but the WER stays at 1. When testing after a long run, I just get an empty output.
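To make the separator question concrete, here is a small sketch that scores the same prediction twice: once with punctuation glued to the words and once with each mark split off as its own token. The `wer` and `split_punctuation` helpers are hypothetical, written in plain Python for illustration; this is not the metric implementation used by the training script.

```python
import re


def edit_distance(ref, hyp):
    """Word-level Levenshtein distance between two token lists."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        cur = [i]
        for j, h in enumerate(hyp, 1):
            cur.append(min(prev[j] + 1,              # deletion
                           cur[j - 1] + 1,           # insertion
                           prev[j - 1] + (r != h)))  # substitution or match
        prev = cur
    return prev[-1]


def wer(reference, hypothesis):
    """Word error rate: edit distance divided by the reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    return edit_distance(ref, hyp) / len(ref)


def split_punctuation(text):
    """Put spaces around . , ! ? so each mark counts as a separate token."""
    return " ".join(re.sub(r"([.,!?])", r" \1 ", text).split())


if __name__ == "__main__":
    reference = "hello, how are you?"
    hypothesis = "hello how are you"  # words right, punctuation missing

    glued = wer(reference, hypothesis)
    separated = wer(split_punctuation(reference), split_punctuation(hypothesis))
    print(f"glued: {glued:.3f}  separated: {separated:.3f}")
```

In this example the glued score is 0.5, because a missing comma turns "hello," into a whole-word substitution, while the separated score is about 0.333, because only the two punctuation tokens count as errors. Either way the number should still go down as the predictions improve.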
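One way to read these two symptoms together, assuming the usual CTC head used for Wav2Vec2 fine-tuning: if the model emits only the blank token, greedy decoding yields an empty string, and every reference word then counts as a deletion, which gives a WER of exactly 1. The decoder below is a hypothetical plain-Python illustration with a toy vocabulary, not the processor's actual decode method.

```python
def ctc_greedy_decode(ids, id_to_char, blank_id=0):
    """Collapse repeated ids, then drop blanks (standard greedy CTC decoding)."""
    chars = []
    prev = None
    for i in ids:
        if i != prev and i != blank_id:
            chars.append(id_to_char[i])
        prev = i
    return "".join(chars)


if __name__ == "__main__":
    vocab = {1: "a", 2: "b"}  # toy vocabulary, id 0 is the blank token
    print(repr(ctc_greedy_decode([1, 1, 0, 1, 2], vocab)))  # normal prediction
    print(repr(ctc_greedy_decode([0, 0, 0, 0, 0], vocab)))  # all blanks
```

Checking whether the predicted ids are all equal to the blank (pad) id would confirm or rule out this explanation.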

To reproduce:

clone this repo
python run_common_voice.py --dataset_config_name en --output_dir ./model --overwrite_output_dir --model_name_or_path facebook/wav2vec2-large-xlsr-53 --num_train_epochs 3 --per_device_train_batch_size 16 --evaluation_strategy epoch --fp16 --freeze_feature_extractor --group_by_length --gradient_checkpointing --do_train --do_eval --save_total_limit 1 --logging_steps 100 --warmup_steps 500 --load_best_model_at_end --metric_for_best_model wer --greater_is_better False --gradient_accumulation 2 --activation_dropout 0.055 --attention_dropout 0.094 --feat_proj_dropout 0.04 --hidden_dropout 0.047 --layerdrop 0.041 --learning_rate 0.000234 --mask_time_prob 0.082 --per_device_eval_batch_size 8

Feel free to give any suggestions. I’ll update if I get more interesting results.

DinBav · October 12, 2022, 5:24pm

Hi, how did you preprocess the punctuation?
