XLSR-Wav2Vec2 with punctuation

Hi,

I’ve been trying to train XLSR-Wav2Vec2 to predict transcription + “relevant” punctuation (typically we don’t keep the punctuation).

The idea is to get punctuation end to end, since the audio gives us additional hints for telling statements, questions and exclamations apart, rather than adding punctuation in a separate post-processing step.

The goal is to be able to speak without saying “period”, “question mark”, etc… which is unnatural.

Here are my main steps:

  • I started from the transformers example run_common_voice
  • I use the CommonVoice English dataset as it’s easier to preprocess than other languages
  • I use unidecode to preprocess the text, which does a lot of smart changes → Málaga becomes Malaga, François becomes Francois, etc.
  • my regex of chars to remove is "()[\]_+/=%|` (was tricky to create, the order here matters)
  • I have a dict of resamplers, one per source sampling rate (the clips are not all at 16,000 Hz)
  • I filter by duration
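For readers who want to try the text and duration steps above without the full script, here is a minimal sketch. It is not the code from the repo: `strip_accents` is a standard-library stand-in for unidecode (it only handles accented letters, which unidecode goes well beyond), the character class is my reading of the regex quoted above, and the function names and duration bounds are hypothetical.

```python
import re
import unicodedata

# Characters to drop, written as a regex character class.
# Assumption: this is the set quoted in the post: " ( ) [ ] _ + / = % | `
CHARS_TO_REMOVE = re.compile(r'["()\[\]_+/=%|`]')


def strip_accents(text):
    """Stdlib stand-in for unidecode: decompose, then drop combining marks."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def clean_sentence(text):
    """Normalize accents, remove unwanted characters, keep . , ? ! etc."""
    text = strip_accents(text)
    text = CHARS_TO_REMOVE.sub("", text)
    # Collapse whitespace left behind by the removed characters.
    return re.sub(r"\s+", " ", text).strip()


def keep_by_duration(num_samples, sampling_rate, min_s=1.0, max_s=10.0):
    """Return True if the clip length is within [min_s, max_s] seconds.

    The bounds are placeholders, not the values used in the post.
    """
    duration = num_samples / sampling_rate
    return min_s <= duration <= max_s


if __name__ == "__main__":
    print(clean_sentence("Málaga (Spain)?"))
    print(keep_by_duration(num_samples=80_000, sampling_rate=16_000))
```

Punctuation such as `?`, `!`, `.` and `,` is deliberately left out of the removal set, since the goal is for the model to predict it.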

I’m not sure whether the WER metric should be adapted. Maybe I should add a separator between words and punctuation, but based on the way WER is calculated, I feel like it should decrease regardless.
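To make the separator question concrete, here is a small self-contained sketch of word-level WER (token edit distance divided by the number of reference tokens) with an optional switch that splits punctuation into its own tokens. This is an illustration only, not the metric implementation used by the training script; `tokenize`, `word_error_rate` and the `separate_punctuation` flag are hypothetical names, and only `. , ? !` are treated as punctuation.

```python
import re


def tokenize(text, separate_punctuation=False):
    """Split on whitespace; optionally make each punctuation mark a token."""
    if separate_punctuation:
        # Put spaces around . , ? ! so "you?" becomes "you" and "?".
        text = re.sub(r"([.,?!])", r" \1 ", text)
    return text.split()


def word_error_rate(reference, hypothesis, separate_punctuation=False):
    """Token-level Levenshtein distance divided by reference length."""
    ref = tokenize(reference, separate_punctuation)
    hyp = tokenize(hypothesis, separate_punctuation)
    if not ref:
        raise ValueError("reference must contain at least one token")

    # prev[j] = edit distance between the reference prefix so far and hyp[:j]
    prev = list(range(len(hyp) + 1))
    for i, ref_tok in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_tok in enumerate(hyp, start=1):
            cost = 0 if ref_tok == hyp_tok else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    ref, hyp = "how are you?", "how are you"
    print(word_error_rate(ref, hyp))                             # attached
    print(word_error_rate(ref, hyp, separate_punctuation=True))  # separated
```

With punctuation attached, a missing question mark turns "you?" into a whole-word substitution (1 error out of 3 tokens); with punctuation separated it is a single deletion out of 4 tokens. Either way, an empty hypothesis scores exactly 1.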

So far my training loss decreases (on the full dataset it goes to nan, probably due to some corrupted examples), but the WER stays at 1. When I test the model from a long run, I just get an empty output.
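One way to test the corrupted-examples guess is to screen each example before training. The sketch below is a guess at useful checks, not something taken from the script: it rejects empty or non-finite audio, empty transcripts, and clips that are probably too short for their transcript (CTC loss becomes infinite when there are fewer output frames than target characters). The name `is_usable_example` is hypothetical, and the stride of 320 samples per output frame is an approximation for Wav2Vec2 at 16 kHz.

```python
import math


def is_usable_example(samples, transcript, samples_per_frame=320):
    """Return True if an (audio, text) pair looks safe to train on.

    samples: sequence of floats (the waveform, already resampled).
    transcript: the target text after preprocessing.
    samples_per_frame: assumed audio samples per model output frame.
    """
    if len(samples) == 0 or len(transcript) == 0:
        return False
    # NaN or inf in the waveform will propagate into the loss.
    if not all(math.isfinite(x) for x in samples):
        return False
    # CTC needs at least as many output frames as target characters.
    approx_frames = len(samples) // samples_per_frame
    return approx_frames >= len(transcript)


if __name__ == "__main__":
    one_second = [0.0] * 16_000
    print(is_usable_example(one_second, "hello"))
    print(is_usable_example([0.1] * 640, "hello"))
```

If the loss still goes to nan after filtering like this, the cause is more likely elsewhere (for example fp16 overflow or the learning rate) than in the data.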

To reproduce:

  • clone this repo
  • python run_common_voice.py --dataset_config_name en --output_dir ./model --overwrite_output_dir --model_name_or_path facebook/wav2vec2-large-xlsr-53 --num_train_epochs 3 --per_device_train_batch_size 16 --evaluation_strategy epoch --fp16 --freeze_feature_extractor --group_by_length --gradient_checkpointing --do_train --do_eval --save_total_limit 1 --logging_steps 100 --warmup_steps 500 --load_best_model_at_end --metric_for_best_model wer --greater_is_better False --gradient_accumulation 2 --activation_dropout 0.055 --attention_dropout 0.094 --feat_proj_dropout 0.04 --hidden_dropout 0.047 --layerdrop 0.041 --learning_rate 0.000234 --mask_time_prob 0.082 --per_device_eval_batch_size 8

Feel free to give any suggestions. I’ll update if I get more interesting results.

Hi, how did you preprocess the punctuation?