I’ve been trying to train XLSR-Wav2Vec2 to predict transcription + “relevant” punctuation (typically we don’t keep the punctuation).
The idea was to get punctuation in an end-to-end manner, since the audio itself gives additional hints to differentiate between statements, questions and exclamations, rather than doing an extra post-processing step.
The goal is to be able to speak without saying “period”, “question mark”, etc., which is unnatural.
Here are my main steps:
- I started from the transformers example script
- I use the CommonVoice English dataset as it’s easier to preprocess than other languages
- I use `unidecode` to preprocess the text, which does a lot of smart conversions → Málaga becomes Malaga, François becomes Francois, etc
- my regex of chars to remove is `` "()[\]_+/=%|` `` (it was tricky to create; the order of the characters matters)
- I have a dict of resamplers keyed by sampling rate (since the clips aren’t all at 16 kHz)
- I filter by duration
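The text-cleaning steps above could look roughly like this (a sketch: `clean_text` and the whitespace collapsing are my additions, and `unidecode` is the pip package of the same name):

```python
import re
from unidecode import unidecode  # pip install unidecode

# Character class built from the removal set quoted above;
# `]` is escaped so it doesn't close the class early.
CHARS_TO_REMOVE = re.compile(r'["()\[\]_+/=%|`]')

def clean_text(text: str) -> str:
    # Transliterate accents/diacritics: Málaga -> Malaga, François -> Francois
    text = unidecode(text)
    # Drop the unwanted characters
    text = CHARS_TO_REMOVE.sub("", text)
    # Collapse any extra whitespace the removals left behind
    return re.sub(r"\s+", " ", text).strip()

print(clean_text('Málaga ("François")'))  # Malaga Francois
```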
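The resampler dict and the duration filter could be sketched like this (function names and the 1 s / 15 s bounds are my assumptions; the resampling uses `torchaudio.transforms.Resample` as in the example script):

```python
TARGET_SR = 16_000
_resamplers = {}

def get_resampler(orig_sr):
    # Build one Resample transform per source rate and cache it;
    # Common Voice clips ship at various rates (e.g. 48 kHz MP3s)
    if orig_sr not in _resamplers:
        import torchaudio  # imported lazily so the filter below has no heavy deps
        _resamplers[orig_sr] = torchaudio.transforms.Resample(orig_sr, TARGET_SR)
    return _resamplers[orig_sr]

def keep_by_duration(num_samples, sr, min_s=1.0, max_s=15.0):
    # Drop very short / very long clips; the exact bounds are assumptions
    return min_s <= num_samples / sr <= max_s
```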
Not sure if the WER metric should be adapted. Maybe I should add a separator (a space) between words and punctuation marks, but given how it’s calculated, I feel it should decrease regardless.
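WER is word-level edit distance divided by the number of reference words, so punctuation glued to a word turns an otherwise-correct word into a substitution, while punctuation separated by a space becomes its own token (at most one insertion or deletion). A quick pure-Python sketch shows the difference (my own minimal implementation, not the `datasets` metric):

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: Levenshtein distance over whitespace-split tokens."""
    ref, hyp = reference.split(), hypothesis.split()
    d = list(range(len(hyp) + 1))  # distances for an empty reference prefix
    for i, r in enumerate(ref, 1):
        prev, d[0] = d[0], i
        for j, h in enumerate(hyp, 1):
            # deletion, insertion, substitution/match
            prev, d[j] = d[j], min(d[j] + 1, d[j - 1] + 1, prev + (r != h))
    return d[-1] / len(ref)

# Punctuation attached: "world." vs "world" counts as a full substitution
print(wer("hello world.", "hello world"))   # 0.5
# Punctuation as its own token: one deletion out of three reference words
print(wer("hello world .", "hello world"))  # 0.333...
```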
So far my training loss decreases (on the full dataset it eventually goes to NaN, probably due to some corrupted examples), but the WER stays at 1. When testing a long run, I just get empty outputs.
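One thing worth double-checking when decoding gives empty strings is that the punctuation marks actually made it into the CTC vocabulary; a sketch of building the character vocab from the cleaned transcripts (the usual wav2vec2 fine-tuning recipe; the function name is mine):

```python
def build_vocab(transcripts):
    # Every character that survives cleaning becomes a CTC label,
    # including the punctuation we want the model to emit: . , ? !
    chars = sorted(set("".join(transcripts)))
    vocab = {c: i for i, c in enumerate(chars)}
    # Wav2Vec2CTCTokenizer uses "|" as its word delimiter,
    # so reassign the space character's slot to "|"
    vocab["|"] = vocab.pop(" ")
    vocab["[UNK]"] = len(vocab)
    vocab["[PAD]"] = len(vocab)
    return vocab

vocab = build_vocab(["is it you?", "yes, it is!"])
```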
- clone this repo
- run (note: the flag is `--gradient_accumulation_steps`):

```
python run_common_voice.py \
    --dataset_config_name en \
    --output_dir ./model \
    --overwrite_output_dir \
    --model_name_or_path facebook/wav2vec2-large-xlsr-53 \
    --num_train_epochs 3 \
    --per_device_train_batch_size 16 \
    --per_device_eval_batch_size 8 \
    --evaluation_strategy epoch \
    --fp16 \
    --freeze_feature_extractor \
    --group_by_length \
    --gradient_checkpointing \
    --do_train --do_eval \
    --save_total_limit 1 \
    --logging_steps 100 \
    --warmup_steps 500 \
    --load_best_model_at_end \
    --metric_for_best_model wer \
    --greater_is_better False \
    --gradient_accumulation_steps 2 \
    --activation_dropout 0.055 \
    --attention_dropout 0.094 \
    --feat_proj_dropout 0.04 \
    --hidden_dropout 0.047 \
    --layerdrop 0.041 \
    --learning_rate 0.000234 \
    --mask_time_prob 0.082
```
Feel free to give any suggestions. I’ll update if I get more interesting results.