I hope it is okay to ask this question here, even though it is not strictly about the huggingface package itself. I am also not sure whether this is the right forum, so please let me know if I am in the wrong place.
I am trying to decode the wav2vec2 logits with pyctcdecode (see: GitHub - kensho-technologies/pyctcdecode: A fast and lightweight python-based CTC beam search decoder for speech recognition.) instead of the greedy decoder of huggingface wav2vec2. The output looks great (often better than the output of the wav2vec2 greedy decoder). However, it sometimes drops many of the spaces, producing output like:
- eurymembersaregoingtofocusintheirquestionsiamassuremdmthatafteryuelectionthecooperation
where the reference text is:
- juri members are going to focus in their questions i am sure madam that after your election the cooperation
Any suggestions as to the root cause of this problem, and possibly how to fix it?
My setup:
- wav2vec2 model and processor: facebook/wav2vec2-base-10k-voxpopuli-ft-e
- arpa model: voxpopuli_en_5gram_lm downloaded from: voxpopuli/README.md at main · facebookresearch/voxpopuli · GitHub
EDIT: Since I posted this question, someone pointed me to this issue, which might contain the answer. I will start investigating now: Unexpected spacing with Huggingface wav2vec library · Issue #25 · kensho-technologies/pyctcdecode · GitHub
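One thing I plan to check first is the label list I pass to the decoder. This is an assumption on my part and not a confirmed cause. As far as I understand, the wav2vec2 tokenizer marks word boundaries with "|" and uses "<pad>" as the CTC blank, while pyctcdecode works with " " for the word boundary and "" for the blank. Below is a minimal sketch of that conversion. The small `vocab` dict is a made-up excerpt for illustration (the real one would come from `processor.tokenizer.get_vocab()`), and `vocab_to_labels` is a hypothetical helper name, not part of either library.

```python
# Hypothetical excerpt of a wav2vec2 tokenizer vocabulary (token -> logit index).
# In practice this would come from processor.tokenizer.get_vocab().
vocab = {"<pad>": 0, "<s>": 1, "</s>": 2, "<unk>": 3, "|": 4, "e": 5, "t": 6, "a": 7}


def vocab_to_labels(vocab, word_delimiter="|", blank_token="<pad>"):
    """Return labels ordered by logit index, with the word delimiter mapped
    to a space and the CTC blank mapped to an empty string.

    The default token names are assumptions about the tokenizer config.
    """
    # Order tokens by their index so labels[i] matches column i of the logits.
    ordered = sorted(vocab.items(), key=lambda item: item[1])
    labels = []
    for token, _ in ordered:
        if token == word_delimiter:
            labels.append(" ")  # word boundary as a literal space
        elif token == blank_token:
            labels.append("")   # CTC blank as an empty string
        else:
            labels.append(token)
    return labels


labels = vocab_to_labels(vocab)
print(labels)
```

The resulting list would then be passed as the first argument to `pyctcdecode.build_ctcdecoder`, together with `kenlm_model_path` pointing at the local copy of the ARPA file. If "|" reaches the decoder unchanged, it would presumably be treated as an ordinary character and not as a word boundary, which might explain the missing spaces.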