The Wav2Vec2Phoneme documentation (the Wav2Vec2Phoneme page in the Transformers docs) says that the output has to be decoded using Wav2Vec2PhonemeCTCTokenizer.
The huggingface model listing filtered by other=phoneme-recognition
includes a reference to facebook/wav2vec2-xlsr-53-espeak-cv-ft.
The source code there includes the line
processor = Wav2Vec2Processor.from_pretrained(checkpoint)
Execution explodes, complaining about a missing tokenizer. It is likely that the documentation
is incorrect.
I tried instantiating a Wav2Vec2PhonemeCTCTokenizer (using a vocab file in the huggingface cache).
If I'm right, the documentation will need to be changed. The download will need to be changed
to provide the vocab_file, too (I fished the json out of the huggingface cache).
tokenizer = Wav2Vec2PhonemeCTCTokenizer(vocab_file='wav2vec2-lv-60-espeak-cv-ft-vocab.json')
processor = Wav2Vec2Processor.from_pretrained(checkpoint, tokenizer)
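For context on what that vocab_file is used for, here is a minimal stdlib-only sketch of the core of CTC decoding with a phoneme vocab: collapse repeated ids, drop the blank token, and map the rest through the id-to-token table loaded from a vocab json. The tiny vocab and token names below are made up for illustration; the real vocab.json from the cache uses the same token-to-id json format but is much larger.

```python
import json
import os
import tempfile

# Hypothetical miniature vocab in the same token -> id json format as the
# real vocab file; these entries are invented for the example.
vocab = {"<pad>": 0, "a": 1, "b": 2, "k": 3}

def ctc_decode(ids, id_to_token, blank="<pad>"):
    """Collapse repeated ids and drop blanks -- the core of what a CTC
    tokenizer's decode step does with the model's argmax output."""
    out = []
    prev = None
    for i in ids:
        if i != prev:
            tok = id_to_token[i]
            if tok != blank:
                out.append(tok)
        prev = i
    return " ".join(out)

# Round-trip the vocab through a json file, the way a tokenizer
# constructed with vocab_file=... would read it.
with tempfile.TemporaryDirectory() as d:
    path = os.path.join(d, "vocab.json")
    with open(path, "w") as f:
        json.dump(vocab, f)
    with open(path) as f:
        id_to_token = {v: k for k, v in json.load(f).items()}

print(ctc_decode([1, 1, 0, 2, 2, 3], id_to_token))  # -> "a b k"
```

This is only meant to show why the tokenizer cannot work without a vocab file; the real Wav2Vec2PhonemeCTCTokenizer adds phoneme-specific handling on top of this.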
There is a legal problem with my using this (it requires espeak, which is GPL-licensed),
so I want to make sure that the above two lines are correct. Are they?