I am using Meta’s new MMS model alongside a language model developed by Meta to transcribe some long-form Amharic audio. As you can see in the code from this space, a beam-search decoder is built with ‘torchaudio.models.decoder.ctc_decoder’. Because I want to use the chunking and striding provided by the ASR pipeline in the Transformers library (sketched after the decoder code below), I have been trying to use this same decoder via the ASR pipeline by passing it in as an attribute. For reference, this is the code used to build the decoder:
from torchaudio.models.decoder import ctc_decoder
import json
from huggingface_hub import hf_hub_download

# Fetch the per-language decoding configuration shipped with the MMS CC LMs
lm_decoding_config = {}
lm_decoding_configfile = hf_hub_download(
    repo_id="facebook/mms-cclms",
    filename="decoding_config.json",
    subfolder="mms-1b-all",
)

with open(lm_decoding_configfile) as f:
    lm_decoding_config = json.loads(f.read())

# allow language model decoding for "eng"
decoding_config = lm_decoding_config["eng"]

# Download the KenLM file, the token list, and (if one exists) the lexicon
lm_file = hf_hub_download(
    repo_id="facebook/mms-cclms",
    filename=decoding_config["lmfile"].rsplit("/", 1)[1],
    subfolder=decoding_config["lmfile"].rsplit("/", 1)[0],
)
token_file = hf_hub_download(
    repo_id="facebook/mms-cclms",
    filename=decoding_config["tokensfile"].rsplit("/", 1)[1],
    subfolder=decoding_config["tokensfile"].rsplit("/", 1)[0],
)
lexicon_file = None
if decoding_config["lexiconfile"] is not None:
    lexicon_file = hf_hub_download(
        repo_id="facebook/mms-cclms",
        filename=decoding_config["lexiconfile"].rsplit("/", 1)[1],
        subfolder=decoding_config["lexiconfile"].rsplit("/", 1)[0],
    )

# Build the beam-search decoder with the LM weights from the config
beam_search_decoder = ctc_decoder(
    lexicon=lexicon_file,
    tokens=token_file,
    lm=lm_file,
    nbest=1,
    beam_size=500,
    beam_size_token=50,
    lm_weight=float(decoding_config["lmweight"]),
    word_score=float(decoding_config["wordscore"]),
    sil_score=float(decoding_config["silweight"]),
    blank_token="<s>",
)
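For context, the chunking and striding I mentioned are what the pipeline already gives you for long-form audio. A minimal sketch of the invocation I am after, with placeholder chunk/stride values and my checkpoint assumed:

from transformers import pipeline

# The ASR pipeline splits long audio into overlapping chunks, runs the
# model on each chunk, and stitches the logits back together before decoding
asr = pipeline(
    "automatic-speech-recognition",
    model="facebook/mms-1b-all",
    chunk_length_s=10,       # placeholder chunk length (seconds)
    stride_length_s=(4, 2),  # placeholder left/right overlap (seconds)
)
text = asr("long_amharic_recording.wav")  # hypothetical input file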
Because this decoder is not an instance of ‘BeamSearchDecoderCTC’ from ‘pyctcdecode’, the ASR pipeline does not support it. My question is this: what is the best way to make use of the language models developed by Meta in this pipeline? Should I build a ‘Wav2Vec2ProcessorWithLM’ from the 5gram.bin file provided for the CC LM, or would it be better to add support for ‘torchaudio.models.decoder.ctc_decoder’ to the pipeline itself? Or is there another option I should pursue? For now I have jury-rigged the ASR pipeline to call ‘ctc_decoder’ in place of ‘BeamSearchDecoderCTC’, but that does not seem like a sustainable long-term solution.
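For the first option, this is roughly what I had in mind. It is an untested sketch: it assumes the 5gram.bin referenced by the decoding config is a standard KenLM binary that ‘pyctcdecode’ can load, that the MMS tokenizer’s vocabulary lines up with the LM’s tokens, and that ‘lm_file’ here is the KenLM binary downloaded as above (using the "amh" entry instead of "eng"):

from pyctcdecode import build_ctcdecoder
from transformers import (
    Wav2Vec2CTCTokenizer,
    Wav2Vec2FeatureExtractor,
    Wav2Vec2ProcessorWithLM,
    pipeline,
)

# Tokenizer and feature extractor from the MMS checkpoint ("amh" adapter assumed)
tokenizer = Wav2Vec2CTCTokenizer.from_pretrained("facebook/mms-1b-all", target_lang="amh")
feature_extractor = Wav2Vec2FeatureExtractor.from_pretrained("facebook/mms-1b-all")

# pyctcdecode expects labels ordered by vocabulary index
vocab = tokenizer.get_vocab()
labels = [tok for tok, _ in sorted(vocab.items(), key=lambda kv: kv[1])]

# Build a pyctcdecode decoder from the CC LM's KenLM binary
decoder = build_ctcdecoder(
    labels=labels,
    kenlm_model_path=lm_file,  # path to the downloaded 5gram.bin
)

processor_with_lm = Wav2Vec2ProcessorWithLM(
    feature_extractor=feature_extractor,
    tokenizer=tokenizer,
    decoder=decoder,
)

# The pipeline would then handle chunking/striding and LM-boosted decoding together
asr = pipeline(
    "automatic-speech-recognition",
    model="facebook/mms-1b-all",
    tokenizer=processor_with_lm.tokenizer,
    feature_extractor=processor_with_lm.feature_extractor,
    decoder=processor_with_lm.decoder,
    chunk_length_s=10,  # placeholder value
)

If something like this is sound, it would sidestep the ‘torchaudio’ decoder entirely, but I am not sure whether the CC LM binaries are compatible with ‘pyctcdecode’ or whether the results would match the ctc_decoder setup above.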