BERT for Speech

How can I use HF’s BERT models for speech-to-text training?

Not easily.

BERT expects tokenized inputs: natural-language text that has been encoded (tokenized) as integer IDs. To use BERT for speech, you would need to convert your audio into similar tokens.
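
To make "text coded as numbers" concrete, here is a minimal sketch of WordPiece-style tokenization in plain Python. The vocabulary is a hypothetical toy one (a real BERT vocabulary has tens of thousands of entries), and `wordpiece` and `encode` are illustrative names, not a library API.

```python
# Toy vocabulary (hypothetical): far smaller than a real BERT vocabulary.
TOY_VOCAB = {
    "[PAD]": 0, "[UNK]": 1, "[CLS]": 2, "[SEP]": 3,
    "speech": 4, "to": 5, "text": 6, "token": 7, "##s": 8, "##ize": 9,
}


def wordpiece(word, vocab):
    """Greedy longest-match-first split of one word into subword pieces."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        piece = None
        while end > start:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # marks a word-internal piece
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return ["[UNK]"]  # no piece matched: whole word is unknown
        pieces.append(piece)
        start = end
    return pieces


def encode(text, vocab=TOY_VOCAB):
    """Return (tokens, ids) with [CLS] and [SEP] added around the text."""
    tokens = ["[CLS]"]
    for word in text.lower().split():
        tokens.extend(wordpiece(word, vocab))
    tokens.append("[SEP]")
    return tokens, [vocab[t] for t in tokens]


tokens, ids = encode("speech to tokens")
print(tokens)
print(ids)
```

The list of integer IDs is what the model actually consumes; audio would have to be turned into a sequence of the same kind.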

If you want to use a pre-trained BERT model, you would need to use exactly the same tokens it was trained with, because its learned embeddings are tied to that vocabulary. If you want to train a BERT model from scratch, you could define your own tokens.
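
As a rough illustration of "defining your own tokens" for audio, the sketch below cuts a waveform into fixed-length frames and maps each frame's energy to one of a few integer IDs. Everything here is an assumption made for illustration: the frame length, the number of bins, the special-token layout, and the use of energy as the feature. A real system would more likely learn its units from data, and a vocabulary like this would only suit a model trained from scratch.

```python
import math

NUM_SPECIAL = 2  # hypothetical layout: 0 = [PAD], 1 = [CLS]
NUM_BINS = 8     # hypothetical number of audio tokens


def frame_energies(samples, frame_len):
    """Mean squared amplitude of each non-overlapping, full-length frame."""
    energies = []
    for i in range(0, len(samples) - frame_len + 1, frame_len):
        frame = samples[i:i + frame_len]
        energies.append(sum(x * x for x in frame) / frame_len)
    return energies


def audio_to_tokens(samples, frame_len=160, num_bins=NUM_BINS):
    """Map a waveform with samples in [-1, 1] to a list of integer token IDs."""
    ids = [1]  # start with [CLS]
    for energy in frame_energies(samples, frame_len):
        # Energy lies in [0, 1]; split that range into equal-width bins.
        bin_index = min(int(energy * num_bins), num_bins - 1)
        ids.append(NUM_SPECIAL + bin_index)
    return ids


# One second of a 440 Hz sine wave sampled at 16 kHz (synthetic test signal).
wave = [0.5 * math.sin(2 * math.pi * 440 * n / 16000) for n in range(16000)]
ids = audio_to_tokens(wave)
print(len(ids), ids[:5])
```

The resulting vocabulary has `NUM_SPECIAL + NUM_BINS` entries, which would be the vocabulary size to configure when training from scratch.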

To learn more about tokenizing, try the BERT Word Embeddings Tutorial by Chris McCormick.