What is the language modeling loss (for next-token prediction) for the HuBERT model?

I came across the HubertForCTC documentation, which says that the loss returned by this model is the language modeling loss for next-token prediction. Can someone explain what this “next token” is in an ASR model?

For example, given the sentence “The quick brown fox jumps over the lazy dog”, the language modeling losses (for the next token) in other NLP models (e.g., BERT, GPT-2) relate to the next predicted “word” in this sentence. Is the next-token language modeling loss for an ASR model also related to the next predicted “word”? Or is it actually about the next predicted frame of the input wav, which is associated with a character?
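To make concrete what I mean by next-token language modeling loss, here is a toy sketch of the usual mean negative log-probability over next words; the probabilities are made up for illustration, not taken from any real model:

```python
import math

# Toy next-token LM loss: at each position t, the model assigns a
# probability to the true token at position t+1; the loss is the mean
# negative log-probability of those true next tokens.
sentence = ["The", "quick", "brown", "fox"]

# Hypothetical probabilities the model gives the true next word:
# P("quick" | "The"), P("brown" | "The quick"), P("fox" | "The quick brown")
p_next = [0.20, 0.10, 0.50]

loss = -sum(math.log(p) for p in p_next) / len(p_next)
print(loss)
```

My question is whether, for an ASR model, the “token” in this computation is a word like above, or a per-frame character prediction.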

Thanks in advance!