XLSR-53: To group tokens or not to group tokens

In @patrickvonplaten's Fine-Tuning XLSR-53 notebook, he mentions that tokens should not be grouped when computing metrics (in that notebook's case, the WER metric), which makes sense. However, later in the notebook he uses the processor to decode the predictions without passing the `group_tokens=False` argument to the method.

Shouldn't we decode the same way when computing metrics and when outputting predictions? Which way is the correct one? This is probably a minor issue for languages that rarely duplicate graphemes, but I'm curious, since it could affect the perceived performance one way or the other.
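To make the question concrete, here is a minimal stdlib-only sketch of what `group_tokens` controls during CTC-style decoding. This mimics the behavior, it is not the actual `transformers` implementation, and the vocabulary, pad symbol, and helper name are made up for illustration:

```python
from itertools import groupby

PAD = "<pad>"  # assumption for this sketch: the pad token doubles as the CTC blank


def ctc_decode(tokens, group_tokens=True):
    """Optionally collapse adjacent repeats (CTC grouping), then drop pad tokens.

    With group_tokens=False the repeats survive, which matters when the
    repeats are legitimate doubled graphemes rather than CTC duplicates.
    """
    if group_tokens:
        # merge runs of identical adjacent tokens into a single token
        tokens = [key for key, _ in groupby(tokens)]
    return "".join(t for t in tokens if t != PAD)


# A blank between the two "l" frames is what keeps the double "l" in "hello":
frames = ["h", "e", "l", "l", PAD, "l", "o"]
print(ctc_decode(frames))                      # "hello": adjacent repeats merged
print(ctc_decode(frames, group_tokens=False))  # "helllo": every frame kept
```

This is also why decoding *label* sequences with grouping enabled is dangerous: labels have no blanks between repeated characters, so a legitimate "ll" would be contracted to a single "l", skewing the metric relative to the raw prediction output.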

Could someone clarify this for me?

Hey @jjdv,

Could you check whether this issue answers your question: wav2vec2: `convert_tokens_to_string` contracts legitimately repeated characters · Issue #10619 · huggingface/transformers · GitHub?