I am trying to pre-train the Wav2Vec2-Conformer model available in the Hugging Face Transformers library. During training, I compute the loss using both the masked time indices and the sampled negative indices (100 distractors). However, I have noticed that the model's contrastive loss is around 2000, while the diversity loss is around 200. I am currently testing this training on Mozilla Common Voice 8.0.
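For context, here is a minimal, self-contained sketch of how I am wiring up the loss, using the library's `_compute_mask_indices` / `_sample_negative_indices` helpers. Note this uses a tiny hypothetical config and random audio so it runs standalone; it is not my actual training setup:

```python
import torch
from transformers import Wav2Vec2ConformerConfig, Wav2Vec2ConformerForPreTraining
from transformers.models.wav2vec2.modeling_wav2vec2 import (
    _compute_mask_indices,
    _sample_negative_indices,
)

# Tiny hypothetical config so the sketch runs without downloading weights;
# the real run uses the standard config. num_negatives=100 distractors.
config = Wav2Vec2ConformerConfig(
    hidden_size=32,
    num_hidden_layers=2,
    num_attention_heads=2,
    intermediate_size=64,
    conv_dim=[16] * 7,
    num_conv_pos_embeddings=16,
    num_conv_pos_embedding_groups=2,
    num_negatives=100,
    codevector_dim=16,
    proj_codevector_dim=16,
)
model = Wav2Vec2ConformerForPreTraining(config)
model.train()

batch_size, raw_len = 2, 16000  # two random 1-second "waveforms" at 16 kHz
input_values = torch.randn(batch_size, raw_len)
seq_len = model._get_feat_extract_output_lengths(raw_len).item()

# Choose masked time steps, then sample negatives from masked positions only
mask_time_indices = _compute_mask_indices(
    (batch_size, seq_len), mask_prob=0.65, mask_length=10
)
sampled_negative_indices = _sample_negative_indices(
    (batch_size, seq_len),
    num_negatives=config.num_negatives,
    mask_time_indices=mask_time_indices,
)
mask_time_indices = torch.tensor(mask_time_indices, dtype=torch.bool)
sampled_negative_indices = torch.tensor(sampled_negative_indices, dtype=torch.long)

outputs = model(
    input_values,
    mask_time_indices=mask_time_indices,
    sampled_negative_indices=sampled_negative_indices,
)

# outputs.loss = contrastive_loss + diversity_loss_weight * diversity_loss,
# so the raw diversity term is scaled by the weight (0.1 by default)
print(outputs.contrastive_loss.item(),
      outputs.diversity_loss.item(),
      config.diversity_loss_weight)
```

The two terms I quoted above come straight from `outputs.contrastive_loss` and `outputs.diversity_loss` on the `Wav2Vec2ForPreTrainingOutput`.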
Are these loss values in an acceptable range, or is there something I might be missing or doing wrong? I would appreciate any feedback or suggestions you might have.
Thank you for your time and help.