I am working through the Fine-Tune XLSR-Wav2Vec2 on Turkish ASR with Transformers blog post. Everything works fine on Colab, but when I run the same code and data on an Apple M2 Max I get a WER of 1.0, with the model predicting empty strings. Resampling with `Audio` to 16000 Hz also produces different values for the elements of the audio arrays on Colab vs. the M2 Max. This is very frustrating, since it means I cannot use the M2 Max for local training runs. Any idea how this can be fixed?
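To pin down where the arrays start to diverge, one option is to fingerprint the resampled audio on both machines and diff the results. Below is a minimal sketch: `fingerprint` is a hypothetical helper (not from the blog post), and it assumes the `datasets` `Audio` column exposes the samples as `batch["audio"]["array"]`; the rounding step ignores harmless float noise so only real discrepancies show up.

```python
import hashlib

import numpy as np

def fingerprint(arr) -> str:
    """Stable fingerprint of an audio array: round to 6 decimal places so
    tiny float noise is ignored, then hash the raw bytes."""
    rounded = np.round(np.asarray(arr, dtype=np.float64), 6)
    return hashlib.sha256(rounded.tobytes()).hexdigest()[:16]

# Demo on a synthetic 440 Hz tone; on the real data you would print
# fingerprint(batch["audio"]["array"]) for the same sample on Colab and
# on the M2 Max and compare the two strings.
t = np.linspace(0, 1, 16_000, endpoint=False)
tone = np.sin(2 * np.pi * 440 * t).astype(np.float32)
print(fingerprint(tone))
```

If the fingerprints already differ right after the `Audio` cast, the problem is in decoding/resampling rather than in training itself.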
When I train on the CPU only (`use_cpu=True` in `trainings_arg`), the WER matches the Colab result, but training is much slower. So it seems to be related to PyTorch and Apple's MPS/GPU backend. Any idea how to get MPS to produce a correct training run?
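One thing worth trying before giving up on the GPU: PyTorch has a real environment variable, `PYTORCH_ENABLE_MPS_FALLBACK`, that makes unimplemented MPS operations silently run on the CPU instead. A minimal sketch (the `TrainingArguments` call in the comment is illustrative, not taken from the blog post):

```python
import os

# Must be set BEFORE `import torch`: tells PyTorch to fall back to the
# CPU for any op the MPS backend does not implement.
os.environ["PYTORCH_ENABLE_MPS_FALLBACK"] = "1"

# If MPS still produces a degenerate WER, forcing the CPU is the known
# workaround, e.g. (hypothetical arguments, adapt to your setup):
#
#   from transformers import TrainingArguments
#   trainings_arg = TrainingArguments(output_dir="./wav2vec2-tr",
#                                     use_cpu=True)

print(os.environ["PYTORCH_ENABLE_MPS_FALLBACK"])
```

Note the fallback only helps if the bad numbers come from a broken or missing MPS kernel; if the arrays already differ at the resampling stage, the environment variable will not change that.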