Detecting silence and noise in audio

Hi,
I have created a custom XLS-R ASR model for Turkish using facebook/wav2vec2-xls-r-300m.
It decodes audio pretty well.
Previously we were using Kaldi, and our corpus contains some custom tags, like <silence> for silence and <noise> for noisy parts of the audio. I wonder how I can achieve the same with XLS-R?
I have just fine-tuned my model on a small set (around 7K samples) whose transcripts contain such tags. I added those tags to the vocab and then started training; roughly, the setup looks like the sketch below.
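This is a simplified sketch of how I added the tags and built the tokenizer (file names here are placeholders, not my exact paths):

```python
import json
from transformers import Wav2Vec2CTCTokenizer

# Load the existing character vocab (file name is a placeholder).
with open("vocab.json", "r", encoding="utf-8") as f:
    vocab = json.load(f)

# Append the tag tokens at the end so the existing character ids stay unchanged.
for tag in ["<sil>", "<spn>", "<nsn>"]:
    if tag not in vocab:
        vocab[tag] = len(vocab)

with open("vocab_with_tags.json", "w", encoding="utf-8") as f:
    json.dump(vocab, f, ensure_ascii=False)

# Tokenizer built from the extended vocab, same special tokens as before.
tokenizer = Wav2Vec2CTCTokenizer(
    "vocab_with_tags.json",
    unk_token="[UNK]",
    pad_token="[PAD]",
    word_delimiter_token="|",
)
```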
With the 3rd checkpoint I got the following output for my test audio:

<nsn>w<nsn>e<nsn>ë<nsn>r<nsn>ö<nsn>f<nsn>e<nsn><sil><nsn>o<nsn>e<nsn>r<nsn>ö<nsn>hi<nsn>iv<nsn>p<nsn>m<nsn>ö<nsn>qmp<nsn>pix<nsn>z<nsn>i<nsn>o<nsn>mp<nsn>pi<nsn>v<nsn>m<nsn>ö<nsn>l<nsn>i<nsn>t<nsn>mr<nsn>mî<nsn>m<nsn>ö<nsn>w<nsn>e<nsn>ë<nsn>k<nsn><nsn>ë<nsn>p<nsn>e<nsn>ö<nsn>w<nsn>i<nsn>p<nsn>eq<nsn>p<nsn>ës<nsn>v<nsn>y<nsn>q<nsn>

The ground truth text is:
 <sil> sayın başkan değerli milletvekilleri hepinizi saygıyla selamlıyorum <sil>
Note: <sil> = silence
      <spn> = speaker noise (like ıııı, eeee, mmm, etc.)
      <nsn> = non-speaker noise (external, non-human sounds such as horn sounds or beeps)
My new vocab with the tags is:
 {'a': 1, 'b': 2, 'c': 3, 'd': 4, 'e': 5, 'f': 6, 'g': 7, 'h': 8, 'i': 9, 'j': 10, 'k': 11, 'l': 12, 'm': 13, 'n': 14, 'o': 15, 'p': 16, 'q': 17, 'r': 18, 's': 19, 't': 20, 'u': 21, 'v': 22, 'w': 23, 'x': 24, 'y': 25, 'z': 26, 'â': 27, 'ç': 28, 'ë': 29, 'î': 30, 'ö': 31, 'ü': 32, 'ğ': 33, 'ı': 34, 'ş': 35, '⁇': 36, "'": 37, '[UNK]': 38, '[PAD]': 39, '|': 0, '<sil>': 40, '<spn>': 41, '<nsn>': 42, '<s>': 43, '</s>': 44}

It looks like there is a shift between the actual and the expected results.
I could re-map the predicted characters back (roughly as in the sketch below); however, I am not sure I am on the right path.
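If the shift really is just an old-vs-new id assignment mismatch, the re-mapping I have in mind is roughly this (vocab file names are placeholders, and this is only a workaround sketch, not something I am sure is correct):

```python
import json

# Idea: if the model's output layer still follows the old id ordering while
# decoding uses the new vocab, re-interpret each predicted character under
# the old id assignment. File names are placeholders.
with open("vocab_old.json", "r", encoding="utf-8") as f:
    old_vocab = json.load(f)   # char -> id from the earlier (untagged) training
with open("vocab_with_tags.json", "r", encoding="utf-8") as f:
    new_vocab = json.load(f)   # char -> id in the current tokenizer

old_id_to_char = {i: c for c, i in old_vocab.items()}

def remap(text: str) -> str:
    out = []
    for ch in text:
        raw_id = new_vocab.get(ch)                   # id the model actually emitted
        out.append(old_id_to_char.get(raw_id, ch))   # what that id meant before
    return "".join(out)

# e.g. the "w e ë ..." output should then come out closer to "sayın ..."
```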
I have a few questions I would like to consult you on:
1- Is XLS-R (which is based on wav2vec2) able to detect the silence and noise parts in a given audio and map them correctly to the tags?
e.g:

1.1- really. <sil> well I don't believe <sil>
1.2- <spn>  <nsn>  could you please stop that noise <nsn>

2- Do I need to feed more tagged corpus data into the training in order to outweigh the previously given data without tags?
3- Is there a built-in method for detecting silence and noise in the given audios?
4- Should I wait until the new training run has progressed for some 10-20 epochs?

I would appreciate it if someone could clear up this subject and provide some guidance.
Thanks in advance.
Yilmaz A.