Hubert ASR fine-tuning giving weird results

Hi,

I have taken the hubert-large-ls960-ft model and fine-tuned it on my dataset. I can see the loss and WER decreasing at each step, and the best model is saved. But when I try to do inference, I get repeated characters like ‘zhzhzhzzhzhz’. I am not sure where I am going wrong, because I used the same script (Fine-Tune Wav2Vec2 for English ASR in Hugging Face with 🤗 Transformers) for my other wav2vec2 models and they work fine. I have only replaced Wav2Vec2ForCTC with HubertForCTC; the rest of the script is identical. Can anyone please help me?

Training progression is as follows:

[screenshot of the training log: loss and WER per step]

And the code snippet where I made the change is as follows:


```python
from transformers import HubertForCTC

# `processor` (a Wav2Vec2Processor with the custom tokenizer) is built
# earlier in the script, exactly as in the wav2vec2 blog post.
model = HubertForCTC.from_pretrained(
    "facebook/hubert-large-ls960-ft",
    attention_dropout=0.01,
    feat_proj_dropout=0.05,
    activation_dropout=0.05,
    hidden_dropout=0.05,
    hidden_act="gelu",
    mask_time_prob=0.1,
    layerdrop=0.05,
    gradient_checkpointing=True,
    ctc_loss_reduction="mean",
    pad_token_id=processor.tokenizer.pad_token_id,
    vocab_size=len(processor.tokenizer),
)
```
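One thing that may be worth ruling out (an assumption on my part, not something visible in the snippet): facebook/hubert-large-ls960-ft is already a fine-tuned CTC checkpoint with its own English character vocabulary, so if `len(processor.tokenizer)` differs from the checkpoint's head size, the CTC head and the tokenizer can end up out of sync, which typically shows up as garbage like repeated characters. A minimal sanity check, using the `model` and `processor` created above:

```python
# Minimal sanity check (sketch): the CTC head must have exactly one
# output unit per tokenizer entry, and the pad/blank ids must match.
assert model.lm_head.out_features == len(processor.tokenizer), (
    f"CTC head has {model.lm_head.out_features} outputs, "
    f"tokenizer has {len(processor.tokenizer)} tokens"
)
print("pad/blank id (config vs tokenizer):",
      model.config.pad_token_id, processor.tokenizer.pad_token_id)
```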

Inference script is as follows:


```python
import librosa
import torch
import torchaudio
from datasets import Dataset
from transformers import HubertForCTC, Wav2Vec2Processor

model = HubertForCTC.from_pretrained(
    "/content/drive/MyDrive/finalmodel/checkpoint-9500"
).to("cuda")
processor = Wav2Vec2Processor.from_pretrained("/content/drive/MyDrive/finalmodel")

# `test_df` is a pandas DataFrame with a Clip_ID column, prepared earlier.
test_df["audio_path"] = "/content/test_audio/" + test_df["Clip_ID"] + ".mp3"
test_df1 = test_df[["audio_path"]]
test_data = Dataset.from_pandas(test_df1)

def speech_file_to_array_fn(batch):
    speech_array, sampling_rate = torchaudio.load(batch["audio_path"])
    # Resample from the file's actual rate (32 kHz for these clips) to
    # the 16 kHz the model expects.
    batch["speech"] = librosa.resample(
        speech_array[0].numpy(), orig_sr=sampling_rate, target_sr=16_000
    )
    batch["sampling_rate"] = 16_000
    return batch

test_data = test_data.map(speech_file_to_array_fn, remove_columns=test_data.column_names)

def evaluate(batch):
    inputs = processor(
        batch["speech"], sampling_rate=16_000, return_tensors="pt", padding=True
    )
    with torch.no_grad():
        logits = model(inputs.input_values.to("cuda")).logits
    # Greedy CTC decoding: argmax per frame, then collapse repeats/blanks.
    pred_ids = torch.argmax(logits, dim=-1)
    batch["pred_strings"] = processor.batch_decode(pred_ids)
    return batch

result = test_data.map(evaluate, batched=True, batch_size=8)
result["pred_strings"]
```

Hey @sammy786,

Could you please upload all the relevant files that are created during training to the Hub so that I can take a look? I especially need to look at the tokenizer that was created when running the script.

So please upload all the files required to run the model for inference, as well as the bash and Python scripts you used for training, to a repository on the Hub. There is no way I can figure out what might be wrong just by looking at screenshots. Thank you!
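In case it is unclear how to get everything onto the Hub, a minimal sketch (the repo id below is just a placeholder; `push_to_hub` assumes you are logged in, e.g. via `huggingface-cli login`):

```python
from transformers import HubertForCTC, Wav2Vec2Processor

# Hypothetical repo id; replace with your own username/repo name.
repo_id = "sammy786/hubert-finetune-debug"

model = HubertForCTC.from_pretrained("/content/drive/MyDrive/finalmodel/checkpoint-9500")
processor = Wav2Vec2Processor.from_pretrained("/content/drive/MyDrive/finalmodel")

# Uploads the config, weights, tokenizer and feature extractor files.
# Training scripts can be added to the same repo via the web UI or
# `huggingface_hub.upload_file`.
model.push_to_hub(repo_id)
processor.push_to_hub(repo_id)
```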