Hubert ASR fine-tuning giving weird results

Hi,

I have taken the hubert-large-ls960-ft model and fine-tuned it on my dataset. I can see the loss and WER decreasing at each step, and the best model is saved. But when I try to do inference, I get repeated characters like ‘zhzhzhzzhzhz’. I am not sure where I am going wrong, because I used the same script (Fine-Tune Wav2Vec2 for English ASR in Hugging Face with 🤗 Transformers) for my other wav2vec2 models and they work fine. I have only replaced Wav2Vec2ForCTC with HubertForCTC; the rest of the script is identical. Can anyone please help me?

Training progression is as follows:

[screenshot of the training log: loss and WER per step]

And the code snippet where I made the change is as follows:


```python
from transformers import HubertForCTC

# `processor` (a Wav2Vec2Processor with the custom tokenizer) is built
# earlier in the script, exactly as in the wav2vec2 blog post.
model = HubertForCTC.from_pretrained(
    "facebook/hubert-large-ls960-ft",
    attention_dropout=0.01,
    feat_proj_dropout=0.05,
    activation_dropout=0.05,
    hidden_dropout=0.05,
    hidden_act="gelu",
    mask_time_prob=0.1,
    layerdrop=0.05,
    gradient_checkpointing=True,
    ctc_loss_reduction="mean",
    pad_token_id=processor.tokenizer.pad_token_id,
    vocab_size=len(processor.tokenizer),
)
```
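One thing that may be worth ruling out (an assumption on my part, not something visible in the snippet): facebook/hubert-large-ls960-ft is already a fine-tuned CTC checkpoint with its own English character vocabulary, so if `len(processor.tokenizer)` differs from the checkpoint's head size, the CTC head and the tokenizer can end up out of sync, which typically shows up as garbage like repeated characters. A minimal sanity check, using the `model` and `processor` created above:

```python
# Minimal sanity check (sketch): the CTC head must have exactly one
# output unit per tokenizer entry, and the pad/blank ids must match.
assert model.lm_head.out_features == len(processor.tokenizer), (
    f"CTC head has {model.lm_head.out_features} outputs, "
    f"tokenizer has {len(processor.tokenizer)} tokens"
)
print("pad/blank id (config vs tokenizer):",
      model.config.pad_token_id, processor.tokenizer.pad_token_id)
```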

Inference script is as follows:


```python
import librosa
import torch
import torchaudio
from datasets import Dataset
from transformers import HubertForCTC, Wav2Vec2Processor

model = HubertForCTC.from_pretrained(
    "/content/drive/MyDrive/finalmodel/checkpoint-9500"
).to("cuda")
processor = Wav2Vec2Processor.from_pretrained("/content/drive/MyDrive/finalmodel")

# `test_df` is a pandas DataFrame with a Clip_ID column, prepared earlier.
test_df["audio_path"] = "/content/test_audio/" + test_df["Clip_ID"] + ".mp3"
test_df1 = test_df[["audio_path"]]
test_data = Dataset.from_pandas(test_df1)

def speech_file_to_array_fn(batch):
    speech_array, sampling_rate = torchaudio.load(batch["audio_path"])
    # Resample from the file's actual rate (32 kHz for these clips) to
    # the 16 kHz the model expects.
    batch["speech"] = librosa.resample(
        speech_array[0].numpy(), orig_sr=sampling_rate, target_sr=16_000
    )
    batch["sampling_rate"] = 16_000
    return batch

test_data = test_data.map(speech_file_to_array_fn, remove_columns=test_data.column_names)

def evaluate(batch):
    inputs = processor(
        batch["speech"], sampling_rate=16_000, return_tensors="pt", padding=True
    )
    with torch.no_grad():
        logits = model(inputs.input_values.to("cuda")).logits
    # Greedy CTC decoding: argmax per frame, then collapse repeats/blanks.
    pred_ids = torch.argmax(logits, dim=-1)
    batch["pred_strings"] = processor.batch_decode(pred_ids)
    return batch

result = test_data.map(evaluate, batched=True, batch_size=8)
result["pred_strings"]
```

Hey @sammy786,

Could you please upload all the relevant files that are created during training to the Hub so that I can take a look? I especially need to look at the tokenizer that was created when running the script.

So please upload all the files required to run the model for inference, as well as the bash and Python scripts you used for training, to a repository on the Hub. There is no way I can figure out what might be wrong just by looking at screenshots. Thank you!
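In case it is unclear how to get everything onto the Hub, a minimal sketch (the repo id below is just a placeholder; `push_to_hub` assumes you are logged in, e.g. via `huggingface-cli login`):

```python
from transformers import HubertForCTC, Wav2Vec2Processor

# Hypothetical repo id; replace with your own username/repo name.
repo_id = "sammy786/hubert-finetune-debug"

model = HubertForCTC.from_pretrained("/content/drive/MyDrive/finalmodel/checkpoint-9500")
processor = Wav2Vec2Processor.from_pretrained("/content/drive/MyDrive/finalmodel")

# Uploads the config, weights, tokenizer and feature extractor files.
# Training scripts can be added to the same repo via the web UI or
# `huggingface_hub.upload_file`.
model.push_to_hub(repo_id)
processor.push_to_hub(repo_id)
```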