Hi
I took the hubert-large-ls960-ft model and fine-tuned it on my own dataset. The loss and WER decrease at each step, and the best model is saved. However, when I run inference I get repeated characters such as ‘zhzhzhzzhzhz’. I am not sure where I am going wrong, because I used the same script (Fine-Tune Wav2Vec2 for English ASR in Hugging Face with 🤗 Transformers) for my other wav2vec2 models and they work fine. The only change is that I replaced Wav2Vec2ForCTC with HubertForCTC; the rest of the script is identical. Can anyone please help me?
Training progression is as follows: (screenshot not preserved)
The code snippet where I made the change is as follows:

from transformers import HubertForCTC

# `processor` is the Wav2Vec2Processor created earlier in the fine-tuning script (not shown).
model = HubertForCTC.from_pretrained(
    "facebook/hubert-large-ls960-ft",
    attention_dropout=0.01,
    feat_proj_dropout=0.05,
    activation_dropout=0.05,
    hidden_dropout=0.05,
    hidden_act="gelu",
    # feat_proj_dropout=0.0,
    mask_time_prob=0.1,
    layerdrop=0.05,
    gradient_checkpointing=True,
    ctc_loss_reduction="mean",
    pad_token_id=processor.tokenizer.pad_token_id,
    vocab_size=len(processor.tokenizer),
)

The inference script is as follows:

import librosa
import numpy as np
import torch
import torchaudio
from datasets import Dataset
from transformers import HubertForCTC, Wav2Vec2Processor

model = HubertForCTC.from_pretrained('/content/drive/MyDrive/finalmodel/checkpoint-9500').to("cuda")
processor = Wav2Vec2Processor.from_pretrained("/content/drive/MyDrive/finalmodel")

# test_df is a pandas DataFrame with a Clip_ID column (loaded earlier, not shown).
test_df['audio_path'] = '/content/test_audio/' + test_df['Clip_ID'] + '.mp3'
test_df1 = test_df[['audio_path']]
test_data = Dataset.from_pandas(test_df1)

def speech_file_to_array_fn(batch):
    speech_array, sampling_rate = torchaudio.load(batch["audio_path"])
    # Resample the first channel to 16 kHz; the source rate is hard-coded as 32 kHz.
    batch["speech"] = librosa.resample(np.asarray(speech_array[0].numpy()), orig_sr=32_000, target_sr=16_000)
    batch["sampling_rate"] = 16_000
    return batch

test_data = test_data.map(speech_file_to_array_fn, remove_columns=test_data.column_names)

def evaluate(batch):
    inputs = processor(batch["speech"], sampling_rate=16_000, return_tensors="pt", padding=True)
    with torch.no_grad():
        logits = model(inputs.input_values.to("cuda")).logits
    # Greedy CTC decoding: take the most likely token per frame, then let the processor collapse it.
    pred_ids = torch.argmax(logits, dim=-1)
    batch["pred_strings"] = processor.batch_decode(pred_ids)
    return batch

result = test_data.map(evaluate, batched=True, batch_size=8)
result["pred_strings"]
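
One check that might help narrow this down is to look at the raw frame-level `pred_ids` before `batch_decode`, to see whether the model is really emitting alternating tokens with almost no blanks. The sketch below is a pure-Python illustration of greedy CTC collapsing (merge consecutive repeats, then drop the blank token). The helper names `ctc_greedy_collapse` and `blank_ratio` and the toy ids are hypothetical, and it assumes the CTC blank is the tokenizer's pad token, as in the Wav2Vec2 fine-tuning script.

```python
from itertools import groupby


def ctc_greedy_collapse(ids, blank_id):
    """Merge consecutive duplicate ids, then drop the blank id."""
    merged = [token for token, _ in groupby(ids)]
    return [token for token in merged if token != blank_id]


def blank_ratio(ids, blank_id):
    """Return the fraction of frames predicted as blank (0.0 for empty input)."""
    if not ids:
        return 0.0
    return sum(1 for token in ids if token == blank_id) / len(ids)


# Toy frame-level predictions; 0 stands in for the blank (pad) id.
healthy = [0, 0, 5, 5, 0, 0, 7, 0, 0, 0]
alternating = [5, 7, 5, 7, 5, 7, 5, 7, 5, 7]

print(ctc_greedy_collapse(healthy, blank_id=0))
print(ctc_greedy_collapse(alternating, blank_id=0))
print(blank_ratio(healthy, blank_id=0), blank_ratio(alternating, blank_id=0))
```

With the real model, `pred_ids[0].tolist()` and `processor.tokenizer.pad_token_id` would take the place of the toy values. A blank ratio close to zero together with alternating ids would suggest the problem lies in the frame-level predictions rather than in the decoding step.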