Fine-tuning of wav2vec2-xls-r-300m outputs invalid words for Bengali data

I have used the pretrained wav2vec2-xls-r-300m model and fine-tuned it on a 1000-hour Bengali dataset. Training took 4 full days for 20 epochs. But there is an issue in decoding: the model decodes in an arbitrary fashion, basically outputting random combinations of Bengali letters (which have no meaning, as confirmed by native Bengali speakers). It shows a WER of 100% for all sentences.

My code is based on the notebook at Google Colab.
@sanchit-gandhi Please suggest what could have gone wrong. Should I switch to fairseq and redo the experiments?
(Screenshot of the eval loss curve attached.)

================ Code snippet ===============


# imports assumed from the rest of the snippet; DataCollatorCTCWithPadding is the
# data-collator dataclass defined in the fine-tuning notebook this code is based on
import io
import json
import os
import numpy as np
import pandas as pd
from datasets import Audio, Dataset, load_metric
from transformers import Trainer, TrainingArguments, Wav2Vec2CTCTokenizer, Wav2Vec2FeatureExtractor, Wav2Vec2ForCTC, Wav2Vec2Processor

def compute_metrics(pred):
    tokenizer = Wav2Vec2CTCTokenizer(vocabFile, unk_token="[UNK]", pad_token="[PAD]", word_delimiter_token="|")

    feature_extractor = Wav2Vec2FeatureExtractor(feature_size=1, sampling_rate=16000, padding_value=0.0, do_normalize=True, return_attention_mask=False)

    # The feature extractor and tokenizer are wrapped into a single Wav2Vec2Processor class
    processor = Wav2Vec2Processor(feature_extractor=feature_extractor, tokenizer=tokenizer)

    pred_logits = pred.predictions
    pred_ids = np.argmax(pred_logits, axis=-1)
    pred.label_ids[pred.label_ids == -100] = processor.tokenizer.pad_token_id
    pred_str = processor.batch_decode(pred_ids)
    # we do not want to group tokens when computing the metrics
    label_str = processor.batch_decode(pred.label_ids, group_tokens=False)
    wer_metric = load_metric("wer")
    wer = wer_metric.compute(predictions=pred_str, references=label_str)
    return {"wer": wer}

def extract_all_chars(batch):
    all_text = " ".join(batch["transcription"])
    vocab = list(set(all_text))
    return {"vocab": [vocab], "all_text": [all_text]}

def vocabCreation(train_data, vocabFile):
    vocabs = train_data.map(extract_all_chars, batched=True, batch_size=-1, keep_in_memory=True, remove_columns=train_data.column_names)
    vocab_list = list(set(vocabs["vocab"][0]) | set(vocabs["vocab"][0]))
    vocab_dict = {v: k for k, v in enumerate(vocab_list)}
    vocab_dict["|"] = vocab_dict[" "]
    del vocab_dict[" "]
    vocab_dict["[UNK]"] = len(vocab_dict)
    vocab_dict["[PAD]"] = len(vocab_dict)
    #vocab_dict["NOISE"] = len(vocab_dict)
    ## Save the vocabulary as a json file
    with io.open(vocabFile, 'w', encoding='utf8') as vocab_file:
    #with io.open(vocabFile, 'w', encoding='utf-16le') as vocab_file:
        json.dump(vocab_dict, vocab_file, ensure_ascii=False)

def readDatasets(csv_file):
    df = pd.read_csv(csv_file, sep=":", low_memory=False)
    transcription = df['transcription']
    fullWavFile = df['wav']  ## Add absolute path of wave file
    data = Dataset.from_pandas(df)
    new_features = data.features.copy()
    new_features["wav"] = Audio(sampling_rate=16000)
    data_out = data.cast(new_features)
    return data_out  # df, transcription, fullWavFile

def prepare_dataset(batch):
    # Tokenizer that processes the model's output format to text. We use the json file to instantiate an object of the Wav2Vec2CTCTokenizer class
    tokenizer = Wav2Vec2CTCTokenizer(vocabFile, unk_token="[UNK]", pad_token="[PAD]", word_delimiter_token="|")
    # Feature extractor that processes the speech signal to the model's input format
    feature_extractor = Wav2Vec2FeatureExtractor(feature_size=1, sampling_rate=16000, padding_value=0.0, do_normalize=True, return_attention_mask=False)
    # The feature extractor and tokenizer are wrapped into a single Wav2Vec2Processor class
    processor = Wav2Vec2Processor(feature_extractor=feature_extractor, tokenizer=tokenizer)
    # load and resample the audio data
    audio = batch["wav"]
    # batched output is "un-batched" to ensure mapping is correct
    # extract the input_values from the loaded audio file
    batch["input_values"] = processor(audio["array"], sampling_rate=audio["sampling_rate"]).input_values[0]
    batch["input_length"] = len(batch["input_values"])
    # encode the transcriptions to label ids
    with processor.as_target_processor():
        batch["labels"] = processor(batch["transcription"]).input_ids
    return batch

train_data = readDatasets(trainFile)
vocabCreation(train_data, vocabFile)
tokenizer = Wav2Vec2CTCTokenizer(vocabFile, unk_token="[UNK]", pad_token="[PAD]", word_delimiter_token="|")

#Feature extractor that processes the speech signal to the model’s input format
feature_extractor = Wav2Vec2FeatureExtractor(feature_size=1, sampling_rate=16000, padding_value=0.0, do_normalize=True, return_attention_mask=False)

#The feature extractor and tokenizer are wrapped into a single Wav2Vec2Processor class
processor = Wav2Vec2Processor(feature_extractor=feature_extractor, tokenizer=tokenizer)

train_ds = train_data.map(prepare_dataset, remove_columns=train_data.column_names)
valid_data = readDatasets(validFile)
valid_ds = valid_data.map(prepare_dataset, remove_columns=valid_data.column_names)

data_collator = DataCollatorCTCWithPadding(processor=processor, padding=True)

wv_model = Wav2Vec2ForCTC.from_pretrained(
    "facebook/wav2vec2-xls-r-300m",
    #"facebook/wav2vec2-xls-r-1b",
    attention_dropout=0.0,
    hidden_dropout=0.0,
    feat_proj_dropout=0.0,
    mask_time_prob=0.05,
    layerdrop=0.0,
    ctc_loss_reduction="mean",
    pad_token_id=processor.tokenizer.pad_token_id,
    vocab_size=len(processor.tokenizer)+2,
)

#set the requires_grad to False for all parameters of the feature extraction part
wv_model.freeze_feature_extractor()

logDir = '/exp/logs/'

training_args = TrainingArguments(
    output_dir=logDir,
    group_by_length=True,
    per_device_train_batch_size=8,
    gradient_accumulation_steps=2,
    evaluation_strategy="steps",
    num_train_epochs=20,
    gradient_checkpointing=True,
    fp16=True,
    save_steps=500,
    eval_steps=500,
    logging_steps=500,
    learning_rate=3e-4,
    warmup_steps=500,
    save_total_limit=2,
    push_to_hub=False,
    #report_to="none",
)

#All the instances can be passed to the Trainer and we are ready to start the training
trainer = Trainer(
    model=wv_model,
    data_collator=data_collator,
    args=training_args,
    compute_metrics=compute_metrics,
    train_dataset=train_ds,
    eval_dataset=valid_ds,
    tokenizer=processor.feature_extractor,
)

os.environ["WANDB_DISABLED"] = "true"
# Training
trainer.train()
modelSavePath = '/exp/models/xlsr_300m_ft_bn_model-full/'

trainer.save_model(modelSavePath)

Thanks

@patrickvonplaten, @anton-l Please suggest.

Hey @manjunathke - thanks for posting on the forum!

Is the dataset you're using available on the Hugging Face Hub? I don't think the steps to load the training / eval data are included in your code snippet, meaning I can't reproduce this run.

To me, it could well be a data issue. It’s very hard to overfit with a CTC system. But we can see from your eval loss curve that the eval loss increases right from the start. This suggests to me that something is wrong with the data.
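One quick sanity check (just a sketch, assuming the processor and train_ds objects from your snippet above) is to decode a few label sequences back to text and confirm they match your CSV transcriptions:

import random

rand_int = random.randint(0, len(train_ds)-1)

labels = train_ds[rand_int]["labels"]
# decode the label ids back to a string, without CTC-style grouping of repeated characters
print(processor.tokenizer.decode(labels, group_tokens=False))

If the decoded string is garbled or full of [UNK] tokens, the vocabulary / label encoding is where I'd look first.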

Looking deeper at your code, I’m not sure you’re correctly loading your audio data. Specifically the function readDatasets(csv_file). Without the csv file, I cannot say what the correct way of loading the data is. Could you listen to a few samples of your training / test data and make sure they sound as they should?

You can do this in a notebook:

import IPython.display as ipd
import random

rand_int = random.randint(0, len(train_ds)-1)

print(train_ds[rand_int]["text"])
ipd.Audio(data=train_ds[rand_int]["audio"]["array"], autoplay=True, rate=16000)

You can also check out creating an audio dataset from this guide: Create an audio dataset

If your dataset is on the Hub, you won’t have to worry about loading from a csv file.
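For example (a sketch only - the repo id here is a placeholder for wherever you upload the data):

from datasets import load_dataset, Audio

# hypothetical repo id - replace with your own dataset on the Hub
train_data = load_dataset("your-username/bengali-asr", split="train")
# make sure the audio column is decoded at 16 kHz
train_data = train_data.cast_column("audio", Audio(sampling_rate=16000))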

Sure, thanks. Will check & update you

Hey @manjunathke! Hope you were able to fine-tune the model! Let me know if you’re experiencing any issues or have any questions - more than happy to help!

Yeah, thanks. I tried it, and it works. But the training crashes abruptly once the duration of the data crosses around 150 hours. After spending a few weeks trying to figure this out, I had to stop there without a solution. Thanks for your response.

Hey @manjunathke! Sorry for the late reply here. That’s super weird that training crashes - it’s not a problem with hard drive disk space?
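If you want to rule that out quickly, something like this (assuming the /exp/logs/ output directory from your snippet) prints the free space on that drive:

import shutil

# total / used / free space, in bytes, for the filesystem holding the checkpoints
total, used, free = shutil.disk_usage("/exp/logs/")
print(f"free disk space: {free / 1e9:.1f} GB")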

You can try deactivating group_by_length in your TrainingArguments for very large training datasets - this will speed up how quickly the trainer starts running (see the Trainer docs).
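For example, the only change relative to your original arguments would be (everything else stays as in your snippet):

training_args = TrainingArguments(
    output_dir=logDir,
    group_by_length=False,  # skip the length-grouping pass over the full dataset at start-up
    per_device_train_batch_size=8,
    gradient_accumulation_steps=2,
    evaluation_strategy="steps",
    num_train_epochs=20,
    gradient_checkpointing=True,
    fp16=True,
    save_steps=500,
    eval_steps=500,
    logging_steps=500,
    learning_rate=3e-4,
    warmup_steps=500,
    save_total_limit=2,
    push_to_hub=False,
)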