Cannot train Wav2Vec2 or HuBERT with a custom Wav2Vec2 processor

Hello everyone, I tried to build a processor for Wav2Vec2 and HuBERT following this blog post, but my WER stayed at ~0.99 the whole time. Does anyone know how to deal with this?

Here are the config files I have:

# preprocessor_config.json
{
  "do_normalize": true,
  "feature_extractor_type": "Wav2Vec2FeatureExtractor",
  "feature_size": 1,
  "padding_side": "right",
  "padding_value": 0.0,
  "processor_class": "Wav2Vec2Processor",
  "return_attention_mask": true,
  "sampling_rate": 16000
}
# special_tokens_map.json
{"bos_token": "<s>", "eos_token": "</s>", "unk_token": "<unk>", "pad_token": "<pad>"}
# tokenizer_config.json
{"unk_token": "<unk>", "bos_token": "<s>", "eos_token": "</s>", "pad_token": "<pad>", "do_lower_case": false, "word_delimiter_token": "|", "replace_word_delimiter_char": " ", "tokenizer_class": "Wav2Vec2CTCTokenizer", "processor_class": "Wav2Vec2Processor"}
# vocab.json
{"n": 0, "v": 1, "q": 2, "'": 3, "t": 4, "y": 5, "c": 6, "d": 7, "x": 8, "e": 10, "f": 11, "o": 12, "u": 13, "g": 14, "h": 15, "m": 16, "s": 17, "i": 18, "z": 19, "r": 20, "w": 21, "a": 22, "l": 23, "j": 24, "b": 25, "p": 26, "k": 27, "|": 9, "<unk>": 28, "<pad>": 29}

Here are the training arguments I used for fine-tuning:

from transformers import AutoModelForCTC, Trainer, TrainingArguments, Wav2Vec2Processor

batch = 4
epoch = 8
# load the processor from the config files saved above
processor = Wav2Vec2Processor.from_pretrained("../processor")
print(f"------------ batch {batch}, epoch {epoch} ----------------")
model = AutoModelForCTC.from_pretrained(
    "facebook/wav2vec2-base", 
    ctc_loss_reduction="mean", 
    pad_token_id=processor.tokenizer.pad_token_id)
model.freeze_feature_extractor()
training_args = TrainingArguments(
    output_dir="../checkpoints",  # required by TrainingArguments
    group_by_length=True,
    per_device_train_batch_size=batch,
    evaluation_strategy="steps",
    num_train_epochs=epoch,
    fp16=True,
    gradient_checkpointing=True,
    save_steps=100,
    eval_steps=100,
    logging_steps=100,
    learning_rate=1e-4,
    weight_decay=0.005,
    warmup_steps=50,
    save_total_limit=2,
    logging_dir='../logs',
    data_seed=42,
    metric_for_best_model="wer",
    greater_is_better=False,
    seed=42,
    load_best_model_at_end=True)
trainer = Trainer(
    model=model,
    data_collator=data_collator,
    args=training_args,
    compute_metrics=compute_metrics,
    train_dataset=wls_dt["train"],
    eval_dataset=wls_dt["test"],
    tokenizer=processor.feature_extractor)
trainer.train()

data_collator and compute_metrics are exactly the same as in the blog post. Thanks!
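
For reference, the compute_metrics in question looks roughly like this (reproduced from the blog post from memory; a WER stuck near 0.99 can also come from the label handling here, e.g. skipping the -100 replacement or decoding the labels without group_tokens=False):

import numpy as np
from datasets import load_metric

wer_metric = load_metric("wer")

def compute_metrics(pred):
    # Greedy CTC decoding: argmax over the vocabulary at every frame
    pred_ids = np.argmax(pred.predictions, axis=-1)
    # -100 is the loss ignore index; map it back to the pad token before decoding
    pred.label_ids[pred.label_ids == -100] = processor.tokenizer.pad_token_id
    pred_str = processor.batch_decode(pred_ids)
    # group_tokens=False so repeated characters in the references are kept
    label_str = processor.batch_decode(pred.label_ids, group_tokens=False)
    return {"wer": wer_metric.compute(predictions=pred_str, references=label_str)}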

I was stuck in a similar position. For HuBERT at least, if you are using HuBERT-base, which is a smaller model (~90M params), training takes time to show results. For me, on LibriSpeech clean, it took around 25 epochs at a learning rate of 1e-5 for the WER to come down from 0.99. That said, I was training on a smaller dataset (only the validation split); if you are using, say, train.100, it should take fewer epochs.
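
Concretely, the schedule that eventually worked for me looked something like this (a sketch; output_dir and the remaining arguments are placeholders for your own setup):

from transformers import TrainingArguments

# Illustrative schedule for HuBERT-base: lower learning rate and more
# epochs than in the post above (output_dir is a placeholder path)
training_args = TrainingArguments(
    output_dir="../checkpoints",
    per_device_train_batch_size=4,
    num_train_epochs=25,       # WER only started dropping from 0.99 around here
    learning_rate=1e-5,
    evaluation_strategy="steps",
    eval_steps=100,
    logging_steps=100,
)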

Yes, I got it solved! Double-check the pre-processing and the metadata file; in my case the problem was wrongly aligned audio-text pairs.
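
A quick way to catch that, assuming the prepared-dataset columns from the blog post ("input_values" and "labels"; adjust the names if yours differ), is to decode a few label sequences and check them against the audio:

import random

# Spot-check alignment: decode a handful of label sequences back to text and
# verify each against its audio (duration printed as a rough cross-check)
for i in random.sample(range(len(wls_dt["train"])), 5):
    sample = wls_dt["train"][i]
    text = processor.tokenizer.decode(sample["labels"], group_tokens=False)
    duration = len(sample["input_values"]) / 16000  # samples at 16 kHz
    print(f"{i}: {duration:.1f}s -> {text}")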

That’s interesting. As I mentioned above, I found that I simply wasn’t training for long enough.