Eliminating PAD token from wav2vec2 prediction

Hi, I used wav2vec2 to build an ASR model for Romanian. The model predicts everything correctly, but the [PAD] token is not removed from the predictions, which gives a bad word error rate. I attached the code for the model prediction and an image with the output. The dataset is not "timit" but Mozilla Common Voice; I forgot to change the variable name from an older script.

#Other library imports

from transformers import Wav2Vec2Processor
from transformers import Wav2Vec2FeatureExtractor
from transformers import Wav2Vec2ForCTC
from transformers import Wav2Vec2CTCTokenizer
from datasets import DatasetDict, load_from_disk
from util import DataCollatorCTCWithPadding
from datasets import load_metric
from datasets import ClassLabel
import random
import pandas as pd
from IPython.display import display, HTML
import torch

#We want to see a couple of samples
def show_random_elements(dataset, num_examples=10):
    assert num_examples <= len(dataset), "Can't pick more elements than there are in the dataset."
    picks = []
    for _ in range(num_examples):
        pick = random.randint(0, len(dataset)-1)
        while pick in picks:
            pick = random.randint(0, len(dataset)-1)
        picks.append(pick)  # record the pick so the sample is actually selected

    df = pd.DataFrame(dataset[picks])
    # display(HTML(df.to_html()))

#Importing the model
print("----Importing the model----")
processor = Wav2Vec2Processor.from_pretrained("wav2vec2-large-xlsr-english-concept")

#Importing the preprocessed dataset
print("----Importing the preprocessed dataset----")
timit = load_from_disk("data")

#Generating the data collator
print("----Generating the data collator----")
data_collator = DataCollatorCTCWithPadding(processor=processor, padding=True)

#Wer metric
wer_metric = load_metric("wer")

#Loading the model
processor = Wav2Vec2Processor.from_pretrained("models/wav2vec2-large-xlsr-english-concept-model-4s/checkpoint-1800")
model = Wav2Vec2ForCTC.from_pretrained("models/wav2vec2-large-xlsr-english-concept-model-4s/checkpoint-1800")

device = torch.device('cuda')
model.to(device)  # map_to_result creates its inputs on the GPU, so the model has to be there too

#Mapping from logits to text
def map_to_result(batch):
  with torch.no_grad():
    input_values = torch.tensor(batch["input_values"], device="cuda").unsqueeze(0)
    logits = model(input_values).logits

  pred_ids = torch.argmax(logits, dim=-1)
  batch["pred_str"] = processor.batch_decode(pred_ids)[0]
  batch["text"] = processor.decode(batch["labels"], group_tokens=False)

  return batch

results = timit.map(map_to_result, remove_columns=timit.column_names)

print("Test WER: {:.3f}".format(wer_metric.compute(predictions=results["pred_str"], references=results["text"])))


Did you find a solution to this problem? I have the same issue with a Turkish wav2vec2 model on a custom dataset.

I am not sure whether you found a solution, but I managed to find one. I realized that pad_token_id is different when the processor is loaded with from_pretrained(), so the processor could not strip the [PAD] tokens from the predictions.
The root cause was that I never saved the processor with save_pretrained(). I added processor.save_pretrained() after saving the model, which solved the problem.
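
For anyone hitting the same thing, here is a minimal sketch of the fix as I understand it, not a definitive implementation. The helper names pad_ids_match and save_model_and_processor are my own (hypothetical), and the directory name in the usage comment is only a placeholder. It assumes the processor exposes its tokenizer as processor.tokenizer and that the model's CTC blank id lives in model.config.pad_token_id.

```python
def pad_ids_match(processor, model):
    """Return True when the tokenizer and the model config agree on the pad token id.

    Hypothetical helper: if the two ids differ, the decoder will likely leave
    [PAD] tokens in the predicted strings.
    """
    return processor.tokenizer.pad_token_id == model.config.pad_token_id


def save_model_and_processor(model, processor, output_dir):
    """Save the model and the processor into the same directory.

    Hypothetical helper: keeping both together means from_pretrained(output_dir)
    reloads the tokenizer settings that were used during training.
    """
    model.save_pretrained(output_dir)
    processor.save_pretrained(output_dir)


# Hypothetical usage with real transformers objects (placeholder path):
# save_model_and_processor(model, processor, "my-checkpoint-dir")
# processor = Wav2Vec2Processor.from_pretrained("my-checkpoint-dir")
# model = Wav2Vec2ForCTC.from_pretrained("my-checkpoint-dir")
# assert pad_ids_match(processor, model)
```

Running the check right after loading, before computing WER, should show quickly whether the processor and the model disagree about the pad token.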