Background
I followed the amazing blog Fine-Tune Whisper For Multilingual ASR with 🤗 Transformers to fine-tune Whisper on my dataset, and the performance is decent! However, my dataset is in Bahasa Indonesia and my use case is a helpline phone chatbot where users would only speak Bahasa, and I have seen some wrong predictions where the transcribed words are not Bahasa words at all. Whisper is trained on a multilingual dataset and has translation capabilities, neither of which I really need. This got me thinking about training a new byte-level BPE tokenizer on Bahasa words only.
Problem
After training with my new customized tokenizer, the new Whisper model predicts gibberish, and I am not sure how to debug it. Any help or directions would be greatly appreciated.
Code
Training Tokenizer
# training tokenizer
from tokenizers import ByteLevelBPETokenizer

tokenizer = ByteLevelBPETokenizer()
tokenizer.train(files=['indo_corpus.txt'],
                min_frequency=2)

# save vocab/merges so they can be loaded into WhisperTokenizer below
tokenizer.save_model('.', 'indo')  # writes indo-vocab.json and indo-merges.txt
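As a quick sanity check that the trained tokenizer handles Bahasa text, I round-trip a sentence through it (a minimal sketch; 'saya mau bertanya' is just an arbitrary Bahasa sample, not from my corpus):

# the trained tokenizer should round-trip Bahasa text cleanly
encoded = tokenizer.encode('saya mau bertanya')  # arbitrary sample sentence
print(encoded.tokens)                 # subword pieces
print(tokenizer.decode(encoded.ids))  # should reproduce the input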
# loading tokenizer
from transformers import WhisperFeatureExtractor, WhisperTokenizer, WhisperProcessor

feature_extractor = WhisperFeatureExtractor.from_pretrained("openai/whisper-small")
# keep the original tokenizer around so its special tokens can be copied over
old_tokenizer = WhisperTokenizer.from_pretrained("openai/whisper-small", language="indonesian", task="transcribe")

# build a WhisperTokenizer around the newly trained vocab/merges
tokenizer = WhisperTokenizer(vocab_file='indo-vocab.json',
                             merges_file='indo-merges.txt',
                             unk_token='',
                             bos_token='<|endoftext|>',
                             pad_token='<|endoftext|>',
                             model_max_length=1024,
                             language='indonesian', task='transcribe')

processor = WhisperProcessor.from_pretrained("openai/whisper-small", language="id", task="transcribe")

# copy Whisper's special tokens (<|startoftranscript|>, language tags, etc.)
# from the original tokenizer into the new one
tokenizer.add_special_tokens({
    'additional_special_tokens': old_tokenizer.special_tokens_map['additional_special_tokens']
})
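One consistency check I can think of (a minimal sketch, assuming the fine-tuned checkpoint lives in checkpoint-4000/ as in the inference code below): the model's embedding size should match the new tokenizer's vocabulary, and the special token ids should agree between the old and new tokenizers, since a mismatch here would plausibly produce gibberish.

# sketch: vocab sizes and special-token ids should line up between the
# tokenizer used for training and the one used at inference
from transformers import WhisperForConditionalGeneration

model = WhisperForConditionalGeneration.from_pretrained('checkpoint-4000/')
print(model.get_input_embeddings().num_embeddings, len(tokenizer))  # should match
print(old_tokenizer.convert_tokens_to_ids('<|startoftranscript|>'))
print(tokenizer.convert_tokens_to_ids('<|startoftranscript|>'))    # should match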
Inference
from transformers import pipeline
import numpy as np

pipe = pipeline(
    task='automatic-speech-recognition',
    model='checkpoint-4000/',
    tokenizer=tokenizer,
    device=0)

def transcribe(audio):
    # audio is a pydub AudioSegment, so slicing is in milliseconds;
    # chunk into 30 s pieces to match Whisper's input window
    max_duration_ms = 30000  # ms
    transcription = ''
    for i in range(len(audio) // max_duration_ms + 1):
        if i == len(audio) // max_duration_ms:
            sample_audio = audio[i * max_duration_ms:]
        else:
            sample_audio = audio[i * max_duration_ms: (i + 1) * max_duration_ms]
        # convert 16-bit PCM samples to float32 in [-1, 1]
        sound_array = np.array(sample_audio.get_array_of_samples(), dtype=np.float32) / 2**15
        text = pipe(sound_array)["text"]
        transcription += text
    return transcription
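For completeness, this is roughly how I call it (a minimal usage sketch; sample.wav is a placeholder path, and I convert to 16 kHz mono 16-bit since Whisper expects 16 kHz audio and the code above assumes 16-bit samples):

# example call; sample.wav is a placeholder file
from pydub import AudioSegment

audio = AudioSegment.from_wav('sample.wav')
audio = audio.set_frame_rate(16000).set_channels(1).set_sample_width(2)
print(transcribe(audio))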